Data Pipeline Frameworks: The Dream and the Reality | Beeswax

Video Statistics and Information

Captions
Hi everyone. Quick show of hands before we get into this: how many of you are using some framework for managing data pipelines? Okay. And how many are using shell scripts and cron? Okay, so this should be relevant to a lot of folks. I'm Mark Weiss, I'm a senior engineer at Beeswax, and we're an online advertising company here in New York. My background, really quickly: I've been in the ETL and data processing space pretty heavily for the last ten or twelve years, so I've seen a few different permutations of this problem space.

We're going to talk a little bit about Beeswax, what our business does, and how much data it generates (as Rick mentioned, it's very high volumes of data), and what our pipelines are doing right now. We'll drill into an example, and the idea is to show you, with two very different frameworks, what it actually looks like to try to build that example while leveraging what they promise and what we hope they'll give us. Those two frameworks are Airflow and AWS Data Pipeline. We'll hit the reality checks of what they do and don't do for us, then ask and answer the question: if they're not giving us all these things, is it really worth committing to a framework, and what do we get from it? And we'll wrap up by saying that when you put all this together and do the lifting you need to do around the framework, you're living the dream.

First, let's talk about Beeswax just a bit: what we do and our data flows. We're a programmatic advertising company. What that means is you go to a website or open an app, there's an ad slot there, and the publisher publishes a description of that ad slot. It gets sent to exchanges, who roll up millions of these a second and forward them on to buyers of ads like us. We're what's called a real-time bidder; that's our platform. Our customers are advertisers, and we buy ads for them. From the point of view of this talk, what's interesting is that this generates a lot of data. We're breathing in a firehose of millions of requests a second on the front of our system, and we have to respond to them all really quickly. As a side note, that's one of the cool problems we get to solve: very high-scale decisioning, very fast.

So this generates a lot of data. The bid requests are all these ad slots coming in. When we bid, we log that. When we win, that's called an impression, and we log that. Later on a user clicks, which is a different event, and maybe two or three weeks later they visit a website, which is yet another event. So we have all these asynchronous data streams coming into our system. Our main ETL right now is written in AWS Data Pipeline and lands data in Redshift. Like many folks here, probably, that's an architecture we're evolving past, but right now that's what happens. One of the challenges there is dealing with all the scale, and another is joining all these asynchronous streams; we also have to do some tricky things, like figure out that three weeks ago you saw an ad and now you're visiting a landing page, and that those two events are connected. So that's the current ETL.

This slide (it's a little out of order, sorry about that) shows the scale I was mentioning. We've got billions of auctions coming in and tens of billions of requests, we log all the requests, and we've got hundreds of millions of impressions.
This data isn't just for internal use; it's core to our product and our differentiation, so we have to handle all of it gracefully, correctly, and with low latency. The data has to be fresh, because reporting is key in this domain: customers need to understand how their campaigns are doing. And one of the differentiators of our business is that we deliver all this data back to our customers as part of a flat platform fee, so the data itself is a key part of the product.

The other set of pipelines is newer. That's the Airflow piece, which we started because we were disgruntled with Data Pipeline and looking for something better. There we're building a pretty typical use case that I hope some folks here can relate to: extracting data from Redshift, landing it in S3, and building a data lake over that which holds the data for a lot longer than we hold it in Redshift. We're using Hive for that.

One more slide showing the growth of the data and why we've been focusing on the pipelines, why we were looking for a better solution than Data Pipeline and a whole next-generation way to handle this. One reason, as I mentioned, is that the data is core to our business. It's strategic, we know we're going to need to keep investing in it, and it's differentiating, so we have to be nimble there and able to scale. The second reason is simply that the business has been scaling. This chart is this year, and it's basically our top-line metric of impressions; this is what we do, we buy ads. We were at roughly 3x at the start of the third quarter compared to the beginning of the year, and this is a seasonal business, so we know it's going to get even higher in Q4. It's a good problem to have, but we have to scale our pipelines.

So let's revisit the data lake example. This is basically what I had to do: decide what technology we should use to build this data lake. Again, really quickly: we need to extract the data from Redshift and land it in S3, which is one step in this pipeline, and then another step is to put a Hive table over that data.
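As a point of reference, here is a minimal sketch of what that two-step pipeline looks like expressed as an Airflow DAG, written against the Airflow 1.x-era API the talk describes. The DAG id, schedule, and the bodies of the two callables are hypothetical placeholders, not the actual Beeswax implementation.

```python
# Minimal sketch: two tasks, unload Redshift to S3 then build a Hive table.
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def unload_redshift_to_s3(**context):
    # Placeholder: run an UNLOAD ... TO 's3://...' statement against Redshift.
    pass


def create_hive_table(**context):
    # Placeholder: issue CREATE EXTERNAL TABLE ... LOCATION 's3://...' in Hive.
    pass


dag = DAG(
    dag_id="redshift_to_data_lake",      # hypothetical name
    start_date=datetime(2018, 1, 1),
    schedule_interval="@daily",
)

unload = PythonOperator(
    task_id="unload_redshift_to_s3",
    python_callable=unload_redshift_to_s3,
    provide_context=True,
    dag=dag,
)

build_hive_table = PythonOperator(
    task_id="create_hive_table",
    python_callable=create_hive_table,
    provide_context=True,
    dag=dag,
)

# The Hive table is only created after the unload has landed in S3.
unload >> build_hive_table
```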
So: Data Pipeline versus Airflow. I looked at a bunch of different choices and it boiled down to these two. Data Pipeline we'd already been running for two-plus years. We know how to run it, but the way we run it is that we built a lot of our own tooling around it. The APIs are pretty painful (even something as simple as figuring out whether a pipeline is already running requires some gymnastics), the UI is terrible, and the retry story is okay but not great. So it works and we know how to use it, but we weren't really happy with it; we felt like we were spending a lot of time on it, and it wasn't a way to scale going forward. So we landed on Airflow. As Rick mentioned, a lot of people are doing that. It has clearly gotten traction at this point, it's the shiny object right now, it made it into the Apache Incubator, and it's now the first hosted pipeline service on Google Cloud. It was the de facto other choice, so we said let's see what we can do with it.

This decision felt a little bit like the first joke of the presentation: Data Pipeline was like a typewriter. We know we can put paper in and hunt and peck and type something, and that's about all we can get out of it. We were hoping that, with a higher learning curve, the shiny new object would give us a much more powerful, general-purpose tool going forward.

So let's drill down now and build (this is executable code) the pipeline with these two choices and see where we land. Leaving aside the aspects that are general to how pipelines work and looking at our specific requirements for this pipeline, we need to do the two things I mentioned: unload the data from Redshift to S3, and put a Hive table over it. One of the promises you hope for from these frameworks, especially Airflow, which ships with dozens of canned operators (operators being the code that actually performs the steps in the pipeline), is code reuse: you hope you won't have to write this code yourself. So the first question I wanted to answer was whether that would turn out to be true.

Let's look at Airflow first. The first thing I saw was: oh look, it's open source, it's on GitHub, and there's exactly the operator we need, Redshift to S3. That seemed like a really promising start, but when we drilled into the code we saw problems right away, and that's going to be the theme of this part of the talk. For example, the way it handled credentials wasn't the way we wanted: there was no support for SSM, which is how we manage credentials everywhere and want to keep managing them. So right there in the guts of the code it's not doing things the way we want, and we'd be looking at forking it and managing our own version, something like that. Another example: the version of Airflow we were using at the time didn't even support passing an argument to control whether the extracted CSV would have headers or not. That's an extremely basic feature you're always going to want to control, and at the time it wasn't supported. That's just a small example of the brittleness. The strength of this model is that you have these reusable operators, you compose on top of them at a high, declarative level, and you can build out whole applications. The weakness is that passing arguments at runtime is the only way to change an operator's behavior, so if the parameter you need isn't exposed, you're stuck.

We hit similar and even worse problems with Data Pipeline. This is kind of cheating, because we'd already solved these problems by writing our own custom way of deploying and running Python shell activities in Data Pipeline, but let's pretend. If you look at what it offers out of the box, there's a visual builder UI, which is basically a non-starter because you're pasting your code in there, so that's not going to work for deploying versioned code. You can point it at scripts on S3, but then you're building your own tooling to extract from your repo and land the artifact in S3, so again you're not getting an off-the-shelf, out-of-the-box experience. Furthermore, it only gives you generic activities that don't implement anything you'd actually use; it's a way to run SQL, but you still have to write the SQL, so in that sense you're getting even less of an implementation than with Airflow. So we took a look at these frameworks, tried to do something with them, and the first thing we realized is that the operators they give us aren't giving us what we need.
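To make that concrete, here is a hedged sketch of the kind of custom operator those gaps push you toward: credentials pulled from SSM Parameter Store and the header behavior exposed as an explicit parameter. The class name, connection details, and UNLOAD template are illustrative assumptions, not Airflow's community operator or the actual Beeswax code.

```python
# Sketch only: names and SQL are illustrative; CSV/HEADER support in UNLOAD
# depends on your Redshift version.
import boto3
import psycopg2

from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class RedshiftUnloadToS3Operator(BaseOperator):
    """Unload a Redshift query to S3, with credentials pulled from SSM."""

    @apply_defaults
    def __init__(self, unload_query, s3_path, iam_role, ssm_password_path,
                 include_header=True, *args, **kwargs):
        super(RedshiftUnloadToS3Operator, self).__init__(*args, **kwargs)
        self.unload_query = unload_query
        self.s3_path = s3_path
        self.iam_role = iam_role
        self.ssm_password_path = ssm_password_path
        self.include_header = include_header

    def _connect(self):
        # Fetch the Redshift password from SSM Parameter Store (encrypted),
        # which is the credential-handling gap called out in the talk.
        ssm = boto3.client("ssm")
        password = ssm.get_parameter(
            Name=self.ssm_password_path, WithDecryption=True
        )["Parameter"]["Value"]
        return psycopg2.connect(
            host="redshift.example.internal",  # hypothetical endpoint
            port=5439, dbname="analytics", user="etl", password=password,
        )

    def execute(self, context):
        header_clause = "HEADER" if self.include_header else ""
        sql = "UNLOAD ('{q}') TO '{path}' IAM_ROLE '{role}' CSV {header}".format(
            q=self.unload_query, path=self.s3_path,
            role=self.iam_role, header=header_clause,
        )
        conn = self._connect()
        conn.autocommit = True  # UNLOAD writes directly to S3; no transaction needed
        try:
            conn.cursor().execute(sql)
        finally:
            conn.close()
```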
As I kept digging in, I realized this wasn't specific to these particular implementations; it's basically an attribute of how data pipeline frameworks and this model of building applications work, and it's going to be true no matter which framework you pick. Let me back that up with some more examples. Summarizing what we found initially: we couldn't use these operators the way we wanted because there were things we needed them to do that they either couldn't do or didn't do the way we wanted, the specifics being authentication, headers, deployment, and versioning.

Then I looked at the rest of how we build applications at Beeswax. We're a business that's several years old, we've been using Python in production for years, and Airflow is all Python, so the hope was that we'd have a seamless integration path. But the truth is that a big part of operating your applications in any cloud environment, or really anywhere, is everything else that makes them production-ready: how you do monitoring, how you do logging, how you handle errors. You want your pipelines to run the same way, ideally, because a general-purpose system for declaratively putting computation out in the cloud is a lot less valuable if you have to do everything differently just for this one thing. It's much nicer if you can operate it the way you run the rest of your stuff.

In this example, we always want to emit certain metrics to InfluxDB. We rely heavily on it for dashboards, for monitoring, for on-call support, all that kind of stuff, and we have metrics-driven alerts, so we need to be able to do that. There is some support for this kind of thing in Airflow with hooks: hooks let you wrap events, and you can write custom hooks, but you're limited to responding to the events you're allowed to respond to. You can't just put code exactly where you want it in the operator implementation. And once you're writing your own hooks, you're managing and maintaining your own code again, even though you were trying to reuse theirs, so you're not getting the benefit you wanted.

The second example is alerting, the PagerDuty example on the right. Say you want to emit a message right from an exception block, because that's where you have all the context: you can write a rich message describing exactly what failed and include the exception itself, the stack trace, all of that. You don't have that kind of hook in either of these implementations; you need to be writing your own code to have that kind of control.
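As an illustration of that point, here is a sketch of what owning the operator's execute() buys you: the failure is caught where the context lives, a rich alert goes out, and the exception is re-raised so normal retry handling still applies. It extends the hypothetical unload operator sketched earlier, and beeswax.alerts.notify_pagerduty is an assumed internal helper, not a real library call.

```python
# Sketch: alert from the exception block, where the operator's own parameters
# and the run context are in scope, then re-raise for Airflow to handle.
import traceback

from beeswax import alerts  # hypothetical internal alerting helper


class AlertingRedshiftUnloadOperator(RedshiftUnloadToS3Operator):

    def execute(self, context):
        try:
            return super(AlertingRedshiftUnloadOperator, self).execute(context)
        except Exception as exc:
            alerts.notify_pagerduty(
                summary="Redshift unload failed: {}".format(self.task_id),
                details={
                    "unload_query": self.unload_query,
                    "s3_path": self.s3_path,
                    "execution_date": str(context.get("execution_date")),
                    "exception": repr(exc),
                    "traceback": traceback.format_exc(),
                },
            )
            raise  # let Airflow's retry/failure handling proceed as usual
```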
And the last thing is that we want to import antigravity and fly. Joking aside, as I said, we have all this Python code and we've already solved lots of problems: we have our S3 clients, our other read clients, and so on, and we want to reuse all of that. We want to just write "import beeswax dot whatever" and use it. There's no way to do that at all with AWS Data Pipeline, and with Airflow you have a plugin API where you can wrap things and implement their plugin entry point, but then you're maintaining shims, you're maintaining a separate way of integrating your code, and you don't have a path to just build these applications like all the rest.

So that's the first reality check, and a rather long way around making the case: the more you look at this operator code, the more reasons you're going to find that it's not going to work for you. I would assert that if you're looking at this problem space and building out a data pipeline platform for your company, you should expect to implement your own operators, and in fact just embrace that. Again, this is why: this is code whose behavior you can only parameterize by passing arguments to it when it runs, so unless everything is exposed to you exactly the way you need it, you're not going to have the control you need.

We decided to take this problem head-on with a simple inheritance pattern. We put our own base operator over the Airflow base operator and wrapped up all the plumbing needed to make operators work with Airflow (picking up arguments the way it wants, passing them in, and so on) in our base class, so our users don't have to think about that and we know everything is going to work: Airflow will be able to trigger our operators. The nice thing about this is it also gives us a really nice two-level pattern for layering in the general services I was mentioning. For example, if we want the same metrics to be written by every operator, with the same dimensions tagged on them and the same semantics, we can put that in the base class, and now, for free, every operator we write emits the same metrics: simple things like task duration and success or failure counts. Likewise, if there are services we want to add to a specific operator that require its context, we can put them at that level. That's just one example of how to address this problem, and it's how we addressed it, but the idea is you should expect to address it in a way that works for your needs and with what you already have.
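Here is a minimal sketch of that two-level pattern. The class, metric names, tags, and the beeswax.metrics client are illustrative assumptions, not the actual Beeswax base operator.

```python
# Sketch: a company base operator over Airflow's BaseOperator that emits
# duration and success/failure metrics for every operator built on it.
import time

from airflow.models import BaseOperator

from beeswax import metrics  # hypothetical internal InfluxDB client


class BeeswaxBaseOperator(BaseOperator):

    def execute(self, context):
        start = time.time()
        tags = {"dag_id": self.dag_id, "task_id": self.task_id}
        try:
            result = self.run_task(context)
            metrics.increment("pipeline.task.success", tags=tags)
            return result
        except Exception:
            metrics.increment("pipeline.task.failure", tags=tags)
            raise
        finally:
            metrics.timing("pipeline.task.duration_seconds",
                           time.time() - start, tags=tags)

    def run_task(self, context):
        # Subclasses implement the actual work here instead of execute(),
        # so the cross-cutting plumbing above always runs.
        raise NotImplementedError
```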
The second example of where you're going to need to do the lifting yourself is packaging and deployment. Here's our example again as a refresher. The basic idea is that these pipelines are a lot more than just the DAG, which is the description of the steps that run start to finish (that's the little Wikipedia DAG picture there). Like the rest of your software, it's everything you need to do to put it out in production and run it somewhere for real. At a very high level, the pipeline model is: you've got the DAG, which describes the steps to execute; you've got the configuration, which passes the arguments in at the time it's invoked; and then all the rest of the code and dependencies it runs needs to live somewhere and run somewhere.

This is an extremely simplified picture of what that looks like with Airflow, or one approach to it. You have to figure out how to package your code, put it in a container if that's how you're going to run things, and put that on a host somewhere; that's the code that's going to run. With Airflow you're also responsible for putting the Airflow service itself somewhere, maybe on another host, maybe the same host in the simplest model where everything lives in one place. Airflow in fact has three processes you need to think about running: the scheduler, the web app, and the workers. The simplest model is all of those on one host, even alongside your code (that's not how we started, but you could do that). Even then, you also need a separate MySQL database that Airflow uses for scheduling state, and Redis for the task queue, which is where it puts the tasks the workers consume; that's how the pipeline actually runs. So you need all this infrastructure and configuration, you need to do all of this yourself, and then you need to keep it up and running.

AWS Data Pipeline, despite being a managed service by name, is much the same: as soon as you step outside the trivial things it supports fully managed, you're doing a lot of the same work yourself. You're figuring out how to package your code and put it out in a container somewhere (we do that in ECS right now), and you're running a separate process called Task Runner, an AWS application you configure to point at where your containers are. So what you get out of the managed service is largely the same as what you get from Airflow: a scheduler and a UI, and a painful UI at that. That's the second reality check: this is not serverless yet (maybe some other speakers will address that, or have), and it's not no-ops. You need to do all the same things you do for the rest of your applications, and you're going to want to do them the same way, because there's no value in doing them differently and a lot of value in having this thing run the way the rest of your stuff runs.

Okay, so I just spent the first chunk of this talk on all the things you're not going to get from the framework, with some specific examples. Is it worth using one of these? I would argue it is. Here are the high-level, fairly obvious reasons. This whole domain is generic; that's in fact the entire value and the entire point of it: a general-purpose way to put processes out in production and have them run, using a high-level, declarative means to do so. You want this general thing, and a lot of the aspects of a pipeline system are general and not differentiating for your business. Scheduling is the most obvious example: it's a problem that seems beguilingly easy, one you'll want to solve yourself, but you should not, because it has edge cases and it's a mature problem that has already been solved. Don't solve it; scheduling is the same for everyone, and daily is daily. There's also the idea of "oh, it's just a DAG library, I can do that, I learned that in school", but the full solution is a lot more than a fun little DAG library.

Another argument for picking a framework is just to pick one. What we've found as a fast-growing startup over the last several years is that there's a really obvious recurring pattern: you've got data here, you need to do some computation, you need to land it there. Basically anything that isn't a streaming use case or an online transactional use case (or a web app, which is a form of transactional use case) is this use case: a process needs to start up, do some work, and stop; we need to know whether it succeeded; and it probably depends on other processes, which is a whole separate problem. The point is you're going to rewrite this over and over again.
That's probably not high value, because the mechanism itself is generic. So just pick something, commit your engineering effort to that platform getting better and more useful across your organization and your services, and bake in all the aspects I was mentioning earlier: it can do logging the way you want, monitoring, alerting, security, all of that. It's worth picking something, and you do get a fair amount; again, it's all general stuff there's not a lot of value in building yourself.

You get scheduling. You get an API for defining DAGs, and we can argue about whether the Airflow way of doing it, in this almost-pure Python that doesn't exactly work like normal Python, is the best approach, or whether a visual approach is, or whatever; in the end it doesn't matter that much, because you have an API for doing it, and you don't have to design, build, and manage that code yourself. And even though you don't get the whole runtime environment, you get a target to hit. You don't have to figure out the architecture, because an architecture is basically given to you by whichever framework you pick: what config it needs, what environment variables, what's supported in terms of where it can run, what supporting services it needs. You've got the boxes and arrows already and you know what you need to build, so you're starting at a higher level, and if you hit that target it should work.

You also get pipeline semantics, to one degree or another. You get policy for what happens when one task fails: should the next one run, should it stop, should it wait, or can it start right away? Airflow especially has rich policy around this; you have a concurrency policy alongside the task-failure policy, and ways to define the behavior of the DAG. You get retries, and both Data Pipeline and Airflow handle retries pretty well. Airflow has very robust support for backfill, which has been one of the features I've been most pleased with: it does a really nice job of keeping track of all the runs that were ever supposed to happen, and if they fail it keeps trying to rerun them. It's also really nice in that you can stop it in the middle and it will pick up from where it left off. These are all things that are generic to how pipelines work, they're hard to get right, and they're done for you.
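These semantics mostly show up as declarative knobs on the DAG and its tasks. A small sketch using real Airflow parameters follows; the DAG id, task names, and values are illustrative, not a recommendation.

```python
# Illustrative values only: retries with a delay, ordering policy on failure
# (depends_on_past), a concurrency cap, and catchup for backfill.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

default_args = {
    "retries": 3,                          # rerun a failed task up to 3 times
    "retry_delay": timedelta(minutes=5),
    "depends_on_past": True,               # skip today's run if yesterday's failed
}

dag = DAG(
    dag_id="pipeline_semantics_example",   # hypothetical name
    start_date=datetime(2018, 1, 1),
    schedule_interval="@daily",
    default_args=default_args,
    max_active_runs=1,                     # only one DAG run in flight at a time
    catchup=True,                          # backfill any missed schedule intervals
)

extract = DummyOperator(task_id="extract", dag=dag)
load = DummyOperator(task_id="load", dag=dag)
extract >> load                            # load only runs after extract succeeds
```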
And the last thing is you get the UI. I should have shown you the Data Pipeline UI, but I didn't want everyone to run out of the room, so I'm just showing the Airflow UI. The idea is you get this nice UI that lets you look at status and manage things manually: you can retry things by hand, jump to the logs, all that sort of stuff. Airflow also has reports on how long things have been running, built right in. Again, this is a whole web app you don't want to build yourself, because it's just general to managing pipelines; you want someone to do it for you.

We're almost at the end here, so let's wrap up the idea: what do you get when you put all this together? You take the pieces you get from the framework, and you build out the other pieces you need, which are basically the actual code that's going to execute, doing things the way you want and integrating the rest of your code the way you want, plus some set of solutions for putting that out into production and running it. I was at a meetup recently, and I think one trend that's happening is that people are moving to a higher-level declarative abstraction around Airflow or another framework that encapsulates all of this: enabling data scientists or data engineers, the customers of the platform, to declaratively say "I want this pipeline to run, I want it to pull data from here, I want it to land data there", and that's it; then, boom, it runs. That's the path we're going down, and it's the path I've been on much of this year, on and off, figuring out how to wrap Airflow and get what we can out of it. If you get all the way there, it's really powerful, because now you've taken this DevOps idea of declaratively saying what end state you want in production, and you've moved up the stack to declaring a full application, not just infrastructure; it's infrastructure plus computation.

Even more so, with operators that are generic (they know how to, for example, talk to Redshift and run a query) and that are parameterized by config, you can compose new applications just by composing the operators in different ways and giving them different config, and you're not writing any new code. So you've got a very powerful framework. I would even argue it's more powerful than that, because so far this model of building applications has largely been focused on the data domain, but there's no reason to limit it to that: operators can do anything, and as I mentioned earlier, this pattern of writing batch-job glue code shows up all over the place. It's worth thinking about: if I have a platform that does all this for me and I can operate at this higher level, I can move much more quickly and not keep re-solving this problem, and I can apply it anyplace I need to do this kind of computation. So if you get all the way there, you're living the dream; it's unicorns and rainbows. That's the talk.
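To make the declarative-layer idea concrete, here is a small sketch of one way it could look: a pipeline described as plain config, with a tiny factory that composes already-written, parameterized operators into an Airflow DAG. The spec format, the registry, and the placeholder operator are assumptions for illustration, not an existing project or the Beeswax implementation.

```python
# Sketch of a declarative pipeline spec plus a factory that builds a DAG from
# it. In practice the registry would map to real, parameterized operators
# (like the custom unload operator sketched earlier); DummyOperator stands in
# here so the example is self-contained.
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

PIPELINE_SPEC = {
    "name": "bid_logs_to_data_lake",     # hypothetical pipeline
    "schedule": "@daily",
    "tasks": [
        {"id": "unload", "operator": "redshift_unload",
         "params": {"table": "bid_logs",
                    "s3_path": "s3://example-lake/bid_logs/"}},
        {"id": "hive_table", "operator": "create_hive_table",
         "params": {"location": "s3://example-lake/bid_logs/"},
         "after": ["unload"]},
    ],
}


class _PlaceholderOperator(DummyOperator):
    """Stand-in for a real operator; just records the params it was given."""

    def __init__(self, task_id, dag, **params):
        super(_PlaceholderOperator, self).__init__(task_id=task_id, dag=dag)
        self.spec_params = params


OPERATOR_REGISTRY = {
    "redshift_unload": _PlaceholderOperator,
    "create_hive_table": _PlaceholderOperator,
}


def build_dag(spec):
    dag = DAG(dag_id=spec["name"], start_date=datetime(2018, 1, 1),
              schedule_interval=spec["schedule"])
    tasks = {}
    for task in spec["tasks"]:
        operator_cls = OPERATOR_REGISTRY[task["operator"]]
        tasks[task["id"]] = operator_cls(task_id=task["id"], dag=dag,
                                         **task["params"])
    for task in spec["tasks"]:
        for upstream in task.get("after", []):
            tasks[upstream] >> tasks[task["id"]]
    return dag


# Airflow discovers DAGs as module-level globals, so bind the result to a name.
dag = build_dag(PIPELINE_SPEC)
```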
Q: Do you think the issues you ran into with the community operators are a sign that the approach is fundamentally the wrong one, or is it just that the project is immature and they haven't been built out yet? And if the latter, have you tried contributing back to the project, and what was that like?

A: It's a good question. You can make things work, right? You could take one of two paths with this sort of thing: you can find a way to make the black-box thing work for you and accept the compromises that requires, or, the path I felt was much more productive for us, take control of it and find the minimal shim to connect to Airflow, taking all the general stuff from the framework that I don't want to build while keeping control where we want control. It's the standard way to think about buy versus build. In terms of your question about contributing back: I did open an issue about the headers thing, and it was fixed in a later version, so to that degree, yes.

Q: You mentioned earlier using a declarative interface on top of Airflow. Do you have any examples of tooling out there that is trying to do that, so you don't have to write all the code yourself?

A: I don't know of any projects in wide use that are doing that yet. I was at a meetup recently (actually at Meetup itself) where they were talking about how they're handling this, and it was basically "here's a YAML file that describes my job", covering both the DAG and the database dependencies and where it was going to run. I do think that's the obvious direction this is going to go in, and should go in, and I know of friends at other, larger internet companies working on the same idea. The way I put it to people is that this is still sort of the assembly language of data pipelines. There are also some folks, whose names I don't know off the top of my head, writing languages that are essentially data pipeline languages; that's another way to put a higher-level abstraction on the same set of problems, baked right into a language, for example for accessing resources. But this idea of having to carefully assemble all the steps, step by step by step, does not feel like the end point to me at all.

Q: What made you go with Data Pipeline in the first place, and do you regret that decision now that you have to cut over to Airflow?

A: Now we're getting into my personal history at Beeswax. That decision predated me, and I regretted that it had ever been made very quickly. That said, we're engineers, and you can make things work. By the time I got there we had built out our own set of primitives to operate Data Pipeline in a reasonable way: we had wrappers for putting our code in a container and deploying it, we knew how to run Task Runner and almost never had to think about it, and we had wrappers around creating the pipeline itself so the scheduler knew about it, with all the config and dependencies wrapped up in the container. We had solved that. My point in the talk was more that the gap between what it promises and what it delivers is very similar for these two, even though one claims to be a managed service and the other is a new open-source project; you end up in a lot of the same places with both.

Q: Have you looked into AWS Glue at all, and would you consider it as an alternative to Airflow?

A: Interesting. I actually did look at it around this time and tried to build something with it. It felt immature in a couple of ways: the UI was painful, slow, and maybe a little unreliable, and we also ran into some race conditions relying on it, in terms of not really being able to read your own writes, or rather read your own DDL writes. It also felt very high level; the UI was like an extremely simplistic web Informatica, maybe. My take was that it was targeting more of an enterprise use case, whereas, again the same buy-versus-build question, we wanted a lot more control over the execution part of this problem.
Q: One more quick one: did you consider other task managers like Luigi or Oozie?

A: I did look at Luigi, just in terms of the features, and I liked that Airflow had a more complete feature set. This was at the point where I thought I was going to get everything out of it, and I still think that's roughly right, though I haven't done enough with Luigi to really say; that was my take on it at the time.

Q: Thanks for the talk. Just a design question: you decided to base off of the base operator. Why not extend the Redshift-to-S3 operator? You mentioned the security thing was a concern, but did you just dismiss everything else, or do you have other operators you want to work on?

A: That's a great question. The question was really between solving just that specific problem in the moment versus going for a more general solution. What I decided was that we already knew what we needed beyond that one step: a Hive operator for this pipeline, and down the road operators to spin up EMR clusters, run Spark, and other things. So I was looking for a general solution that would let us build out other operators, solve the problem of customizing all of them in certain ways with one pattern, and still let each of them customize itself specifically if it needed to. That's why I went with inheriting from the base operator and building out our own operators. It doesn't preclude us from using any canned operator that perfectly fits our needs (the DAG just constructs one of those and you're good), but it gets us out of this brittleness and into a place where we have the flexibility to move forward confidently. That's why I went with that approach.
Info
Channel: Data Council
Views: 22,476
Rating: 4.8771329 out of 5
Keywords: Data Pipeline Frameworks, data pipeline explained, data pipeline in aws, mark weiss beeswax
Id: C6Abv87D5dU
Length: 35min 33sec (2133 seconds)
Published: Wed Jan 02 2019