Functional Data Engineering - A Set of Best Practices | Lyft

Video Statistics and Information

Captions
My name is Maxime and I'm talking about functional data engineering, a set of best practices that are related to this functional approach. I'm using the term "functional" as in functional programming, and I'm going to get much more into that, but before I do I want to set a little bit of the context for this talk.

First, a little bit about me, though pretty much everything has been said: I work at Lyft, and before Lyft I was at Airbnb, Facebook, Yahoo and Ubisoft. While I was at Airbnb I started two open source projects that are now incubating at the Apache Software Foundation. The first one is Apache Airflow; it sounds like people in the room are already pretty familiar with it. Airflow is a workflow engine, a tool to help you author, schedule and monitor your data pipelines. It's largely inspired by something at Facebook called Dataswarm, and it has gotten super popular over the past few years. Since then I've been writing a lot more JavaScript, so it's rare that you see someone who writes a lot of JavaScript speak at a data engineering conference, but I guess it's happening now. The second project, Superset, is a data visualization platform: dashboarding, data exploration and all that good stuff. If you haven't heard about it, I urge you to take a look; it's getting really good. Superset has engineers from Airbnb, Lyft, Twitter and Apple collaborating, and the project is really taking off at the moment. (No good, we've got Apple asking me to do something with USB-C, and I will say: not now, thank you, not the right time.)

All right, cool. This talk today is about a blog post I wrote recently called "Functional Data Engineering: a modern paradigm for batch data processing", and it takes place in the context of a few blog posts I wrote over the past few years. The first blog post I wrote about data engineering was called "The Rise of the Data Engineer", and it was about trying to really understand what the data engineering role is, where it fits in relation to data science and data platform engineering, and where it fits in relation to historical jobs and titles like business intelligence engineer and data warehouse architect. If you're a data engineer and you haven't read that post, I would urge you to take a look and read it; I really tried to make it a kind of manifesto for what data engineering is. Then over the years I became a little bit more jaded about data engineering, and I realized it's sometimes thankless work to be a data engineer: the tooling is really hard, the space moves really fast, and the role is really underappreciated. Hopefully other people at the conference are going to talk about that and about solving these problems, but yes, I wrote all about the hardships of being a data engineer nowadays. Then more recently, as I said, I started writing a lot more JavaScript for Apache Superset, and there's this thing in the JavaScript community where people are getting really into the functional programming paradigm. So I drank the Kool-Aid, I got interested in functional programming, and I drew a lot of parallels between the best practices we used at places like Facebook and Airbnb and the functional programming paradigm, and that's what I'm about to talk about today. Some of that stuff is old news; I see some of my ex-colleagues from Airbnb and Facebook here, and for some of them some of this may sound like total
old news. There might be a little bit of a sense of "of course, you don't have to tell people this, everyone knows that stuff", but that's not always true. As I joined Lyft I realized that a lot of the best practices I took for granted at places like Facebook and Airbnb were not necessarily respected, or were not common knowledge. Also, on the road show for Airflow I've been visiting a lot of companies, talking about open source and talking about Airflow, and I realized that in those places these best practices were not common knowledge either. So I thought: let's write a blog post about this, use it as some sort of manifesto again, as an approach to doing things, and then maybe talk at conferences about it, and here I am.

Before I get into functional data engineering I want to get into functional programming. I'm sure a lot of people in the room are fairly familiar with object-oriented programming, that's what is taught in schools, and I'm sure pretty much all of the engineers in the room would be able to describe object-oriented programming. Functional programming is just an alternate approach; it's a different paradigm, a different way to think about structuring and organizing your code. I'm not going to read the whole definition from the Wikipedia page, because I didn't come here to read Wikipedia articles, but I'm going to raise a few points that I think are the most relevant, most interesting and most relatable to data engineering. I'll start by reading just one line from the Wikipedia article: "In functional code, the output value of a function depends only on the arguments that are passed to the function, so calling the function twice with the same input will result in the same output."

That brings me to the concept of a pure function. Pure functions in functional programming are functions that are limited to their own scope. That means that given the same input, a pure function will provide the same output. That's a nice guarantee: it's easier to reason about, there are no side effects, you can easily unit test these functions, and that brings clarity to the process. Just knowing that a function is limited to its scope is a nice guarantee to have.

Another concept in functional programming is immutability. Immutability is foundational to functional programming; the idea is that in pure functional languages you won't be able to reassign a variable. Variables are constants: once a variable has been assigned, you cannot reassign it. Once you create an object, that object is immutable; if you want to modify the object, you're going to have to somehow create a new object and assign it to a new constant. Because of that, it completely changes the way you code and the way you think about your code: if you see a variable that has been assigned, you know it will stay the same within that scope. I'll get into how this translates into some concepts in data engineering in just a moment.

Another key concept in functional programming is idempotency. Idempotence is the property of certain operations in mathematics and computer science that they can be applied multiple times without changing the result beyond the initial application. That means that if you have a function that is idempotent, you can run it one time, two times, n times, and you'll get to the same desired state with confidence. A simple, kind of stupid example of that would be a function that says "add a little bit of water to this glass": that is not an idempotent function, because if I keep running it, at some point the glass of water is going to spill. If instead I write a function that says "fill up the glass with water, and fill no more", that's an idempotent function that can run at any point in time and get to a desired state without spillage.
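To make that concrete, here is a minimal Python sketch of the same idea; the glass is just a hypothetical dict, not anything from the talk:

```python
GLASS_CAPACITY_ML = 250

def add_water(glass: dict, amount_ml: int = 50) -> dict:
    """Not idempotent: every call moves the state a little further (and eventually spills)."""
    return {**glass, "water_ml": glass["water_ml"] + amount_ml}

def fill_glass(glass: dict) -> dict:
    """Idempotent: running it once or n times converges to the same desired state."""
    return {**glass, "water_ml": GLASS_CAPACITY_ML}

glass = {"water_ml": 0}
assert fill_glass(fill_glass(glass)) == fill_glass(glass)  # same result, no matter how often
assert add_water(add_water(glass)) != add_water(glass)     # keeps drifting with every call
```

Note that both functions are also pure: they return a new dict instead of mutating their input, which is the immutability point from a moment ago.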
So now I'm going to talk about functional data engineering, and try to apply the concepts I just talked about and show how they map onto the data engineering world. When I say data engineering here, I'm mostly thinking about batch computation; you know, I'm a batch guy, because I wrote Airflow, or at least the first version of Airflow. I'm also thinking mostly from a data warehousing perspective. I know data engineering is not just batch, it's also streams, and it's not just data warehousing, it's all sorts of computation. The concepts I'm talking about today certainly do apply to data warehouse batch processing, and in many cases they will apply to streams and to other types of computation; they may or they may not.

The first thing I want to mention is one of the motives for the methodology I'm calling functional data engineering: reproducibility. Reproducibility is foundational to the scientific method. If you're a scientist and you make some great claims and write an article with outstanding results, your peers need to be able to reproduce your results, or science has not advanced by an inch. Reproducibility is also sometimes critical from a legal standpoint, and it might be critical from a sanity standpoint: if you're a data engineer and you run the same job today as you ran yesterday and get different results, you'll probably go a little bit insane over time. Maybe you're already a little bit insane if you're a data engineer anyway, but you don't want to make it worse. So this functional approach I'm talking about today is all about reproducibility, and kind of sanity too.

Immutable partitions. I talked a little bit about immutability being a key concept in functional programming. In the data warehouse, you can think of immutable partitions as the building blocks of your data warehouse: partitions become the foundation of your warehouse, the equivalent of immutable objects. Because you don't want to mutate anything, you add these partitions into your warehouse one at a time, and you probably have to partition all of your tables, because you don't want to mutate your tables, you don't want to change them every day. You want to partition essentially everything, and you want your ETL schedule to align with your partitioning scheme: if you're running daily jobs or hourly jobs, you should have daily partitions on pretty much all of your tables.
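As a concrete illustration, here is a sketch of what that building block looks like in practice, with made-up table and column names and the SQL kept as a Python string: each run writes exactly one date partition and replaces it wholesale instead of mutating rows in place.

```python
def overwrite_daily_partition(table: str, select_sql: str, ds: str) -> str:
    """Render a statement that replaces a single date partition atomically.

    Rerunning it for the same ds swaps the partition for an identical one,
    which is what makes the partition behave like an immutable object.
    """
    return f"""
        INSERT OVERWRITE TABLE {table} PARTITION (ds='{ds}')
        {select_sql}
    """

print(overwrite_daily_partition(
    table="fct_rides",  # hypothetical fact table
    select_sql="SELECT ride_id, user_id, fare FROM staging.rides_raw WHERE ds='2018-05-24'",
    ds="2018-05-24",
))
```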
When you think about data lineage, if I say the words "data lineage" and you close your eyes, you probably see a graph of tables; you think of your data lineage as this graph of tables. But when you use partitioning, you can start thinking of data lineage as a graph of partitions, so the mental model is extended a little bit. Take any given row, say a row in a sales aggregation table: you can easily attach it to the partition it's in, and for any partition you can infer its data lineage, and therefore you get provenance, traceability, reproducibility and all that good stuff.

Now, pure ETL tasks. Earlier I mentioned pure functions in functional programming; pure ETL tasks are simply tasks that are idempotent. If you rerun them, you're safe: you know you're going to get to the desired state. This is great in distributed systems. Say you have an Airflow cluster, or any form of large distributed computation taking place: if anything fails halfway and you don't know whether it succeeded, ran or failed, you know that you can always get back to the state you want by rerunning the idempotent task. Pure ETL tasks are deterministic: given the same source partitions, they will produce the same target partition. They have no side effects, they use immutable sources, and they usually target a single partition, so they're easy to reason about. That also means you're never doing UPDATE, UPSERT, APPEND or DELETE; you're essentially inserting new partitions all the time, or you are insert-overwriting partitions. Of course that doesn't mean you should never do those kinds of operations, but if you want a pure, functional kind of data warehouse, you probably want to always insert-overwrite a new object or a new partition. In general you also probably want to limit the number of source partitions that you're scanning, just for the sake of simplicity and to keep your computation units fairly simple; I'll talk a bit more about that later.

One thing I want to talk about is this idea of a persistent staging area. The term "staging area" is not new; you'll find it in any of the books about data warehousing from fifteen or twenty years ago. The staging area is the place through which all of the data that makes it into your data warehouse passes at some point, and usually the staging area is not transformed at all: you bring the raw ingredients from your external sources into your warehouse pretty much untransformed. In the old books about data warehousing they would debate whether you should have a persistent or a transient staging area, and nowadays, with pretty much infinite cheap storage, I would argue that in almost all cases you want a persistent staging area. That means any time you bring data from external systems, you load it into your warehouse untransformed and leave it there forever, essentially. There might be some exceptions: you might have some very high volume, low value data that you want to compress in some way, but in general I would argue to bring the data into your warehouse, make it immutable, put it into a read-optimized file format like Parquet or ORC, and leave it there forever. Knowing that the raw data in your staging area is the raw ingredients, and that you have all of your transformations, the recipe, you can rebuild the whole warehouse from your raw data and your computation at any point in time, and knowing how common backfills are, that's a really nice assumption to have.
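Putting pure ETL tasks and the persistent staging area together, here is a minimal sketch of what such a task might look like as an Airflow 1.x-style DAG; the table names and the SELECT are made up for illustration. The task is parameterized only by the schedule's {{ ds }}, reads only the matching immutable staging partition, and insert-overwrites a single target partition, so it can be rerun or backfilled safely.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.hive_operator import HiveOperator

dag = DAG(
    dag_id="load_agg_rides_daily",
    start_date=datetime(2018, 1, 1),
    schedule_interval="@daily",
)

# A pure task: same source partition in, same target partition out, no side effects
# beyond the single INSERT OVERWRITE, so rerunning it for a given ds is always safe.
load_agg_rides_daily = HiveOperator(
    task_id="load_agg_rides_daily",
    hql="""
        INSERT OVERWRITE TABLE agg_rides_daily PARTITION (ds='{{ ds }}')
        SELECT city, COUNT(*) AS nb_rides, SUM(fare) AS total_fare
        FROM staging.rides_raw        -- persistent, untransformed staging table
        WHERE ds = '{{ ds }}'
        GROUP BY city
    """,
    dag=dag,
)
```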
Cool. So now I'm going to get into a set of common challenges and solutions: these are common challenges around batch data processing and data warehousing in general, and the solutions that fall out of this functional approach to data engineering. Bear with me here, I'm about to get into something a little bit complicated: this idea of slowly changing dimensions. Who here has heard of, or would say they know pretty well, what slowly changing dimensions are all about? Yeah, that works; if you've read the Wikipedia article, that totally counts too. In the old books about data warehousing, Ralph Kimball, in The Data Warehouse Lifecycle Toolkit I believe, wrote about it originally. Slowly changing dimensions come out of dimensional modeling: if you're going to use fact tables and dimension tables to model your business data, you will typically model your business entities as dimensions, and the attributes of your dimension members typically change slowly over time. The question is how you model that in the data warehouse. On these images I have some antiquated drag-and-drop ETL tools, where people would drag and drop their logic as opposed to writing code (I don't know how they would check that into GitHub), but those were the early days of ETL and data warehousing, and people would drag and drop stuff to try to get to these slowly changing dimension types, which I'm going to get into. So today is not only about functional data engineering, it's also a little bit of a primer on the antiquated approaches to slowly changing dimensions and on how to model change in your dimensions.

This is taken out of the Wikipedia page about slowly changing dimensions; it's a very simple example that illustrates the different approaches. I know this is really small and maybe you can't read it very well, so I'll try to read it for you. The first approach, slowly changing dimension type 1, is where you simply overwrite the data. Here we're looking at a supplier dimension: picture a big business with suppliers, where you structure all of your suppliers into a dim_supplier dimension, and then one of your suppliers moves: the ABC Acme supplier moves from Illinois to California. The first approach is just to pretend this supplier was always in California: we just update the dimension table, we update that cell, and now this supplier was always in California. Of course, if you're doing your taxes, that's probably not going to fly very well, so someone is going to say we need to keep track of that history.

The second approach you'd read about in the books, type 2, is to add a new row. Given the supplier code, ABC, which is the natural key, you would create a new surrogate key, 124, and you would start managing effective dates in your dimension table, saying "for this time period, this is the state of my supplier". Not only do you need to do extra management of this dimension, which is slightly complicated and full of mutations, but when you load your fact tables you need to do what they call a surrogate key lookup, which is kind of complex and expensive, and your fact table ends up filled with surrogate keys that are slightly unreadable. (I need to make sure this computer doesn't go into sleep mode again.) So you have to do these surrogate key lookups, which are fairly complicated and full of mutations.
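To make the type 2 cost concrete, a surrogate key lookup typically looks something like the sketch below (hypothetical table and column names, SQL kept as a Python string): every incoming fact has to be joined against the dimension's effective-date ranges just to resolve the surrogate key it should carry.

```python
# The "antiquated" pattern being described: facts cannot be loaded without first
# resolving surrogate keys against the mutating type 2 dimension.
SURROGATE_KEY_LOOKUP = """
    INSERT OVERWRITE TABLE fct_orders PARTITION (ds='{ds}')
    SELECT
        d.supplier_key,        -- surrogate key resolved by the lookup
        s.order_id,
        s.order_amount
    FROM staging.orders_raw s
    JOIN dim_supplier d
      ON s.supplier_code = d.supplier_code   -- natural key
     AND s.order_date >= d.start_date        -- effective-date range match
     AND s.order_date <  d.end_date
    WHERE s.ds = '{ds}'
"""

print(SURROGATE_KEY_LOOKUP.format(ds="2018-05-24"))
```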
The type 3 approach is kind of a half-assed approach, where you create a new column and keep some version of the history, or whatever you think is important about a change over time.

These approaches have a lot of shortcomings. The type 1 approach is full of mutation and you lose history: if you run the same query today and yesterday, you're going to get different results, and there's no way to know where that supplier was in the past. In a lot of cases that's not what you need, and even if you think it's what you need now, it might not be what you need in the future, and then you're going to have to go back, remodel stuff, reload your data and mutate your data, and that's no good. Type 2 is effectively super hard to manage: it makes loading your dimensions harder, it makes loading your facts harder, and it kind of forces you into a lot of upserts and complex ETL. It's also not a given that you're going to land back on your feet: if you were to, say, wipe the dimension table and try to get back to the state you were in, you might not get exactly there, unfortunately. Type 3 is kind of a bad compromise, and over time people came up with slowly changing dimension type 4, type 6, and so on; people didn't know what to invent anymore, and these are all kind of the same nonsense, or composite versions of that nonsense.

So what is the functional approach to dealing with changes in your dimensions? It's super simple: just snapshot all the data. That means that for dim_supplier, for each day, for each one of your ETL schedules, you create a full snapshot of the dimension table as it was that day. If you have three or ten years' worth of data, you'd have thousands or tens of thousands of copies of duplicated data in your dimensions. That sounds awful, right? You might say: "wait, Max, you're duplicating this data; it's slowly changing, yet you store the whole thing every day, that's kind of insane." Well, there are a few things I want to say that mitigate this approach. First, storage is cheap and compute is cheap; there's virtually no limit there. Then there's the fact that dimensional data, in relation to facts, is usually very small and simple. Say your company has a hundred suppliers, or a thousand suppliers; even if your company has a million suppliers, taking snapshots of a million rows nowadays in BigQuery, Presto, Impala or Hive is a drop in the bucket, really. And there's also the point that storage is cheap while engineering time is expensive: this mental model is a lot easier to reason about, and it gives you good reproducibility, which is invaluable. These are trade-offs you might decide to make or not make, but this is kind of the best we have, and it happens to work very well at places like Facebook and Airbnb.

To prove my point that the mental model is easy: if you want to run a query joining your fact to your dimension, all you need to do is join on the key and force the latest partition. It's pretty common now to have macros in your SQL tools that allow you to do things like that. Another approach is to maintain a view, say one called dim_supplier_current, that always points to the latest partition of your supplier dimension. Problem solved.
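Here is a minimal sketch of both patterns, with hypothetical table and column names and Presto-flavored SQL kept as Python strings; the scalar subquery stands in for whatever "latest partition" macro your SQL tooling provides.

```python
# Join a fact to the latest snapshot of the dimension.
LATEST_SNAPSHOT_JOIN = """
    SELECT f.order_id, f.order_amount, d.state
    FROM fct_orders f
    JOIN dim_supplier d
      ON d.supplier_code = f.supplier_code
    WHERE d.ds = (SELECT MAX(ds) FROM dim_supplier)  -- what a "latest partition" macro expands to
"""

# Or hide the latest-partition logic behind a view.
CURRENT_VIEW = """
    CREATE VIEW dim_supplier_current AS
    SELECT *
    FROM dim_supplier
    WHERE ds = (SELECT MAX(ds) FROM dim_supplier)
"""
```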
Now, if you're interested in the attributes of the dimension member at the time of the transaction, what you need to do is simply join on the key and join on the partitioning date key. Problem solved. There's also a nice byproduct of snapshotting your dimensions: you can now do time series analysis against your dimensional snapshots. If you want to count how many suppliers you've had over time, you can just do a "select count star from dim_supplier group by date"; there's a whole new family of queries that was perhaps not that easy with slowly changing dimension type 2 and that is fairly easy to do now. So if you ever hear about slowly changing dimensions again, you can just say "all that stuff is obsolete, I heard about it at a conference, let me point you to an article", and done, you can wash it off.

Now I'm going to talk about late-arriving facts, and keep an eye on my watch too. Late-arriving facts are common; I think the people at Apache Beam have been beating that horse for a while. One thing that's important about late-arriving facts is that you essentially need two time dimensions in your data warehouse: if you have late-arriving facts, you need event time and event processing time. If you care about immutability, about landing your data in your staging area and never touching it again, compressing it and indexing it into ORC or Parquet files, you need to close the loop, so you essentially have to partition on event processing time. If you were to partition based on event time, you would have to wait for the window to close; it would take longer and you would have to hold more data in memory. So as much as you can, partition on your event processing time so you can close the loop.

Partition pruning might be lost, though. Partition pruning is this thing that database optimizers do where, if you apply a predicate like a date filter on a partition field, the database only scans a subset of the partitions. Knowing that most queries that analysts, data scientists and people in general will fire will be predicated on event time, and that you are now partitioned on event processing time, you might not be able to do partition pruning anymore; that's what you lose. There are ways to mitigate this. First, you can somewhat rely on execution engine optimizations: if you're using Parquet or ORC files, the execution engine, as it hits a certain Parquet block, reads the footer and sees that for the event time you're looking for there's really nothing in that block, so the damage is not that bad; you're scanning more partitions, but in a lot of cases you're effectively just scanning more Parquet footers and skipping the blocks. You can also instruct people to apply predicates on partition fields; knowing your analysts and data scientists, that probably won't happen, but you can always hope that it may happen. You can also sub-partition by event time: if you know you want to partition by event processing time, because it's important to close the loop, you could sub-partition on event time, and then you get maybe the best of both worlds, but more files in your warehouse; that's a trade-off.
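Here is a sketch of what those two time dimensions might look like, with hypothetical Hive-flavored DDL and a query kept as Python strings: the table is partitioned by processing date, so the loop can be closed as soon as data lands, and sub-partitioned by event date so that event-time predicates can still prune.

```python
CREATE_FCT_EVENTS = """
    CREATE TABLE fct_events (
        event_id  STRING,
        user_id   STRING,
        event_ts  TIMESTAMP   -- full event time kept as a regular column
    )
    PARTITIONED BY (
        ds        STRING,     -- event *processing* date: closes the loop quickly
        ds_event  STRING      -- event date sub-partition, kept for pruning
    )
    STORED AS PARQUET
"""

# An event-time query that still benefits from partition pruning via the sub-partition.
COUNT_EVENTS_FOR_DAY = """
    SELECT COUNT(*)
    FROM fct_events
    WHERE ds_event = '2018-05-20'   -- prunes to the matching sub-partitions
      AND ds >= '2018-05-20'        -- late arrivals can only land on or after the event date
"""
```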
And if you have a lot of time-predicated queries, you could also pivot: maybe your staging area is partitioned by event processing time, but later on in your ETL you rebuild a certain window of data that you know might change, and you repartition the data by event time in those fact tables.

All right, the next topic I want to talk about is self- or past-dependencies, and to say that those should be avoided, if not at all times, then whenever you can. So what are self- or past-dependencies? Let's say we agree that you should snapshot your dimension data, as I said earlier: you create a new partition every day for your dimensions. One thing that might seem really natural would be to take yesterday's dim_supplier, apply the set of changes or operations that happened today, and use that to load the next partition. The point I want to make here is around this thing I just kind of made up called a complexity score: a proxy for the complexity of your ETL might be how many partitions were necessary to compute a given partition. If you're loading dim_user from your database scrape, you might be scanning just a handful of partitions, and that's really easy to recreate, reason about and parallelize. If you use dim_user to create dim_user, then if you want to backfill, you're going to have to do it sequentially and go very far into the past, and your complexity score for that partition goes through the roof.

One of the reasons people might want self- or past-dependencies in their ETL is cumulative metrics or time-windowed metrics. At Lyft, for example, it might be really great to have the total number of rides for a customer in the customer table; that's super useful. But I would argue that, though that metric is useful, maybe it should not live in a dimension. Should metrics live in dimensions? Perhaps, sometimes. But if you're going to compute it, I would in general say: please make sure you don't do a self- or past-dependency, and rely instead on a specialized framework that is good, optimized and efficient at computing cumulative or windowed metrics. Unfortunately it's out of the scope of this talk for me to describe such a specialized framework; I know these things are fairly common, I should probably blog about it at some point, and I'll be at office hours in the next hour if people want to talk about this topic specifically.
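This is not the specialized framework mentioned above, but a minimal sketch of the underlying idea, with hypothetical table and column names: derive the cumulative metric from the immutable fact partitions on every run, so that no partition ever depends on a previous version of the same table.

```python
# Lineage is "all fact partitions up to ds", never "yesterday's copy of this table",
# so any partition can be backfilled independently and in parallel.
CUMULATIVE_RIDES = """
    INSERT OVERWRITE TABLE agg_user_rides PARTITION (ds='{ds}')
    SELECT
        user_id,
        COUNT(*) AS total_rides
    FROM fct_rides
    WHERE ds <= '{ds}'
    GROUP BY user_id
"""

print(CUMULATIVE_RIDES.format(ds="2018-05-24"))
```

The trade-off is a wide scan over many fact partitions instead of a self-dependency, which is exactly the kind of job a dedicated cumulative-computation framework would optimize.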
Now I want to talk about file explosion. At Airbnb we were partitioning everything, and sometimes sub-partitioning things, and that leads to a lot of files in HDFS or in S3 or whatever storage you use for your data warehouse. Each partition is at least one file, and if you partition the hell out of everything, your Hadoop NameNode, or HDFS, or S3 somehow, will suffer at some point because of this explosion of files. There are some ways to mitigate this. One is to be careful around sub-partitioning, and careful about having short schedule intervals: hourly is probably okay, but with a five-minute window you start to get a lot of files. There's also the idea that you could compact earlier partitions: if you have multiple years' worth of data and you partition by day, then for 2010, 2011, maybe all the way to 2016, you can assemble those partitions into fewer partitions. I'm not going to get much deeper into this, but it's something to keep in mind, because it's a byproduct of this approach that's somewhat undesirable, or that needs to be handled or mitigated.

I'm already getting into the conclusion portion of this, and I've got two points I want to make. The first is that times have changed a lot since the original books on data warehousing were written. The technology landscape has completely changed, and we need someone to go and rewrite those books. I like to write code, I don't like to write books, so I'm not going to rewrite them, but I want to briefly mention some of the core things that have changed, things that really change the methodology of how a data warehouse should be built. One is that we have cheap, limitless storage and distributed databases with virtually infinite compute. We have seen the rise of read-optimized stores and immutable file formats. At the time these books were written, people had Oracle databases, stores that are effectively super mutable: you can go and change any cell, you have transactions and all that stuff. Nowadays, if you're using Presto, Hive, Impala, BigQuery, Athena, whatever you use, the data is stored in immutable segments that are indexed, encoded, compressed and columnar, and the practice of going in and updating stuff and mutating data is not as easy anymore; but in return you get these read-optimized stores that are really good at data warehousing. Another thing that has changed is that when these books were written, you had small teams of highly specialized data professionals building the warehouse for the company. I think that's not true anymore: everyone is taking part in the computation and data warehousing process, at least at the companies I've worked at over the past five or six years, and everyone is welcome to use, create, change and shape the future of the data warehouse and the data in the company.

The last point I want to make is: first learn the rules, and then break them. With any methodology, whether it's object-oriented programming, functional programming, or this functional data engineering thing I'm talking about, it's good to know the rules, but I'm sure in your environments you have all sorts of reasons to make up your own rules and break these ones, and as long as you know which rules you're breaking and why, that's usually a good thing to do. So thank you everyone, that's all. I'll be at the office hours next door.

We have time for questions.

Q: Hi, thanks, really nice talk. I really like the perspective of taking a full snapshot of your dimensions every time, but the question I have about that is: if you do that, don't you run the risk of turning what could be a small data problem that fits in a Postgres warehouse into a big data problem? This is how you get into big data, right, by generating data you don't need.

A: I would say it's a no-brainer for all of the very small dimensions: dim_supplier, if it's a hundred or a thousand rows a day, just don't bother, don't do any complex modeling, just snapshot your dimensions. If you work at Facebook and you're dealing with the user dimension, you might want to rethink that a little bit, because you have billions of rows
that you're duplicating every day. Though I've seen it done: when it becomes the practice, people are so used to that mental model and to the uniformity, and so used to working with Hive and Parquet and the metastore, that they just move forward with it. That doesn't mean that for larger dimensions you cannot do a mix of techniques. What you could do is say: for my larger dimension, I want to apply some of the concepts of the type 2 dimension, or I might want to have lower retention for these tables, or I might want to do what I would call vertical partitioning, which is trying, as much as possible, to move the fields that don't really need to live in your dimension into some sort of fact table, like a user-metrics-type fact table. So I would say the way to mitigate this is to keep your dimension table a little thinner, with fewer fields, to model your data differently, and maybe, in the cases where it makes the most sense, to go with a type 2 approach.

Q: I really like your approach to the persistent staging area, but let's say, hypothetically, a regulator in Europe decides that people can request to be forgotten. How do you deal with that situation?

A: You can come and work at Lyft, where we don't really have a Europe branch just yet, so we don't really have to deal with that right now. Or maybe we do, I don't know, I'm not really good at legal things. I think you probably at some point need to anonymize your data; I'm not sure if it's enough to anonymize your data. I know it's common for companies to have a retention and anonymization framework. That means you would add metadata to your table properties saying "this table contains non-anonymized data and needs to be anonymized", and you can have a framework that knows about the fields it needs to encrypt or hash, and then a background process, kind of like a daemon, that comes in, looks at the tables, and folds partitions onto themselves: it moves each immutable partition into an anonymized equivalent of that partition, works in the background, and tries to respect all the rules. That's the best I can explain here; I don't know what the legal constraints are exactly, but I know it's common to have these anonymization frameworks.

Q: Just one last question, about idempotency. It's really important in pipelines to do things once. Do you have any comments on idempotency in email pipelines? For instance, you build an aggregation, you build a report, and then you need to hit a mailing list, and you want to make sure somehow that you only send it once. Real-life story: we were sending out emails, a few thousand a day, and we had a problem with our OpenTSDB client, just accessing it. So after the email was sent out there was a failure on the TSDB metrics, and as a result the job was failing in our version of Airflow, and we kept re-running it and sending multiple emails.

A: Exactly, yeah. One thing is that when you're sending an email to someone, that is not an idempotent operation, so maybe the function that sends the email itself needs to have some sort of memory, a UUID, some sort of unique identifier. Now I'm not talking about an idempotent data warehouse function, I'm talking about an idempotent email-sending function
that would say something like: did I ever send this email before, with this UUID? And if so, do not send it again. So the idempotent function is "send this email if it was never sent before", and somehow you need to keep track, in some sort of memory, of what you have or haven't sent in the past.
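A minimal sketch of that idea in Python; the sent-log here is an in-memory set standing in for whatever durable store would actually be used, and the delivery call is a hypothetical placeholder.

```python
import hashlib

_sent_log = set()  # stand-in for a durable store (a database table, a key-value store, ...)

def email_uid(recipient: str, report_date: str, template: str) -> str:
    """Deterministic identifier: the same report to the same person always maps to the same UID."""
    return hashlib.sha256(f"{template}:{report_date}:{recipient}".encode()).hexdigest()

def _deliver(recipient: str, body: str) -> None:
    """Hypothetical, non-idempotent delivery call (the part that actually sends the email)."""
    print(f"sending to {recipient}: {body[:40]}...")

def send_email_once(recipient: str, report_date: str, template: str, body: str) -> bool:
    """Idempotent wrapper: send only if this exact email was never sent before."""
    uid = email_uid(recipient, report_date, template)
    if uid in _sent_log:
        return False              # already sent; rerunning the job causes no duplicate email
    _deliver(recipient, body)
    _sent_log.add(uid)            # record only after a successful delivery
    return True
```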
Info
Channel: Data Council
Views: 44,542
Rating: 4.9262471 out of 5
Keywords: functional programming, data engineering, ETL, maxime beauchemin, lyft data engineering, functional data engineering, data eng conf, batch processing, batch data processing, functional data engineering best practices, pure functions, pure functional programming, functional programming immutability, functional programming idempotency, idempotency, batch computing, batch computations, immutable partition, pure ETL tasks, max beauchemin, maxime beauchemin lyft
Id: 4Spo2QRTz1k
Length: 39min 43sec (2383 seconds)
Published: Thu May 24 2018