DC_THURS on Data Lineage w/ Julien LeDem (Datakin)

Captions
Welcome back, everyone, to DC_THURS. I'm Pete Soderling, founder of Data Council and Data Community Fund, and we have another amazing guest lined up today in our series of interesting people who have been foundationally instrumental in building the data ecosystem. Today I'm excited to have Julien Le Dem with us. He's the co-founder and CTO of Datakin, but Julien has also been involved in some instrumental projects that we all know and most of us use: he's the co-author of Parquet, he's been deeply involved in the development of Apache Arrow as a PMC member, and previously he worked at WeWork, at Dremio as a staff software engineer, at Twitter as a principal engineer, and at Yahoo, and he studied math, physics, and CS in France. Julien, I want to welcome you to the show and thank you for being on with us today.

Well, thanks for having me.

So the first question I wanted to ask you, and I hope this is okay: you've been in the industry for so long — I think you came to the US maybe in 2006 or 2007, is that about right?

2007, yeah.

2007. You've been involved in so many instrumental projects; I was just wondering why it's taken you so long to start a company in the data world.

Well, it's funny — it took a long time, but at the same time it feels like everything I've done so far led into this: building a data platform, building open source projects, creating communities, becoming part of this data community and understanding it better. It all led to the point of solving how we depend on each other and how we operate data platforms and data pipelines at scale. And when I say at scale, it's not just big-data scale, lots of data; it's today's organizational scale, where you have lots of teams and people depending on each other, producing and consuming data. So I feel like a lot of the things I've done in the past led to this point, and it's because I've been involved in open source projects, started some, and helped grow others, that I got here.

It's definitely a blessing when you look back over your shoulder and see that your career has been a series of steps that joined into an apparently productive arc. I know the feeling, and I'm glad you have that sense as well. So let's talk about some of those key moments, because I find in my conversations with founders that it's often about understanding what the key moments and key insights were along the way that led them to where they ended up. We're going to have some great discussion later about OpenLineage and metadata, among other things, but you have such an interesting journey through companies that were significant in the data space that I want to stop along the way and chat about some of those experiences. Maybe we can start at Yahoo, because I think Yahoo was likely your first meaningful data job, if I'm not mistaken.

Yeah, that's correct. The company I was working for back in the day got acquired by Yahoo, and so I ended up working for Yahoo. In 2007 I moved to the headquarters and joined the team that was building platforms on top of Hadoop for the more media-oriented properties of Yahoo.
Yahoo Shopping, Yahoo News, Yahoo Movies, Yahoo Autos — a lot of those. We did a lot of collecting: listings, news articles, used-car listings, things like that. So there was a lot of text processing — matching products to catalogs, extracting entities from news articles — and bringing structure to all those listings, articles, and movie descriptions. That structure is what drives guided navigation, the kind you would find in a shopping comparison engine, for example. So we were building generic platforms for those kinds of properties on top of Hadoop, and of course at the time Yahoo was the place to be if you wanted to use Hadoop — that was really the beginning of Hadoop. I was in the same building as the Hadoop team at some point, and I remember seeing Hadoop team members wearing a t-shirt that said on the back, "Scaling Hadoop to 20 nodes." That was the up-and-to-the-right goal: 20 nodes. Those were the days. It was probably already an old t-shirt by then, and by the end of those four years it was thousands of nodes.

At the time we were working out how to do batch processing, transformation, and enrichment on top of Hadoop, and you end up with many teams consuming and producing data — different properties and different teams depending on each other. You start thinking about how to apply service-oriented architecture principles to this kind of batch processing, because now the interface is not a service API; the interface is a dataset. And the datasets at the time were shared in a distributed file system — the Hadoop Distributed File System — so there's a folder somewhere in a shared file system that is shared with some other team. If that's the only thing you do — people write files to the file system and share them with others — it becomes a big mess. You need ways to coordinate, to trigger downstream consumption when a dataset is updated, all the mechanisms we're now more used to with workflow engines. For me that was the beginning of thinking about how to define those service-oriented architecture principles: what are the interfaces, how do we depend on each other, how do we trigger downstream transformations, and how are we aware of the age of a dataset? That's the kind of thinking that got all of this started.
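A minimal sketch of the kind of coordination he's describing, assuming a marker-file convention such as the `_SUCCESS` files that Hadoop-era jobs commonly write when a partition is complete; the paths and function names here are hypothetical, not anything named in the interview:

```python
import time
from pathlib import Path

def dataset_is_ready(partition_dir: str) -> bool:
    """Treat a dataset partition as published once the producing team has
    dropped a _SUCCESS marker file into it (a common Hadoop-era convention)."""
    return (Path(partition_dir) / "_SUCCESS").exists()

def wait_then_run(partition_dir: str, run_downstream_job, poll_seconds: int = 60):
    """Poll the shared file system and trigger the downstream transformation
    only once the upstream dataset for this partition is complete."""
    while not dataset_is_ready(partition_dir):
        time.sleep(poll_seconds)
    run_downstream_job(partition_dir)

# Example: a consumer waiting on an hourly partition produced by another team.
# wait_then_run("/data/shopping/listings/2007-06-01T09", my_enrichment_job)
```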
That's also when I started contributing to Pig, because at the time we were still writing raw MapReduce jobs, and Pig was a better MapReduce. Pig has lost momentum since then — that's the niche Spark took over; now Spark is the better MapReduce and it has outlasted every other competitor in that space — but at the time Pig was big and there was a strong community around it, in particular Yahoo, who started the project, plus Netflix, LinkedIn, and Twitter. And that's what led me to my next job: Pig committer was my first committership on an Apache project.

Because you went to Twitter next, right? You worked on the data team at Twitter, in maybe 2012 or so?

2011 — the end of 2011. And it was through this community: through the Pig community I met the data teams at Netflix, LinkedIn, and Twitter, and I decided to join Twitter. Yes, I joined the data team, and in that sense it was more the traditional way people think about big data: time series — how people are using your product, what they click on, what they search for, things like that. Lots of time-series data, everything collected on Hadoop. At the time Twitter had Hadoop clusters that could store a lot of data, but of course when you run a Pig job or a MapReduce job, you go take a coffee and come back later. It's optimized for throughput, not for latency from the user's perspective — it can process and store a lot of data, but iteration is slow for analysts and data scientists trying to figure out what's in the data. Twitter also had Vertica as a data warehouse, which is SQL and lower latency, but it didn't scale as well as Hadoop, so it only ever held a fraction of the data.

Vertica, for analytics, exactly.

Exactly. So I was working on the Hadoop side of things, helping people with Pig, looking at Hadoop with its row-oriented storage and high latency on one side, and at Vertica, which didn't scale as much but had lower latency and a better experience, on the other. That's what led to Parquet. I started prototyping something. From my work at Yahoo I was already aware of the Dremel paper, and while I was at Yahoo I had thought a little about how it could have been useful for us, but I never got further than thinking about it. At Twitter I really dug into it: how can we bring more of the Vertica properties to Hadoop — how can we make Hadoop more like Vertica and get lower latency and better query response? The Dremel paper has two parts: the execution engine aspect, and the storage layer — the file format — that they describe. I started implementing the algorithms described in the paper, and once I had the beginning of something I reached out, because I knew I didn't want to build something proprietary to Twitter that we would then have to integrate with everything — that wouldn't have made any sense. That's how I connected with the Impala team at Cloudera, who were prototyping a columnar format as well, for Impala — they were following the same train of thought. So we joined forces, and it was a great collaboration: Impala being all native code execution, for them the collaboration brought the whole JVM ecosystem — I brought the JVM implementation side of things — and for me it brought the validation of an actual columnar query engine on top of the format. I was hooking it up into Pig and Cascading — we were using Scalding, which is built on top of Cascading — so that we could support the Twitter use cases, and they brought the fast query execution use cases, which gave the format a lot of validation.
So we merged designs — I brought some design aspects, they brought some design aspects — and I think that's what made the success of Parquet.

And you open-sourced it, and it became an open standard.

Yes. We originally had different names — we each had one — and we brainstormed a new name, and that became Parquet.

What were the names?

On my end it was called Red Elm, which is an anagram of Dremel. That's what I was going for — I was thinking we'd do the whole thing — but actually I think it made more sense to focus on the file format and enable the ecosystem. There were so many SQL-on-Hadoop projects happening at the time that it made a lot of sense to have a common storage layer for them, rather than building yet another one.

Got it. So you stayed at Twitter for a few years, and then you ended up with probably your closest brush to date with starting a company — I mean, you were on the very early team at Dremio, right?

Yeah, I was there on day one at the start of Dremio, part of the founding team. And Dremio has been very successful since then. One aspect of it is the federated query engine that lets you query all your data where it is, and it has the ability to materialize some of the transformations — a smarter optimizer that can take advantage of that to speed up your queries. The idea is: you plug your Tableau into your data lake and you get the same experience as if you were using a data warehouse or a cube, without having to worry about importing all your data or selecting which subset you put into the warehouse, like we were doing at Twitter. And I think this is really related to all the work I had been doing on Parquet and on optimizing queries — part of it was the storage, and part was the optimizer and the wider query-processing aspect.

As part of that, that's how we started the Arrow project. Within the Parquet community, the next step was to start discussing an in-memory columnar representation: with Parquet we had the on-disk columnar representation, but we also needed an in-memory one. It's a different project, because you optimize things differently when you optimize for pulling data from disk versus processing it in memory really fast. The criteria for the layout are very different — it's still a columnar representation, but how you pick the encodings and so on is very different because of the different characteristics of disk and memory. So different people got connected: Dremio was started by the creators of Apache Drill, which is SQL on Hadoop, and they had an in-memory columnar representation; the Impala people were saying they needed an in-memory columnar representation too. In some ways I connected people, and we started a larger group and said, okay, how do we do this together? Arrow basically started by spinning off the columnar representation in Drill and refining it based on everybody's use cases.
We kicked it off by reaching out to a bunch of people who had this use case: the Impala people, Wes McKinney from pandas, who was very interested in this, the Dremio folks with Drill, and so on, and I also represented the Parquet community in it. Now Parquet and Arrow are very well integrated together and really fast — in Dremio, for example, and in others. It's integrated in Spark now. BigQuery returns Arrow result sets — and BigQuery is interesting because it can both consume Parquet and produce Arrow result sets — and the reason they do that is that you can load your BigQuery results into pandas much faster because they're in Arrow.

That's a great story about how you got the community together around this format, across the industry, from multiple companies. It's neat that you were essentially able to do that twice — once with Parquet and then again with Arrow.

To be fair, Arrow piggybacked on the Parquet community as well; it didn't start from scratch. But yes, I think it's an iteration of how we build communities. It's not exactly the same community, but the motions are similar: how do we build something, how do we own it together, and how do we make it happen as a community?
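A minimal sketch of the Parquet-to-Arrow-to-pandas path he describes, using the pyarrow library; the file name is hypothetical, and this is only an illustration of the on-disk versus in-memory columnar split, not something shown in the interview:

```python
import pyarrow.parquet as pq

# Parquet is the on-disk columnar representation: encodings and layout are
# chosen for compact storage and efficient scans from disk or object storage.
table = pq.read_table("events.parquet")   # hypothetical file name

# Arrow is the in-memory columnar representation: the layout is chosen for
# fast in-memory processing and hand-off between engines and languages.
print(table.schema)

# Many engines (Spark, BigQuery, Dremio, ...) hand results to Python this way:
df = table.to_pandas()
print(df.head())
```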
So then you went to WeWork next, which surprised me a little bit at the time, I have to be totally honest. But I know WeWork was building an amazing data engineering team, and they had big visions for what that team was going to accomplish. What did you learn at WeWork, and how did your career take its next step there?

At WeWork I switched gears a little bit, because I became the architect for the data platform, and that was the time when WeWork was growing exponentially. When I joined there was relatively little data platform: there was Redshift and an in-house scheduler, data was getting imported into Redshift, and that was it. But the company was growing a lot, so part of building out the ecosystem was: how do we ingest events into Kafka and enable collection of events at scale, rather than just importing the data that is already in the production systems? How do we enable stream processing and batch processing that work together? How do we define a good storage layer, using more modern ways of storing data in files like Iceberg or Delta Lake — at WeWork we picked Iceberg at the time — and create a solid foundation? It was a much more generalist role from the data perspective: understanding what a data platform is, how you put together streaming ingestion, stream processing, batch processing, the storage layer — whether the data is in motion or at rest — plus BI and machine learning. Enabling all of that, making sure schemas are defined and that people understand how they depend on each other. A lot of it was picking the best-of-breed existing open source project for each of those components — Iceberg for data-at-rest storage in S3, Spark for batch processing, Snowflake for the warehouse, and so on — and making sure everything connects well together.

And I think one of the big pieces that was missing in this ecosystem is metadata management: how do we have a lineage and metadata repository so this doesn't become a mess? We want to know all the datasets that exist, all the jobs that exist, and how everything depends on everything else, so that teams have visibility. You have many teams in the company that consume and produce data, and they depend on each other; they need tools to understand those dependencies, to know they're not going to break anything — or, if they do break something, to be aware of it and know how to fix it quickly. To be agile and move fast as an organization, you need visibility into what's happening. That led to the creation of Marquez. Marquez is a lineage and metadata repository that keeps track of everything that's running, all the updates to datasets, and all the dependencies — how everything depends on everything else. That's the core of it, and it's what enabled that data platform.

I want to drill into this, because I think we've arrived at one of the key concepts we wanted to talk about, and I want to make sure listeners fully understand what you mean by metadata and data lineage, and what the implications of those concepts are. These are things we're starting to become slightly more familiar with as an industry, but I'd love to hear it in your words: what is data lineage, and what does it mean?

It's interesting, because depending on who you ask, you'll get very different answers to what lineage is. One concept of lineage is: I want to understand how this dataset is derived from others — here is the expression that produces this metric, here is how this value is derived and which table it comes from. From that perspective it's very static: it cares about the current state of the system, how this column is derived from another one. That's one common definition of lineage. Another way of looking at lineage is to understand how all of that changes over time as well. When you look at lineage only from a dataset-to-dataset perspective, you actually lose track of a lot of the metadata and of what's happening, because a dataset — especially in the data world — exists because it's produced by a job, and the runtime characteristics of that job, and how they change over time, are really important to understanding how they affect the dataset. So the lineage stored in Marquez tries to be very granular and keeps track of every run of every job. For a given run — say you have a recurring job that runs every hour, so a run is the hourly run of that job — you know which version of the input dataset you consumed, which version of the output dataset you produced, what the schemas of the input and output were at the time it ran, how long it took, what version of the code it was (if it's stored in git, what the git SHA was), what the query profile and performance statistics were at the time, and what the data quality metrics and assertions were if you run data quality libraries on top of your output. And you store all of that. So now you have lineage, but it's really at the run level.
Or, if you look at it from the data perspective, it's at the dataset-version level. So you can see, over time, how the metadata on a dataset has changed. If something is broken, you can ask when the schema changed and what that caused downstream, or what caused a schema change upstream — maybe someone changed the logic somewhere. Or you may care about data quality metrics changing: suddenly there's a drop in row count, or an increase in null values in a column, or the distribution becomes very skewed. You can keep track of not just how that changes over time, but really connect each change to the run of the job that caused it.

And you're saying that's absolutely required, because having an operational view of how the data is transformed multiple times in the system over time is key to understanding the full lineage of that data. It's not just that this column drives some other column in some other dataset somewhere — if you don't take into account the operational changes and the processing around the data, your data lineage is incomplete. It sounds like that's the discovery you made.

Yes, and I think that's one of the reasons we built Marquez. There were existing lineage solutions in the market — open source projects and other things — but often they focused on a specific use case, collected only the metadata they cared about for that use case, and dropped on the floor a lot of the information we cared about. One of the findings is that once you have really fine-grained operational metadata — once you know how a dataset is connected to a job, which is connected to another dataset — you can transform it to remove the job and get dataset-to-dataset lineage, and feed that into a system that needs it to present some other perspective on lineage. You can turn the fine-grained lineage into dataset-to-dataset lineage, but you cannot go the other way around. So it was really important to start collecting at the right granularity, keeping the metadata at the job level and really understanding how things get transformed, because that informs figuring out all the problems we might have with our data.
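A small illustration of that last point, assuming a toy in-memory representation of run-level lineage (the names and structures are hypothetical, not Marquez's actual model): collapsing job-level edges into dataset-to-dataset edges is a simple projection, while the reverse would require information that dataset-to-dataset lineage no longer carries.

```python
# Run-level lineage: each job run records which datasets it read and wrote.
runs = [
    {"job": "enrich_listings", "inputs": ["raw.listings"], "outputs": ["clean.listings"]},
    {"job": "build_dashboard", "inputs": ["clean.listings", "raw.fx_rates"], "outputs": ["bi.exec_dashboard"]},
]

def dataset_to_dataset(runs):
    """Project run-level lineage down to dataset-to-dataset edges.
    The job and run context is dropped here, which is why this projection
    cannot be reversed to recover the operational view."""
    edges = set()
    for run in runs:
        for src in run["inputs"]:
            for dst in run["outputs"]:
                edges.add((src, dst))
    return edges

print(dataset_to_dataset(runs))
# e.g. ('raw.listings', 'clean.listings'), ('clean.listings', 'bi.exec_dashboard'),
#      ('raw.fx_rates', 'bi.exec_dashboard')
```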
Can you give us an example of a common problem that could be solved if a company had access to proper data lineage, one they might not even think of in those terms yet?

Yes. People care a lot about data freshness and data quality, so you see a lot of solutions coming out that address those from a monitoring perspective. When you talk about data freshness, there are two root causes: one, something is broken, it's not running anymore, and therefore the dataset no longer gets updated; two, everything runs slower than usual, so the dataset does get updated, but very late. From a data quality perspective, the data is simply wrong: there's been a problem at some point — someone pushed a bug, bad third-party data was ingested, or something happened that created a skew in the data. So around data freshness and data quality there are a lot of solutions that focus on monitoring: how do you know whether your data is correct, how do you know whether it's up to date. That's great — it's about being aware when something is wrong — but the very next step, once you know something is wrong, is figuring out the root cause of the problem. And the root cause is extremely rarely in the data you're monitoring itself; it is almost always upstream of it. You may be looking at the executive dashboard that the executives in your company check every morning before breakfast, and when it's broken or wrong, that's bad — but what you need to understand is all the lineage upstream that leads to that dashboard. You need to understand whether a third-party data ingestion it depends on caused the problem, and whether any change in the data pipelines between the ingestion of the data and the delivery of the dashboard is problematic or failing for some reason. So this operational lineage — this detailed knowledge of everything and how it changes over time — is key to understanding why things are slow or incorrect and to fixing them quickly. Because that's the main problem: it takes a long time, since when you start looking at your data you only see the current state of things, and it's really hard to figure out what changed, and when, in correlation with the output data going bad or starting to show up very late.

So you're saying that lineage is the key to a lot of common data problems. I think you also touched on a shared observation I've seen in a bunch of the companies I invest in as well: many of these companies want to solve a particular problem — data quality, monitoring, performance, data cataloging — but they end up backing into their own representation of metadata, as far as they need it for their end solution. They didn't start with metadata or lineage; they started with the problem, which is a valid place for any founder to start, but many of them converge on a similar necessary back end where they're tracking lineage and metadata to the extent their product needs access to it. It's been interesting to see the data ecosystem emerge along that pattern over the last couple of years, and it seems like you've seen that too.

Yes. Lineage is key to understanding the root cause of a problem, and the next thing that's really important is preventing problems altogether. You can follow lineage upstream to find the root cause quickly, but you can also leverage it downstream: before you push a change to production, you can validate whether it's going to break something downstream, based on the dependencies. It's really key to data pipeline reliability — making sure you can trust your data, that it's delivered on time, and that it's correct. Understanding lineage, to me, is core to enabling that, to being more efficient, and to avoiding the state where everything is always broken — or worse, where you don't even know whether it's broken or what to fix first. Knowing what impacts what is really critical to that.
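To make the upstream/downstream idea concrete, here is a small sketch over the same kind of toy dataset-to-dataset edges as above (hypothetical data and function names, not from any real tool): walking upstream from a broken dashboard gathers the candidate root causes, and walking downstream from a dataset you are about to change gathers what the change could break.

```python
from collections import defaultdict

edges = [  # (upstream_dataset, downstream_dataset), as produced by lineage collection
    ("raw.fx_rates", "clean.fx_rates"),
    ("clean.fx_rates", "bi.exec_dashboard"),
    ("clean.listings", "bi.exec_dashboard"),
]

down, up = defaultdict(set), defaultdict(set)
for src, dst in edges:
    down[src].add(dst)
    up[dst].add(src)

def walk(graph, start):
    """Simple graph traversal; returns every node reachable from `start`."""
    seen, frontier = set(), [start]
    while frontier:
        node = frontier.pop()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen

print(walk(up, "bi.exec_dashboard"))   # candidate root causes for a broken dashboard
print(walk(down, "raw.fx_rates"))      # impact analysis before changing an ingestion job
```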
Yeah, I totally get it. So if lineage is so key, and if it's such a panacea for many of our modern data problems, why is it so difficult or challenging for companies to implement a solution in this area? What are the main pitfalls and hurdles you see?

A lot of the lineage solutions we see in the ecosystem focus on extracting lineage from all those systems. People might be using Spark, they might be using warehouses, they might be using a lot of pandas — lots of different systems — so these solutions focus on how to extract lineage from each of those things, and they start depending on all those projects and on their internals, becoming intimately dependent on them. First, that's very complex — it's a lot of work — and second, it's very brittle, because as those systems change and you depend on their internals, which are not usually part of the official, maintained contract that has to stay backward compatible, things break. So all those lineage integrations are expensive and complex to build, and they need to be maintained, re-implemented, and fixed over time. Then you also have to deal with the complexity of maintaining them not just for one version of Spark, but for three or four versions of Spark and three or four versions of Airflow, all evolving in parallel. There's a lot of complexity to all of that, and that's what led to starting OpenLineage: the conclusion was that we really need to solve this in the open source. I think we as engineers have a tendency to look at a system and think, I can read the code, I can understand this, I can reverse engineer it and produce lineage from it — but as people we actually need to talk to each other, because avoiding talking to others and figuring it all out by yourself leads to very complex solutions. So how do we define a standard, so that instead of everybody trying to pull lineage out of the internals, each project can publish its lineage in a standard way and simplify the problem?

Wait, I think I see where this is going. Julien is going to talk to other people in the community, and for the third time in his career he's going to work to create an open standard — this time around lineage. I think I just let the cat out of the bag, but it sounds like it's actually called OpenLineage. Tell me if I'm wrong.

Yeah — we actually started that at the end of last year. And you're right, I took a card out of the Arrow playbook. The conclusion was: there's a need for multiple products that exploit lineage, presenting it for different use cases — people care about data catalogs, they care about operations, they care about governance and compliance. They all need lineage, but they have different use cases, so they need different solutions for how to look at it.
However, the lineage collection — having this fine-grained understanding of dependencies — is widely applicable, and everybody needs it. It would be much easier if we could define a standard way that people expose and consume lineage, so that instead of maintaining a different integration for each of the projects that care about lineage, we can just have one standard. It removes duplication of effort, and everything is less brittle, because each project can expose its own internal lineage or metadata in a standard way, and there is one standard way of doing that which we maintain and which isn't changing all the time. So we did it like Arrow: we started reaching out to people in the community who were interested in this kind of thing, and something like 90% of the projects we reached out to said, of course we need a lineage standard — how come it doesn't exist already? It's been, I don't know, fifteen years; how come we don't have a standard for lineage yet? Well, for it to happen, someone needs to plant the seed — we need to come together and say, let's do this, push it, and create a focal point where we can standardize it and make it happen. So that's what we did. OpenLineage started as a relatively small community: reaching out to people from dbt, from the Airflow community, from Prefect, from pandas, from Spark, from Amundsen and Egeria, reaching out to Great Expectations — and I apologize, I'm forgetting people — and getting together to talk about the right way to do this. And that became OpenLineage. Actually, last week it was officially submitted to LF AI & Data, which is a sub-foundation of the Linux Foundation — the sister to the CNCF, but for the data world.

Okay, great — that's a very cool development.

Yeah, it just happened. Like Arrow, the goal is that it's not a project owned by anyone; it's something we need as an industry, as a data ecosystem. We need a standard way to do this, we need to build it, and the more people contribute to it, the more everybody gets out of it. Good open source motions for that kind of project help build momentum and create this open source flywheel of adoption, like you've seen with Parquet and Arrow. There's a lot of value in building those standards in the open source.

We just added a link to the OpenLineage project in the YouTube chat so folks can check it out. Before we talk more in depth about some of the architecture and design principles of OpenLineage, I just wanted to clarify one thing, because I know you were working on Marquez, which was an original data lineage project at WeWork, and you and Willy have spoken about it at Data Council in years past. Generally speaking, what's the difference between Marquez and the new OpenLineage project, just so we create the right space in people's minds for comparison and contrast?

Marquez is a lineage and metadata repository, and as I mentioned before, one of the findings was that there are very different use cases for inspecting or presenting lineage — governance, data catalogs, and so on — and each system really wants to store and index it in the way it cares about, depending on its use case. Privacy engineering is another use case people care about.
And there was this conclusion that many projects care about extracting lineage from other systems, so there's a lot of duplication of effort and it's hard. OpenLineage really came from the realization that we need something like OpenTelemetry, but for data pipelines. OpenTelemetry is a CNCF project that standardizes metrics, traces, and log collection for services, and traces for services are really similar to lineage for data pipelines — except, of course, that services and data pipelines work differently, so lineage needs its own API. But there's a strong parallel: define a standard for collection that can then be leveraged by many tools, whether open source projects, third-party vendors, or anything else. So OpenLineage focuses on defining the standard for collecting this lineage, and then you can have multiple consumers. Marquez is the reference implementation: Marquez is an OpenLineage consumption endpoint, and OpenLineage is the API for how you emit and consume those lineage events, so that you capture very granular lineage information at the run level. In short, OpenLineage is a standard for lineage collection, Marquez is a reference implementation, and there are others working on implementing OpenLineage as well.

Okay, that helps — thanks for clarifying. So what are some of the core design principles of OpenLineage? How do you go about thinking through how to build something like this?

One of the goals, from my experience building these open standards in the open source — and it's a bit different; we're not a standards body, there's no appointed committee meeting and working for a long time on a big monolithic spec — is that in open source it's really good to decompose every atomic decision, so each one has its own conversation and its own lifecycle. You really want to avoid a big monolithic spec that defines everything. So, starting with this core group of people who were excited about working on OpenLineage, we began with the core spec: what is the minimal thing we can define that collects lineage, and how do we make it extensible so you can add the various pieces of metadata that different people care about for different use cases? The core model of OpenLineage is very simple. There's the notion of a recurring job, identified by a unique name — your typical hourly, daily, or monthly job, though it can be other things too. There's the notion of a run, with a run ID identifying one instance of that job running. And then there are inputs and outputs: a run of a job reads from input datasets and writes to output datasets, with a consistent naming policy so that we can stitch the lineage back together — so two different systems will call the same dataset the same thing. That's the core; it just captures lineage, and it's very simple. That's the first thing we solidified as a group. On top of that, there's the notion of a facet.
A facet is an atomic piece of metadata, and each facet is its own little spec, defined using JSON Schema to specify what the fields are and what the representation is. You can attach facets to the job, to the run, and to the dataset. On the dataset, for example, a facet could be the dataset's schema at the time the job ran, so you can track whether the schema changed and what changed it. It could be data quality metrics at the dataset level — the row count, the null count for each column, distributions, things like that — if you run something like Great Expectations, for example. At the run level you might store things like the query profile for that particular run and how long it took. At the job level you could capture the version of the job in source control, so you can start correlating a change in the output schema with the version of the job, and stitch the lineage back together.

So there are three core types — job, run, and dataset, you mentioned — and a facet can be attached to any of those three to further describe specific metadata for that particular instance. Is that the general idea?

That's the general idea. The core model is very generic, and the idea is, first, that you can add optional metadata depending on the context — it may or may not be available; you may or may not have a schema, depending on what you have. Also, you have different types of datasets: a dataset could be a folder in an S3 bucket, a table in a data warehouse, or a topic in a Kafka broker, and depending on the type, they will have different facets — different pieces of metadata — and you track the version of the dataset differently from one to the other. If you use versioned data storage like Delta Lake or Iceberg, for example, you can keep track of exactly which version was produced by a given run, which lets you easily roll back to the previous version using the storage layer's time-travel features if something goes wrong. The worst-case scenario is not when the job fails; the worst case is when the job runs, produces bad data, and you propagate it all through your stack. That's the kind of thing you can track through this.
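As a rough illustration of that model, here is what a single OpenLineage run event might look like, sketched as a Python dict: one run of a job, its input and output datasets, and a schema facet attached to the output. The shape follows what he describes (job, run, inputs, outputs, facets), but treat the exact keys and values as illustrative rather than a verbatim copy of the spec.

```python
# A sketch of one lineage event for a single run of a recurring job.
event = {
    "eventType": "COMPLETE",                      # this run finished successfully
    "eventTime": "2021-05-27T10:00:00Z",
    "job": {"namespace": "analytics", "name": "hourly_revenue_rollup"},
    "run": {"runId": "c7d4-hypothetical-uuid"},   # one instance of the recurring job
    "inputs": [
        {"namespace": "warehouse", "name": "public.orders"},
    ],
    "outputs": [
        {
            "namespace": "warehouse",
            "name": "public.hourly_revenue",
            "facets": {                           # facets: optional, atomic pieces of metadata
                "schema": {
                    "fields": [
                        {"name": "hour", "type": "TIMESTAMP"},
                        {"name": "revenue", "type": "DECIMAL"},
                    ]
                }
            },
        }
    ],
}
```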
So what has the result of the project been so far? How are you measuring success, or how many folks are adopting the core principles behind OpenLineage so far?

The progress so far: we defined the initial spec, and we started defining facets, so we have some core facets defined. Marquez, of course, is the reference implementation, so anybody who wants to consume OpenLineage can use Marquez for that. We also have integrations. The goal eventually is for OpenLineage to be pushed into every project as a dependency — it's really just adding an interface you can emit information to, and it's relatively easy to integrate because it doesn't add any extra dependencies to your project. It's just an API with no dependencies; from that perspective it's like a logger, like tracing. At the moment we have integrations for Airflow, BigQuery, Spark, Snowflake, and Great Expectations, and there's a prototype of a dbt integration. On the consumption side, I think the most advanced project is Egeria. Egeria is an open metadata project, and they've been working on being able to both consume and emit OpenLineage. Egeria is a bit like a metadata bus that lets you push your metadata to other systems, so it makes sense for it to consume and produce OpenLineage and connect things together.

Oh, and the other aspect of facets I didn't mention: one of the goals is to remove friction from the project — to avoid tightly coupled decisions. Decoupling facets decouples decision-making, so we can move fast on things that are less controversial, while things that are more controversial get their own focused discussion and conclusion. The other thing we added is the notion of custom facets: anybody can attach a facet that is not part of the spec and publish their own definition of it. That makes it really easy for people to experiment and create their own facets without having to get them approved by the project. There are two benefits: first, it unblocks people as much as possible — they can experiment without asking permission for anything — and second, it's a great way to collect what people are doing with it, so that the right way to represent certain things can bubble up into the actual spec.

That's really cool — those are some really interesting design principles, Julien, and it sounds like a really interesting project. I'm curious: if somebody wants to dip their toe in the water with OpenLineage, what's the quickest, easiest way to get started? Sometimes the fear behind projects with such a strong philosophical or opinionated bent is that you have to commit whole-hog to get any value from them. What's the best way for people to think about that trade-off, and what's the smallest way to get started with OpenLineage?

I think the easiest is probably to stand up a Marquez instance and start collecting OpenLineage using the existing integrations. If you use Airflow, Spark, BigQuery, or Snowflake, you can hook OpenLineage into your Airflow, send the lineage events to a Marquez instance, and see the lineage. Marquez also has a lineage API you can script against: you can query the REST API, and there's a GraphQL API, for people who like that kind of thing, to explore the lineage or build on top of it. And if you want to integrate something that isn't integrated yet, there's a JSON Schema spec that is relatively easy to use, and we have a Python client and a Java client in the works that make it easy. So it depends on whether you're interested in introspecting what you already have and seeing what it looks like in Marquez, or in building your own integration — it's all about producing JSON events, and there are clients that help you do that. JSON Schema is the underlying object representation for OpenAPI, for people who are used to OpenAPI and generating code from models, so it's a relatively easy way to push the model around: we have a Python client based on that, and a Java client, that let you easily produce your metadata in that spec.
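A hedged sketch of that "produce JSON events" path, posting an event over plain HTTP instead of using the OpenLineage clients he mentions; the Marquez host, port, and path here are assumptions about a local setup, not taken from the interview, and the event reuses the illustrative shape shown earlier.

```python
import requests  # third-party HTTP client, used here in place of the OpenLineage Python client

# Hypothetical endpoint: a locally running Marquez instance accepting
# OpenLineage events over HTTP (adjust host, port, and path to your setup).
MARQUEZ_URL = "http://localhost:5000/api/v1/lineage"

run_event = {
    "eventType": "START",
    "eventTime": "2021-05-27T09:00:00Z",
    "job": {"namespace": "my_team", "name": "daily_export"},
    "run": {"runId": "0f3e-hypothetical-uuid"},
    "inputs": [],
    "outputs": [],
}

response = requests.post(MARQUEZ_URL, json=run_event)
response.raise_for_status()  # a 2xx response means the event was accepted
```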
Got it. And of course people can find the full documentation on GitHub, and they can get in touch with you — you have the illustrious J_ handle on Twitter that we always hear about in the community. Is there a Slack group for OpenLineage, or any community like that?

The best way to reach out is probably to join the Slack channels. If you go to the OpenLineage GitHub, you'll find links to the Slack, the OpenLineage Twitter, and the whole OpenLineage GitHub organization, so it's relatively easy to find. Please join the Slack and feel free to ask questions — we're very happy to help and answer them, and there are some focused discussion channels in there as well.

Okay, great — that's good to know. Well, Julien, we're almost out of time, but I wanted to ask you a final question: what are the big insights you've learned over your career working with data?

Working on those open source projects — maybe I'll reuse one of those analogies — engineers like to reverse engineer stuff, and to spend time figuring things out on their own, more than they like talking to others. I think that's one of the key findings, and it's why OpenLineage is a people problem more than a technical problem: it's about getting together, agreeing on some key principles, and making it so much easier to build these things. There's this notion of stone soup — I don't know if you know the stone soup children's story: someone comes to a village, goes to the main town square with a big pot of hot water, puts a large stone in it, and starts stirring. People ask, what are you doing? "I'm preparing a stone soup for everyone, but you're welcome to bring your own ingredients." Basically they're making a soup out of nothing — it's just hot water, and the stone is a dummy ingredient that lets them say, look, I've put something in and I'm stirring — I'm just enabling the community. And people bring their own ingredients — let me add carrots, let me add leeks — and the more people contribute, the better the soup becomes for everyone. That's what I'm doing: I'm stirring the pot. There's so much you can achieve by enabling others — how do we get together, how do we own this as a community, how do we make it make sense for all of us — and by building that open source momentum: the more people contribute, the more value there is, so the more people contribute. You've seen that happen for Parquet, and you've seen it happen for Arrow. Yes, it takes years, but it happens, and once it starts, it's kind of unstoppable. Today, if I wanted to stop Parquet, I couldn't — there's nothing I could do.

Well, that's really well said — poetic, even. It's been great to collaborate with you through Data Council over the last few years; we've really appreciated you coming and sharing in forums like this and at our live conferences. It's been great to get to know you, Julien, and I appreciate you stirring the pot — I hope you keep it up.
You're right that engineers aren't always good at talking to each other and at using collaboration to build standards like this, so thanks for all your work in the community, and it was really great to have you chat with us today.

You're welcome — it was a pleasure. Thank you, and have a great day.

And to the rest of the community, thanks for joining us. Please subscribe to the YouTube channel so you can stay abreast of the more amazing DC_THURS episodes we have coming up across the summer. Take care, and we'll see you soon.
Info
Channel: Data Council
Views: 535
Rating: 5 out of 5
Keywords: data engineering, data pipelines, data lineage, open data lineage
Id: SRAzJalG0YM
Length: 56min 35sec (3395 seconds)
Published: Thu May 27 2021