Data Mesh Paradigm Shift in Data Platform Architecture

Captions
Thank you all for coming here right before lunch. Everyone's hungry, but I'm going to do my best to get your mind off food for the next 50 minutes and talk about data mesh, a long overdue paradigm shift in data architecture. I did try to resist using the phrase "paradigm shift", but I couldn't; it ended up in the title, and it's one of the most used and abused phrases in our industry. You've heard the phrase, but do you know where it comes from? Thank you, you're one of the very few people who actually know the origin; the other person I know who knew it was our CTO, Rebecca Parsons. As you rightly said, in 1962 Thomas Kuhn, an American physicist, historian, and philosopher of science, wrote The Structure of Scientific Revolutions and coined the term "paradigm shift" in that very controversial book; he upset quite a few scientists at the time. What he shared were his observations about how science progresses through history. Scientists start in a phase he called normal science, where they work within the assumptions and theories of the existing paradigm, making observations to see what they expect to see and prove what they expect to prove. Not a whole lot of critical thinking goes on there, and you can imagine why scientists weren't so happy about the book. Then they start running into anomalies: observations that don't quite fit the current norm. That's when they enter a phase of crisis; they start doubting what they believed to be true, they start thinking out of the box, and that's where the paradigm shift to revolutionary science happens. We go from incremental improvements in a scientific field to a completely new order. An example: when scientists couldn't make sense of their observations at the subatomic level, we had the paradigm shift from Newtonian mechanics to quantum mechanics.

What does this have to do with modern data architecture? I think we are in that crisis phase, in the Kuhnian sense. The paradigms we've adopted over the last 30, 40, 50 years for how to manage data don't really solve our problems today. The inconvenient truth is that companies are spending more and more on data. There's an annual NewVantage survey of executives at large companies, and what it found is an immense increase in the pace of investment, with budgets over the course of one year growing to somewhere between 50 million and 500 million dollars and above, despite the fact that the leaders in those organizations are seeing a decline in their confidence that the money is actually producing measurable results. And even though there are pockets of innovation in using data (we don't have to go far, just look around Silicon Valley at how the digital natives are using data to change their businesses), the incumbents, a lot of large organizations, are measuring themselves as failing on every transformational measure: are they using data to compete, are they using analytics to change their business, have they changed their culture?
Now, I don't want to underestimate the amount of work that goes into the multifaceted change and transformation an organization needs in order to actually use data to change the way it behaves: changing culture, changing incentive structures, changing how decisions get made. But technology has a big part in it, and this is the architecture track, so that's where I'm going to focus.

The current state, the currently accepted norm and paradigm, has split the architectural landscape into two spheres with hardly any intersection. There is the sphere of operational systems, where the microservices live and where the systems running the business operate: your e-commerce, your retail, your supply chain. We've seen an immense amount of improvement over the last decade in how we run our operational businesses; you only have to visit the microservices track or the DevOps track to see how far we've moved. Then on the other side of the organization, down the hall in the data department, we have the big data analytical architecture, whose purpose is: how can I optimize the business, how can I run it better, so I can upsell, cross-sell, personalize the experience of my customers, find the best route for my drivers, see the trends of my business; BI, analytics, ML. That side has very different architectural patterns and paradigms that we have come to accept.

Within that sphere of big data architecture, there are three big generational technologies I see while working with clients, starting with data warehousing. Do you know when the first research and implementations of the data warehouse entered the industry? Forty years ago? Close: the first research papers were in the late sixties, and the data marts and implementations came in the seventies. So we had data warehousing; around 2010 we evolved to the data lake; and lately, the data lake on the cloud. In the data warehousing paradigm, the job has always been: get the data out of the operational systems, usually by running some job that goes into the guts of the database and extracts it, and before you use the data, model it into that one model that is going to solve all the problems and world hunger, into snowflake schemas or star schemas, then run a bunch of SQL-like queries over it so you can build dashboards and visualizations and put a human behind the analytical system to see what on earth is going on in the business. As for the technologies in this space (disclaimer: this is not an endorsement, just a random selection of representative tools), you have the cloud providers' offerings like BigQuery, or Power BI if you're on Azure, which give you the full stack to land the data in hundreds of tables and query it in different ways, and then you have your dashboards and analytics on top for reporting.
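To make that warehouse-style modeling concrete, here is a minimal, illustrative sketch of the kind of star-schema query the talk is describing: a fact table joined to dimension tables and aggregated for a dashboard. The table and column names are hypothetical, and pandas stands in for whatever SQL engine a real warehouse would use.

```python
import pandas as pd

# Hypothetical star schema: one fact table plus two dimension tables.
fact_claims = pd.DataFrame({
    "claim_id": [1, 2, 3, 4],
    "member_key": [10, 10, 11, 12],
    "date_key": [20200101, 20200102, 20200102, 20200103],
    "claim_amount": [120.0, 80.0, 430.0, 55.0],
})
dim_member = pd.DataFrame({
    "member_key": [10, 11, 12],
    "region": ["west", "east", "west"],
})
dim_date = pd.DataFrame({
    "date_key": [20200101, 20200102, 20200103],
    "month": ["2020-01", "2020-01", "2020-01"],
})

# The typical warehouse workload: join facts to dimensions, then aggregate
# for a report ("total claim amount per region per month").
report = (
    fact_claims
    .merge(dim_member, on="member_key")
    .merge(dim_date, on="date_key")
    .groupby(["region", "month"], as_index=False)["claim_amount"]
    .sum()
)
print(report)
```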
But the data warehouse, which we've used for about forty years, has been problematic at scale. The notion that we can take data from all these different, complex domains, put it into one model with thousands of tables and thousands of reports, and then actually use it in an agile, nimble way has been an unfulfilled promise.

So we improved, we evolved, and we said: don't worry about all that up-front modeling; just get the data out of the operational systems and bring it to this big, fat data lake in its original form, and deal with modeling afterwards. Then we throw a few data scientists in to swim in that lake and figure out what insights they can discover, and we model the data for downstream consumption in a fit-for-purpose way, whether that's specific databases or a data warehouse further down the line. That has also been problematic at scale: the data department running Hadoop clusters, or other ways of storing this big data, hasn't been very responsive to the scientists who need to use it. The kind of technology you see here is big storage such as blob storage, because now we're storing data in its native format; tools like Spark for processing the data, to join, filter, and model it; and orchestrators like Airflow to orchestrate those jobs.
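As a rough illustration of the pipeline-centric code this paradigm produces, here is a minimal sketch assuming PySpark and made-up blob-storage paths; a real lake would have many jobs like this stitched together by an orchestrator such as Airflow.

```python
from pyspark.sql import SparkSession, functions as F

# Illustrative lake-ingestion job: raw operational events land in blob storage,
# get lightly cleansed, and are written back out as Parquet for data scientists
# to dig through later. Paths and column names are hypothetical.
spark = SparkSession.builder.appName("claims-raw-ingest").getOrCreate()

raw = spark.read.json("wasbs://landing@lake/claims/2020-03-03/*.json")

cleansed = (
    raw
    .dropDuplicates(["claim_id"])
    .filter(F.col("claim_id").isNotNull())
    .withColumn("ingest_date", F.current_date())
)

# "Raw" zone of the lake: still pipeline-centric, with no notion of a domain
# owning this data as a product, which is exactly the problem the talk raises.
cleansed.write.mode("append").partitionBy("ingest_date").parquet(
    "wasbs://raw@lake/claims/"
)
```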
A lot of the clients I work with are still not satisfied; they still don't get value at scale, in a responsive way, from the data lake. So naturally the answer is: move the lake onto the cloud. Cloud providers are competing to get your data into the cloud and to provide services that are easier to manage, and they're doing a great job, but essentially they follow the same paradigm. This is a sample solution architecture from GCP, and I can promise you that if you google AWS or Azure the diagrams look pretty much the same: on the left-hand side your operational systems, your OLTP, everything flowing through batch and stream processing into the data lake, and downstream you model it into BigQuery, or BigTable if you want to go faster. It looks wonderfully convincing: fabulous technology wiring the data from left to right into this big cloud.

But I want to step back for a minute, look at the 50,000-foot view, at the essential characteristics commonly shared across these solutions, and get to the root cause of why we're not seeing the benefits we need to see. From 50,000 feet, I have seen so many enterprise data architectures that look pretty much like this, obviously drawn with fancier diagrams than my squiggly hand drawing, but essentially one big data platform, a data lake or data warehouse, whose job is to consume data from hundreds of systems (the orange boxes drawn across the organization, or beyond its bounds), cleanse it, process it, serve it, and satisfy the needs of hundreds of consumer use cases: feeding the BI reports, empowering the data scientists, training the machine learning algorithms. And in that solution architecture there is nowhere a discussion of the domains, of the data itself. We always talk about throwing the data into one place; in this monolithic architecture, the idea of domains and of the data itself is completely lost.

The job of the architects who find themselves with this big architecture is to somehow break it down into pieces so that different teams can be assigned to implement the functionality of the different boxes. One way companies at scale try to break down their architecture is this: they design ingestion services that get the data out of devices or operational systems, they have a processing team building the pipelines, and they have teams working on the APIs or downstream databases that serve the data. I know I'm simplifying; behind this is actually a labyrinth of data pipelines stitched together. But step back for a minute: what are we seeing here? A layered architecture whose top-level decomposition is based on technical capability, serving, ingesting, and so on; the boundaries are technical functions. Tilt your head 90 degrees and you have seen this before. We had layered enterprise architecture with UI, business logic, and databases underneath, and what was wrong with that? Why did we move from that to microservices? Because change does not stay constrained to the boxes we've drawn on paper; change happens orthogonally to these layers. If I want to introduce a new signal from my device and process it, if I want to introduce a new source or a new model of interest, I pretty much have to change all of these pieces, and that is a process full of friction, handovers and handshakes to make sure the change lands consistently across the layers.

If we come a little closer and look at the life of the people who actually build and support this architecture, what do we see? A group of people siloed: data engineers and ML engineers in the middle, stuck between the world of operational systems that generate the data and the world of consumers who need to use it, without any domain expertise. I really don't envy the life of the data engineers I work with, and I'm hoping we can change their lives right here, right now. What happens is that the people running the operational systems have no incentive to provide their analytical data, the historical snapshots, the events, the facts of the business, to the rest of the organization in an easily consumable way. They are incentivized to run their operational business: to run that e-commerce system and build a database optimized for running that e-commerce system. On the other side, the consumers are hungry for data; they need it to train their machine learning, they're frustrated because they constantly need to change and modify it, and they depend on the data engineers in the middle.
And the data engineers in the middle are under a lot of pressure: they don't understand the data coming to them, they don't have the domain expertise, and they don't know how the data is being used; they've essentially been siloed by tool expertise. Yes, we're at a point in the evolution of the technology where data tooling is still a fairly niche space; knowing Spark and Scala and Airflow is niche compared to general software engineering. But we've seen these silos before. We saw the silo between dev and ops, and the wall came down, we brought the folks together and created a whole new generation of engineers, the SREs, and that was wonderful, wasn't it? Keeping the silos just leaves us with a difficult process full of friction.

Just to show the skill-set gap we're facing, and will continue to face with the wall in between, here are some stats you can get from LinkedIn; my last search was a few weeks back and I doubt things have changed much in three weeks. If you look for open jobs labeled "data engineer" you find about 6,000 on LinkedIn, and if you look for people claiming to be data engineers on the platform you see about 37,000, and I'm pretty sure all of them are in good jobs with good pay. So there is a huge gap in the skill set, and we can't close it while that wall stays in place.

This centralized, monolithic paradigm was perhaps fine at a smaller scale, but the world we live in today is one where data is ubiquitous, where every touch point, every action and interaction generates data, and the businesses are driven to innovate. That cycle of innovation, test and learn and observe and change, requires constant change to the data, constant modeling and remodeling, and this centralized system simply doesn't scale: a centralized, monolithic system that has divided the work by technical operation and is implemented by siloed folks. Going back to Thomas Kuhn's observation: with the data warehouse, the lake, and the lake on the cloud, which is what we've been doing for forty or fifty years, we've been stuck in normal science. We believed the only way to get use out of data was to pour it into one big data lake or platform and get our arms around it so we could make sense of it. That centralization was the dream of the CIOs of thirty years ago: "I have to centralize the data, because it's siloed in these databases I can't get into." And that's the paradigm shift I'm hoping to introduce.

So let's talk about data mesh. Hopefully so far I've nudged you to question the existing paradigm. I'm going to go a bit top-down, because my mental model is fairly top-down: talk about the principles that drive this change, then go deeper into some of the implementation, and hopefully leave you with a couple of next steps. The principles underpinning the mesh are basically the ingredients of the best and most successful projects we've had globally at ThoughtWorks, applying the learnings of modern architecture from the adjacent world of operational systems and bringing them to data. The very first is decentralization: how can we apply domain-driven thinking and distributed architecture to data? Then: how can we completely hide the complexity of the infrastructure that runs and operates big data? I don't want to trivialize that; it is very hard to operate a Kafka cluster at scale.
It is very difficult to run your Spark cluster. So how can we abstract that complexity away into self-serve data infrastructure with platform thinking? To avoid silos of hard-to-find, hard-to-use, meaningless, untrustworthy data, how can we apply product thinking and really treat data as an asset? And finally, to have a harmonious, well-behaved ecosystem, what sort of governance do we need to bring to the table? I'm going to go through each of these one by one.

Domains and distributed architecture first. Who has read Eric Evans's Domain-Driven Design? About ten percent; go on Amazon and get the book. Domain-driven design, and domain-driven distributed architecture, introduces the idea of breaking monolithic systems into pieces designed around business domains. What we just described was breaking the centralized, monolithic data platform down around pipelines, around the jobs of the different pipelines. Now we apply a different approach: find the domains. The examples I've put up here are from health insurance, because that's where I'm currently waist-deep with a client implementing their next-generation data platform. When you think about operational domains, a lot of organizations are already divided that way: in healthcare you have the claims systems that process pharmaceutical or medical claims, you might have your lab results, and so on. These are the different domains you see in this space, and you can use these data domains as the way to decompose your architecture.

You often find domains that are very close to the source, where the data originates. In this example, claims: the claims systems are already accepting, rejecting, or processing claims, so those systems are essentially generating historical, analytical data about claims. These are domains close to the facts of the business as they are generated; we're talking about immutable, historical data that will keep being generated forever and simply stays there, and these source-aligned data domains hardly change, because the facts of the business don't change that much. Of course there are industries where the app and its features change constantly, so the signals coming from it change too, but in bigger organizations these tend to be more permanent, static data domains. And then you have domains that you refine, that you create based on the needs of your business: aggregate data domains.
In this example I've used "patient's critical moments of intervention", which is a wonderful use case and data set the client I'm working with right now is generating: aggregating a lot of information about members, their behavior, their demographics, their changes of address, and applying machine learning to find the moments when, as an insurance provider, I need to reach out to a member and say: hey, you need to do something about your health; you've just changed your address, you haven't seen a doctor for a while, you probably don't have a support network yet, you haven't picked a doctor or done your dental checkups, so go and visit doctor so-and-so. Creating these data sets, these aggregate views, or the holy grail of healthcare data right now, longitudinal patient records that aggregate all of your clinical visits and lab results into a time series, gives you more consumer-oriented, designed domains, and theoretically we should always be able to regenerate them from those source-aligned, native data products.

So where did the pipelines go? The pipelines still exist. Each of these data domains still needs to ingest data from somewhere upstream, maybe just the service next door that implements the operational functionality, cleanse it, and serve it. But the pipelines become a second-class concern; they become the implementation details of these domain data sets, these domain data products. Towards the source-aligned side you see more cleansing, integration and integrity testing built into the pipelines to get an accurate source of data out; towards the consumer-facing, aggregate views you see more modeling, transformations, joins, and filters.

In summary: with distributed, domain-driven architecture, your first architectural partition becomes these domains and their domain data products, which I'll describe in detail towards the end. I really hope we stop using the data pipeline as a first-class concern; every time I ask someone with data engineers "can you draw your architecture?", they just talk about pipelines. Pipelines are implementation details; what really matters is the data itself and the domain it belongs to. There's a wonderful concept, the architectural quantum, a term from Neal Ford and Rebecca Parsons, co-authors of the evolutionary architecture book: the smallest unit of your architecture that has high cohesion and can be deployed independently of the rest. We're moving to a world where the architectural quantum becomes these domain data products, essentially immutable data showing the snapshots and history of the business.
But how do we avoid the problem we had to move away from with centralization, now reappearing as silos of databases and data stores spread across these domains, where nobody knows what's there or how to get to it? That's where product thinking helps. Are there any technical product owners or product owners in the room? A few; welcome. It has become quite common to think about the technical platforms we build as products, because the developers and data scientists are the consumers and customers of those platforms, and we should treat them as such. Ask any data scientist today and they'll tell you they spend 80 to 90 percent of their time just finding the data they need, making sense of it, cleansing it, and modeling it before they can use it. So why don't we apply product thinking to delight that data scientist's experience and remove that 80 to 90 percent of waste?

What does that mean? It means each of the domains we talked about, say the claims domain, treats its historical, analytical data as a product. Yes, it has multiple shapes; it's a polyglot data set: you might have streams of claims for users who prefer real-time or near-real-time events, and buckets of batch files or historical snapshots for data scientists, because they love archived files and batch processing for 80 percent of their job. But for data to be an asset, and to be treated as one, there are characteristics each of these data products needs to carry. First and foremost, they need to be discoverable; Chris mentioned in the previous talk that the world of data cataloging is on fire, in a good way, with tons of applications, because discoverability is the first characteristic of any healthy data platform. Once discovered, a product needs to be programmatically addressable, so we can get access to the data easily. It needs to be trustworthy: as a data scientist or analyst, if I can't trust the data, I will not use it. It's interesting that in the world of APIs and microservices, a microservice without an announced uptime, without an SLO, would be considered crazy; you have to understand your commitment to the rest of the organization. Why can't we apply the same thing to data? Maybe you have real-time data with some missing events and some inconsistencies; that can be acceptable, you just have to announce and support it explicitly so people can trust the data they're using. Good documentation: a description of the schema, who the owners are, anything that helps data scientists and data users self-serve with your product. Interoperability: in a distributed world, if I can't join the customer from the sales domain to the customer from the commerce domain, I really can't use these pieces of data; so unifying the IDs or field formats to allow joins, filters, and correlations is another attribute of a data product. And finally security; it's such a privilege to talk after Chris, because I can just point to his talk about RBAC and applying access control in an automated way at every endpoint, every data product.

These things don't happen out of good intention alone; we need to assign people specific roles. A particular role we define when building these data products, this data mesh, is the data product owner: someone whose job is to care about the quality, the future, and the lifecycle of a particular domain's analytical data, and to evangelize it to the rest of the organization ("come and see, I've got this wonderful data you can tap into") and show how it creates value. In summary: bring the best practices of product development and product ownership to data.
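As a sketch of how those characteristics could become explicit and machine-readable, here is a hypothetical, minimal descriptor for a domain data product. The field names are illustrative, not a standard; the point is only that discoverability, addressability, SLOs, interoperability, and ownership become concrete attributes of each product that a catalog could index.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class OutputPort:
    name: str          # e.g. "claims-events"
    kind: str          # "stream" | "files" | "sql" | "api"
    address: str       # programmatically addressable location
    schema_url: str    # where consumers fetch the schema

@dataclass
class DataProductDescriptor:
    domain: str                                           # owning domain
    name: str
    owner: str                                            # the data product owner
    description: str                                      # documentation for self-serve use
    output_ports: List[OutputPort] = field(default_factory=list)
    slos: Dict[str, str] = field(default_factory=dict)    # trustworthiness, announced explicitly
    global_id_fields: List[str] = field(default_factory=list)  # interoperability via federated IDs
    access_policy: str = "rbac:default"                   # security, enforced by the platform

claims = DataProductDescriptor(
    domain="claims",
    name="claims",
    owner="claims-data-product-owner@example.com",
    description="Unified historical and near-real-time claims across channels.",
    output_ports=[
        OutputPort("claims-events", "stream", "kafka://broker/claims.events",
                   "https://catalog.example.com/claims/events/schema"),
        OutputPort("claims-snapshots", "files", "abfss://claims@lake/snapshots/",
                   "https://catalog.example.com/claims/snapshots/schema"),
    ],
    slos={"timeliness": "events within 5 minutes", "completeness": "99.9%"},
    global_id_fields=["member_id"],
    access_policy="rbac:claims-consumers",
)
print(claims.name, [p.kind for p in claims.output_ports])
```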
If you're putting one of these cross-functional teams together with a data product owner, I would start by asking for one success criterion, one KPI to measure, and that is delighting the data users: decreasing the lead time for someone to come and find that data, make sense of it, and use it. That's the only measure I would track first, and then of course growth, the number of users using it.

If you've been listening so far, you're probably wondering: what are you asking of me? This is the question a lot of CIOs, the people who actually spend the money, ask: you're telling us to distribute the ownership of analytical data to different domains and create different teams, and then what happens to all that technical complexity, the stack needed to implement each of these pipelines? Each of them needs some sort of data lake storage, a storage account set up, clusters to run its jobs, probably some services. There's a lot of complexity there, plus decisions like keeping your compute close to your data or having a consistent storage layer. If we simply distribute all of that, we create a lot of duplication: duplicated effort and probably inconsistencies. That's where our experience from the operational world of building infrastructure as a platform comes into play; we can apply the same thing here. Capabilities like data discovery, setting up storage accounts, all the metalwork we have to do to spin up one of these data products, can be pushed down into self-serve infrastructure supported by a group of data infrastructure engineers.

To give you a flavor of the complexity that needs to be abstracted away, here's a list. Out of that list, if I had a magic wand and could ask for one thing, it would be unified data access control. Right now it is a nightmare to set up unified, policy-based access control across different storage mediums: access control on your buckets, or your Azure ADLS, versus your Kafka, versus your relational database; every one of them has a proprietary way of supporting it. There are technologies emerging to help, like future extensions to Open Policy Agent, but there's a lot of complexity there. In summary, platform thinking, self-serve data infrastructure, exists to build all of the domain-agnostic complexity that supports the data products. When we set up one of these data infrastructure teams, which we often do very early in a project, the metric they get measured by is the amount of time it takes a data product team to spin up a new product: how much complexity they can remove from the job of the data product developers, so that it takes very little time to get data from a domain and provide it, in polyglot form, to the rest of the organization. That's their measure of success.
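Here is a hypothetical sketch of the facade a self-serve data platform team might offer to data product teams. Every function is an invented stand-in (stubbed with prints) for real provisioning work; the point is that the domain-agnostic metalwork (storage, streams, pipeline scaffolding, access policy, catalog registration) becomes one call, and the platform team's success metric is effectively the latency of that call.

```python
# All names below are illustrative; none of this is a real product's API.

def create_storage(name): print(f"provisioned lake storage for {name}")
def create_stream_topic(topic): print(f"created stream topic {topic}")
def scaffold_pipeline(name, sources): print(f"scaffolded pipeline {name} <- {sources}")
def apply_access_policy(name, policy): print(f"applied policy {policy} to {name}")
def register_in_catalog(name, owner): print(f"registered {name}, owner {owner}")

def provision_data_product(spec: dict) -> None:
    """Spin up the domain-agnostic scaffolding for one data product."""
    create_storage(spec["name"])
    for topic in spec.get("streams", []):
        create_stream_topic(topic)
    scaffold_pipeline(spec["name"], spec["sources"])
    apply_access_policy(spec["name"], spec["access_policy"])
    register_in_catalog(spec["name"], spec["owner"])

provision_data_product({
    "name": "claims",
    "owner": "claims-data-product-owner@example.com",
    "sources": ["online-claims.events", "call-center-claims.snapshots"],
    "streams": ["claims.events"],
    "access_policy": "rbac:claims-consumers",
})
```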
Anybody who has worked on distributed systems knows that without interoperability, a distributed system falls on its face. If you think about microservices and the success of APIs, we had one thing we all agreed on, HTTP and REST, and we pretty much all agreed it was a good idea to get these services talking to each other based on some standardization; that was the key to the API revolution. We need something similar here, so that independent data products serving data from different domains can be correlated, joined, processed, and aggregated. What we do is form a federated governance group, with folks coming from the different domains, to decide which standardizations to apply. Of course there will always be one or two data products that are unique, but most of the time you can agree on a few standards. The areas we standardize first: how each data product describes itself so it can be discovered, the APIs for describing and discovering a data product, which I'll share in a minute. Another area we work on very early is federated identity management. In the world of domain-driven design there are entities that cross domain boundaries, customers being one, members another, and every domain has its own notion of their identity; these are what we call polysemes. You can build inference services, using machine learning, that identify the same customer across different domains from a subset of attributes and generate a global ID, so that when I publish a data product out of my domain I can internally transform to a globally identifiable customer ID, and my data product is now consistent with the other data products that carry the notion of a customer. Most importantly, we try to automate all of the governance capabilities: federated identity management is one, access control is another; how can we push policy configuration and enforcement for accessing polyglot data down into the infrastructure?

So let's bring it together. What is data mesh, in one breath, in one sentence? A decentralized architecture where the unit of architecture is a domain-driven data set, treated as a product, owned by the domains or teams that most intimately know that data, either creating it or consuming and re-sharing it, with specific roles given the accountability and responsibility to provide that data as a product, and with the complexity abstracted away into a self-serve infrastructure layer so these products can be created much more easily. All right, ready for a real-world example?
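As a toy sketch of the polyseme problem just described: before publishing, a domain maps its local member identifier to a global ID agreed through federated governance. The dictionary below stands in for the ML-backed identity inference service the talk mentions; every name and record is invented for illustration.

```python
# Hypothetical mapping produced by a federated identity service.
GLOBAL_ID_BY_LOCAL = {
    ("claims", "C-1042"): "member-7f3a",
    ("members", "M-99"):  "member-7f3a",   # same person, different local IDs
}

def to_global_member_id(domain: str, local_id: str) -> str:
    try:
        return GLOBAL_ID_BY_LOCAL[(domain, local_id)]
    except KeyError:
        raise ValueError(f"no global identity resolved for {domain}:{local_id}")

def publish_claim_event(event: dict) -> dict:
    """Internal transformation applied at the claims domain's output port."""
    out = dict(event)
    out["member_id"] = to_global_member_id("claims", out.pop("local_member_id"))
    return out

print(publish_claim_event(
    {"claim_id": "1042", "local_member_id": "C-1042", "amount": 430.0}
))
```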
OK, this is a real-world example from the health insurance domain. In one corner we have a domain we call call center claims. These organizations have been running for fifty years, so they usually have some legacy; this is an online call center application, a legacy system whose owners and original authors are no longer with us, so we had no option but to run change data capture as the input into a data product we call call center claims, running within that domain. What this data product provides is daily snapshots of the call center claims, because that's the best representation of the data we can get out of the state of that legacy system.

In the other corner of the organization we have brand-new microservices handling online claims. They're new, the developers are sharp and constantly changing them, and they provide claim events as a stream. So we bundle a data product within that domain, the online claims data product, which takes data from the claims event stream and provides two polyglot outputs: one is, similarly, the events it receives, with a bit of transformation (it unifies the IDs and a few field formats we agreed upon), and for data scientists it also provides Parquet files in some sort of data lake storage.

But much of the downstream organization doesn't want to deal with the duality of whether a claim came through the online system or the call center. So we created a new data product we simply call claims, which consumes from the output ports of those upstream data products, the online claims and the call center claims, and aggregates them into one unified view. It provides a stream, because it wants to preserve the real-time nature of the online claims (the events synthesized from the legacy system's daily changes are less frequent), and it also provides snapshots. So now we have a claims domain, and we can play this game forever. Let's continue: on another side of the organization you have members, the people who deal with registering new members, changes of address, changes of marital status, and they happen to provide member information today as buckets of file-based data. We had a wonderfully ambitious plan to use machine learning to aggregate information from claims, from members, and from a bunch of other upstream data products, and create a new data product that gives staff information about members who need some kind of intervention for better health, which ultimately means fewer claims and lower cost for the insurer. That downstream data product, member interventions, actually runs a machine learning model as part of its pipeline. So the ones you saw earlier are native, source-aligned data products closer to the source, and as we move downstream we get aggregated, newly modeled, consumer-oriented ones.
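To illustrate what the unified claims product does with its two upstream ports, here is a toy sketch: normalize events from the online stream and events synthesized from the call-center daily snapshots into one shape, then merge them in time order. In production this would be a streaming or Spark job; the records and field names here are invented.

```python
from datetime import datetime, timezone

online_events = [
    {"claim_id": "1042", "member_id": "member-7f3a", "amount": 430.0,
     "occurred_at": "2020-03-02T10:15:00+00:00", "channel": "online"},
]
call_center_daily = [
    {"claim_id": "8831", "member_id": "member-2c91", "amount": 55.0,
     "snapshot_date": "2020-03-02", "channel": "call-center"},
]

def normalise_snapshot(row: dict) -> dict:
    # Synthesize a coarse event time from the daily snapshot date.
    out = {k: row[k] for k in ("claim_id", "member_id", "amount", "channel")}
    out["occurred_at"] = row["snapshot_date"] + "T00:00:00+00:00"
    return out

# The unified output port: one time-ordered stream regardless of channel.
unified = sorted(
    online_events + [normalise_snapshot(r) for r in call_center_daily],
    key=lambda e: datetime.fromisoformat(e["occurred_at"]).astimezone(timezone.utc),
)
print(unified)
```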
One of the questions that puzzles a lot of new clients is: what exactly is this data product, what does it look like? It's hard to picture because we're inverting the mental model; the mental model has always been upstream into the lake, then the lake converted into downstream data systems, very much a pipeline model. It looks like this: it looks like a little bug. This is your unit of architecture, and I should say this is the first incarnation of it; we've been building it for a year, so hopefully you'll take it away, make it your own, make your own bug and build a different model. But this is what we're building.

Every data product I just showed, claims, online claims, and so on, has a set of input data ports configured to consume data from upstream: streams, file dumps, CDC, or APIs, depending on how it consumes from the upstream systems or upstream data products. It has a set of polyglot output data ports, the data it serves to the rest of the organization, which could be streams, files, SQL-style query interfaces, APIs, whatever makes sense for that domain as a representation of its data. Then there are two other lollipops on the diagram, what we call control ports, because every data product is responsible for two more things beyond consuming and providing data. The first is being able to describe itself: all of the lineage, the metadata, the addresses of the input and output ports people care about, the schemas, everything comes from this endpoint, and if there is a centralized discovery tool, it calls this endpoint to get the latest information. And because of GDPR, CCPA, and the audit requirements governance teams usually have, we also provide an audit port. If you think about on-prem to cloud movement: when your upstream happens to be on-premise and your downstream on the cloud, the copying from on-premise to cloud happens in the port configuration. If you look inside, you'll see a data pipeline, a bunch of data pipelines in fact, copying the data around, transforming it, snapshotting it, whatever is needed to serve it downstream, and we also deploy a set of services, or sidecars, with each of these units to implement the APIs I just mentioned, the audit API and the self-description API.

As you can see, it's quite a hairy little bug. In the microservices world it's wonderful: you build a Docker image and inside it sits all the complexity implementing its behavior. Here, the CI/CD pipeline, which is independent for every data product, actually deploys a bunch of different things. For example on Azure, the input data ports are usually Azure Data Factory, the data connectors that get the data in; the pipelines are Databricks Spark jobs; the storage is ADLS; there's a whole set of things that need to be configured together as one data product. Apologies to the people at the back of the room; Gwen will share the slides. But essentially discoverability is a first-class concern: you can hit a RESTful endpoint to get the general description of each of these data products, hit an endpoint to get the schemas of the output ports you care about, the documentation, and so on.

So where is the lake, where is the data warehouse? They're not really on this diagram. The data warehouse as a concept, having BigQuery, or BigTable for fast queries, can be a node on the mesh, and the lake as storage can still be a consistent storage layer underneath all of these data products, but they are no longer a centralized piece of the architecture.
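From the consumer side, that self-description control port might be used roughly like this. The URLs and response shape are entirely hypothetical, invented only to show that discovery is a programmatic call rather than a conversation with the data team.

```python
import requests  # third-party: pip install requests

# Hypothetical mesh catalog address for the "claims" data product.
BASE = "https://mesh.example.com/data-products/claims"

description = requests.get(f"{BASE}/describe", timeout=10).json()
print(description["owner"], [p["name"] for p in description["output_ports"]])

schema = requests.get(f"{BASE}/output-ports/claims-events/schema", timeout=10).json()
print(schema["fields"])
```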
The paradigm shift we're talking about is from centralized ownership of data to decentralized ownership; from monolithic architecture to distributed architecture; from pipelines as a first-class concern to domain data as a first-class concern; from data as a byproduct, the exhaust of an existing system, to data as a product that we serve; and instead of siloed data engineers and ML engineers, cross-functional teams with data ownership, accountability, and responsibility. Every paradigm shift needs a language shift, and here are a few tips on how to use different language in our everyday conversations to change the way we imagine data: not as this big divide between operational systems and analytical systems, but as one piece of architecture that is the fabric of our organization.

Hopefully by now I've nudged you to question the status quo, the fifty-year paradigm of centralized data architecture, and to get started implementing this decentralized one. There's a blog post which, to be honest, I wrote in a week out of frustration, and I hope it helps. If you have any other questions, feel free to reach out. My colleagues are here too: Scott Davis, who also gave a talk, and Jared is somewhere here. We are also hiring, so if you want to implement and spearhead the data mesh paradigm shift, come and talk to us. Thank you. [Applause]
Info
Channel: InfoQ
Views: 37,873
Keywords: Data Mesh, Artificial Intelligence, Machine Learning, Service Mesh, Software Architecture, Microservices, Data, Data Science, InfoQ, QCon, QCon San Francisco, Transcripts, Paradigm Shift
Id: 52MCFe4v0UU
Length: 48min 7sec (2887 seconds)
Published: Tue Mar 03 2020