The hardest part of microservices is your data

Video Statistics and Information

Captions
All right, it's 10:15, I think we'll get started. It's pretty loud, so hopefully you can hear me well. The only problem is that I had a cold last week and I'm still trying to get over a terrible cough, so when I cough I hope I don't blow out your eardrums. I hopefully put enough buzzwords in the title to get you to come in here. How many people are looking at doing microservices-style architectures? Okay, you came to the right talk, I think.

I only have 45 minutes, and I'm being a bit crazy because I have so many slides and so many things to say on this topic, and I'm also going to try to do a demo. This talk is actually much, much longer; the full deck is about 90 or 100 slides and it's available at the link here, while the deck I'm using today is about 50, so I'll have to keep a really good pace and might go a little quick. At the end I'll try to leave time for questions, and after the talk I'll hang around so we can have questions afterwards.

So let's get going. First, real quick: my name is Christian, and I'm a principal architect at Red Hat. I spend a lot of time working with our customers in North America, helping them understand how to build resilient distributed systems. My background is in integration and messaging. I spent a lot of time at Zappos.com and their parent company Amazon — this was maybe five years ago — and got to see what DevOps and microservices look like in a mature, high-performing organization, although we didn't call it that back then, to understand some of those principles and how to guide enterprises that don't look like Amazon on what that journey might look like for them. I'm also a committer on a bunch of open-source projects and love talking about this, and I wrote a book — I guess it's not recent anymore, it's about a year old — Microservices for Java Developers. You can get a free electronic copy on the Red Hat developers site, and downstairs at the booth I think they have hard copies.

Now, everyone's talking about microservices, and we've been talking about it for the last three years. We've reached the point where we talk about it so much that everyone thinks if we just check the boxes — break things up smaller, have a single database per microservice, whatever — then we'll be doing microservices. The reality is that it's more of an evolution, a journey that teams and the organization, with its structure and its culture, take in order to go faster. That's the ultimate goal — not doing microservices or SOA or any particular technology. We developers — how many developers in the room? yeah — do get distracted by all the really cool technology coming out, but this journey is going to be organization-specific. Netflix did it their way, which is different from how Amazon did it, which is different from how Google did it, and our enterprise companies are going to have their own challenges going down this path. Like I said — and excuse the cough — microservices are about optimizing for speed. How do we go faster? How do we do faster deployments? How do we get feedback loops to understand whether what we delivered into production is what we think it is and is worthwhile?
We want to reduce the amount of time between starting a project and delivering it, and get whatever value we can out of it — we want to optimize for speed. How does your company go fast? How does any company go fast? There are layers and layers of complications there, but putting those aside — those are different talks — managing the dependencies between the different components in your system, and reducing those dependencies where you can, is the key. That's easy to say, but it is the key to being more autonomous, making decisions independently, and going faster. Data is one of these major dependencies: how systems share data, how they interpret data. There are all kinds of intricate, implicit dependencies between teams and systems, and when a dependency is implicit you don't really see it, so you're not motivated to go solve for it.

Before we start talking about these issues, I want to define what data is anyway. What are we talking about when we say data? What helps me understand it is to think of it as a conversation. When humans talk to each other, we can disambiguate concepts on the fly. The data I'm talking about is a conversation with another human, but through the computer: we explain these concepts to the computer first, and then hopefully some other human reads them back and is able to interpret them. The trouble is that the computer doesn't have the extra context it would need to intelligently understand what the purpose of that data is.

I can illustrate that quickly with a simple exercise: try to describe what a thing is — one of the concepts we want to explain to the computer so other people can come back and see it. I'll use the concept of a book; maybe we're a library or a retailer, and we want to model this idea of a book. How do you describe what one book is? Title, cover, pages? You can very quickly get into trouble. You describe a book as one thing, but in my example I've only written one book and there are multiple copies of it — how do we describe that in the system? Other people have written multiple books; each of those is a book too, and so are all their copies. Sometimes people write books so big they're broken out into smaller volumes — is each volume a book, or is the whole thing the book? A newspaper has a cover and words — is that a book? Where do we draw that line? How we describe it really depends on who's asking the question and what context is associated with the question. In this case we could easily say that a book is described differently in different parts of our system: a book checkout and ordering system might want to know about every single copy of the book, with a lot more metadata attached, than, say, the title search engine needs.
The title search engine really only cares about titles, and maybe the recommendation engine doesn't care about titles or authors per se, but about more abstract metadata — how different categories or topics relate to each other, and so on. So different boundaries start to naturally arise depending on who is talking about these concepts, even though they're shared concepts, potentially shared across all of the systems. That's where practices like domain-driven design come into the picture. I used a very simple example of a book, but the domains in our enterprise companies are far more complex, far more ambiguous, sometimes even conflicting, and domain-driven design and the patterns and practices around it came about to help tackle complexity in the domain — which is exactly where we start to find these data problems.

You're probably thinking: who talks about domain-driven design? LinkedIn and these internet companies don't talk about domain-driven design per se. To me the answer is pretty simple: going on Twitter and posting a tweet is simple — you just go on there and post; updating your LinkedIn profile is simple stuff. Our businesses — our financial services companies, our healthcare companies, our insurance companies, retail, and so on — are far more complex, and they've been around a lot longer. It's not as easy as what some of the internet startups had to do. Now, the internet startups did have to deal with data problems at scale: I post a tweet, and now it has to be shown to essentially 500 million people, and sorting and linking and organizing that gets really hard. Our enterprise companies are going to face challenges at that kind of scale too, but we should not overlook the complexity inherent in the domain; we should solve for that as well.

So say we do domain-driven design — concepts like bounded contexts, context maps, and aggregates, all the patterns that have come up to help solve these challenges — and we build out our modular system, and then we put everything into a database. Over time, as we make changes to the system, those boundaries start to erode, and the database schema erodes with them. The inclination, at least starting around 2005 when we started going to NoSQL and all these new data stores, was to just throw away the relational database. But I'm absolutely confident that the way we were doing things with SQL and relational databases was really powerful — dare I say even awesome. SQL and normalized data structures let you do very powerful things: queries against your data, even ad-hoc queries you didn't think about when you started writing that data, complex joins and relationships that weren't originally thought of ahead of time. That's pretty powerful stuff. In the alternative world, the NoSQL world, you have to think very, very hard about what your queries are going to be ahead of time and hope that stays true for the life of your application. So SQL and normalization are actually really good stuff.
The second thing these databases give us is ACID — an abstraction over transactional behavior. Looking at atomicity for a second: if the database and the framework underneath can abstract away really hard problems like partial failure and partially applied transactions, that's powerful; developers don't have to worry about it in their code. Isolation: when you have multiple applications, or multiple threads in your application, talking to the database, the database says "don't worry about concurrency" — well, don't worry as much — "we'll make everything look like it happened in one nice serial line." That's pretty powerful. Durability: making sure we don't lose anything. And the C — does anybody know what the C in ACID stands for? Right, but it's not the same consistency as in the CAP theorem, and there's a lot of angst about what that C really does for you; we'll come back to the CAP theorem in a second. All of these things make the developer's life a lot easier, even comfortable in some ways, so for me the C stands for comfortability. Stick with these conveniences as long as you can — they are powerful abstractions.

But as you start to grow, you start to hit the inherent limits around them — and I mean inherent in the context of a traditional relational SQL database; there's newer technology, the NewSQL stuff, that might get around some of this, but we're not talking about that right now. We evolve, we grow, and we want to architect our applications so that they can change faster, so we introduce isolation and autonomy. What we're really saying when we move away from this model is: thank you, database, you've been awesome for the last forty years or whatever it's been, you're pretty much bulletproof, but we've got it from here. And then we come up with things like "microservices should each have their own database." Now we start going down the path of the problems we have to solve in this model, because we think that if we just isolate things, isolate the data, and let a service own its schema, we can change a lot faster — and there's some merit to that — but we're also now building a full-fledged, data-driven distributed system, and distributed systems are not easy. They're not trivial, even though we might think, what's the big deal — A calls B and B might call C, we have some calls over the network. That word, "the network," is a big problem, and unless we take the network seriously we're going to have some bad experiences. The network looks nice and simple on a slide, but what it really looks like is a mess, and these are asynchronous networks — that is key to understanding this. What I mean by asynchronous is that things talking to each other on the network don't share the same concept of time. When I send a message it's not going to get there immediately; it has to go over a wire or fiber, so there's at least a speed-of-light delay. Next, we run on commodity hardware, so things can fail. And the way our IP networks work is based on packet routing and queuing, so we can get arbitrary delays, randomly, with no explanation — and these delays can look like failures, and failures can look like delays. You can't really detect which is which.
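To make that ambiguity concrete, here is a minimal sketch of a remote call with a timeout; the service URL and the 500 ms budget are placeholders invented for illustration, not anything from the talk. When the timeout fires, the caller cannot tell a lost request from a lost response, a crashed peer, or a merely slow one.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.net.http.HttpTimeoutException;
import java.time.Duration;

// Sketch: why a call over an asynchronous network is ambiguous. The URL below is a
// made-up placeholder for some downstream service.
public class AmbiguousCall {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://shipping-service.local/addresses/42"))
            .timeout(Duration.ofMillis(500))   // our guess at "too long"
            .build();
        try {
            HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println("got " + response.statusCode());
        } catch (HttpTimeoutException e) {
            // Delay or failure? Did the other side process the request or not?
            // The network alone cannot answer that question for us.
            System.out.println("no answer within 500 ms: " + e.getMessage());
        }
    }
}
```

That single ambiguity is why the rest of the talk treats the network as a first-class design concern rather than an implementation detail.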
This underlying premise drives everything that follows: if we're going to build these data-driven systems, we have to take it very seriously. And following on from that, go check out this paper — it was written in 2005, so not much of what I'm saying is actually new; hopefully I can just put it together in a nice way, but we've been talking about distributed systems for a long time. The idea is that the data inside our services should be treated differently than the data outside our services. Pat Helland wrote this paper — he was at Microsoft at the time — and the point is that when services communicate with each other and data leaves a service, whether because I published a message or an event, or because someone queried the service and got data back, as soon as that data leaves the service we have this very distinct notion of "then": the data was current back then, not right now. There is latency when we talk to our services; by the time I get the data, it could have changed. So there's some inherent staleness built into the system. The data inside the service, though — inside one of our bounded contexts, with its ubiquitous language, in domain-driven design parlance — can be and should be kept at a certain level of consistency, and we'll come to that; inside the service we know the data is "now." So we have "then" and we have "now," and this inherent staleness, and when we look at how to build applications on top of this foundational architecture of the network, we need to build these facts into our systems as first-class citizens.

Some of the application-level problems we experience because of this — sharing data across these boundaries — I'll try to illustrate through a couple of examples. I do run into these; they're not totally contrived, though some are exaggerated to make a point. Say we want to do something in our application or service like update an address. Updating an address sounds simple: we just go to the database and change it. But "address" might be a concept implemented in other parts of the system, in other boundaries, other bounded contexts. Maybe this is a customer-profile service, but the shipping service will want to know when a customer updates their address — what if something is in flight that needs to be rerouted? The recommendation engine or the advertising engine displays certain ads depending on where you are, and the tax calculation service calculates tax differently depending on where you are. These systems want to know when your address changes, so we need to somehow tell them. But if we naively publish this information without taking into account what consistency guarantees we're going to get when we do it, we end up with problems.
I'm hoping a lot of people recognize this: what happens if we commit and then fail before we even publish downstream? We end up in an inconsistent state. Or what if we publish first and then fail and never commit? Again, inconsistencies. We could try to coordinate across the boundaries — I think we've historically looked at the problem through that lens — but that introduces its own set of challenges. Two-phase commit is perfectly fine inside your own boundary, where, as that earlier premise showed, inside a single service we have "now." But outside a service we have "then," so we shouldn't try to do two-phase commit or XA transactions across our boundaries; that gets really hairy — come ask me later and I can go into it in more detail.

Then we do things like: we'll just call all these services individually and do a dual write, a triple write, whatever, and if some of those calls fail we'll just do compensating transactions. Now you end up having to store that state, the transitions of that state, and the compensating actions — because what if you fail partway through? You need to come back and properly execute the compensation, so you're building something like a transaction manager. Even worse, what if the downstream systems weren't actually built for this kind of thing? What we're doing here is saying, "let me update you — the customer profile got updated, the address changed... oh, I couldn't reach this one, so now I need to roll back, I changed my mind." If you look at this through the lens of a traditional database, what we've built is read uncommitted — the lowest isolation level, one we basically never use. We can get into a situation where the address change happened, was visible, people made decisions based on it, and then we said "changed my mind, roll back." That might be okay, but it is an often unknown or overlooked problem you can run into here.
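As a rough sketch of the dual-write problem just described — two systems of record updated one after the other, with no transaction spanning them — consider something like the following; the class, table, and topic names are hypothetical stand-ins, not anything from the talk.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Hypothetical sketch of the dual-write anti-pattern: the local commit and the
// downstream publish are two separate, non-atomic operations.
public class CustomerProfileService {

    private final Connection db;                      // local service database (auto-commit disabled)
    private final KafkaProducer<String, String> bus;  // message broker producer

    public CustomerProfileService(Connection db, KafkaProducer<String, String> bus) {
        this.db = db;
        this.bus = bus;
    }

    public void updateAddress(String customerId, String newAddress) throws Exception {
        // 1) Commit the change in our own database ("now", inside the boundary).
        try (PreparedStatement ps =
                 db.prepareStatement("UPDATE customers SET address = ? WHERE id = ?")) {
            ps.setString(1, newAddress);
            ps.setString(2, customerId);
            ps.executeUpdate();
        }
        db.commit();

        // <-- Failure window #1: crash here and the address changed locally, but
        //     shipping, tax, and ads never hear about it.

        // 2) Publish the change for shipping, tax, recommendations, etc.
        bus.send(new ProducerRecord<>("customer-address-changed", customerId, newAddress));
    }
}
```

Swapping the two steps doesn't fix anything; it just changes which side ends up stale, and it opens exactly the "read uncommitted across boundaries" situation described above, where downstream services act on a change that later gets rolled back.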
The next problem I see is this: we can do queries against a service and potentially get back an unbounded list. Maybe we query for certain types of customers, and that list can be crazy huge — we don't really know. When we get that list we want to iterate through it and enrich the data somehow, so for each element in the list we might make calls out to a downstream service to do some sort of enrichment. That might be okay if your downstream service can handle 10,000 transactions a second forever, but that's typically not a good design pattern. So we end up trying to design bulk APIs, bulk interfaces: we'll just get a giant chunk of hats for all the cats, or whatever. But then we actually need to filter some of these out — we don't want all of it, we just want hats for cats, except these cats — and then we really only care about certain types of hats for certain types of cats, and it gets really fine-grained, and every API and every service implements this totally differently, and pagination totally differently, and it becomes a big mess. Ultimately you end up with "just run this SQL for me" — and we're trying not to do that. You might also solve this with caching: we'll just pre-cache or cache some of the hats for these cats, and now you have to deal with cache invalidation problems.

What we're really trying to do across these services, through updates and reads, is achieve some type of consistency — but we started off by saying that we have to deal with failure as a first-class problem in our architecture. This sounds like distributed-systems theory and the CAP theorem — a couple of people know it. To be very terse about it, the CAP theorem says you have C, consistency — which is different from consistency in ACID — A, availability, and P, partition tolerance, and out of those three you can pick only two: consistency and availability, or partition tolerance and availability, or partition tolerance and consistency. But really, you can't opt out of P. Partitions are what happen when the network behaves asynchronously like we described earlier — things look like they fail, or they do fail, or they partition — so you get P whether you like it or not, and you have to pick C or A. So the CAP theorem says: pick strict consistency or pick availability. The problem with the CAP theorem is that it describes consistency and availability in the most strict definitions of those words. For consistency, the CAP theorem talks about linearizability, the most strict version of consistency — but consistency models aren't just the most strict one; there are lots and lots of different consistency models, and in the longer version of the slides I explain a bit more about what each of them is. There are shades of gray. Strict consistency basically says that if I make an update, that update is visible to everyone in my cluster immediately, with no delay. Sequential consistency says we'll have a nice ordering — you might not see changes right away, there might be a delay, but you see everything in order. Causal consistency says you probably won't see everything in a total order, but the things that are related will be ordered: for example, a comment on a blog post — I want to see that the blog post was created and then the comments that came afterward, not the reverse order; there's a causal relationship between those events, but across all blog posts I don't really care. And on and on, until you get down to eventual consistency, which really just means I can read anything, whatever. Funnily enough — if any of you went to the baseball game — Doug Terry, who I think is at Amazon now, wrote a paper describing these different consistency models through baseball, making the abstract concepts more concrete.
If you're the umpire, you might need strict consistency — you probably do. If I'm the scorekeeper, I might be able to get away with something like read-my-writes: I just need to know the last write I made so I can increment the score, and everybody else might see the score differently. If you're getting radio updates, you want a version of the score that only moves forward over time — you don't want to see things go backwards — so you might just need monotonic reads. And if I'm a sportswriter writing my article tomorrow or the next day, eventual consistency might be perfectly fine. So there are different classes of consistency, and maybe we can take advantage of some of these shades as we start to build out these systems.

Here's an example of updating the customer profile using, in this case, a sequentially consistent queue: we make our updates and record them in this queue, and because it's sequentially consistent we'll see everything in order, just a little bit later than it actually happened. But as soon as data leaves the service it's a little bit later anyway, so why don't we just make that explicit? The systems downstream will see all of these events in the right order, and they can process them and update their local stores, their local databases, the way they want. So now, when we talk about microservices having their own databases — taking advantage of their own schemas, their own I/O characteristics — they can deal with these data changes across boundaries in their own time and in the way they want, on their own databases.

Now think about what I just described in terms of a relational database. Regular databases have this concept too: they have views, they have materialized views, and the database is responsible for keeping all of that up to date for you, and it works really nicely. When we start to distribute this out, we're building materialized views ourselves — we're building a database across our applications. For these types of problems there are interim steps; this isn't "go from what you have today to event-driven systems tomorrow." But it isn't some totally crazy idea either: this is what the internet companies did to build their data systems at scale, and some of them even open-sourced some of their technology. Building a system with more relaxed consistency, in a stream-based model, was done at Yelp, for example. Yelp built all this around MySQL — they were able to turn the relational database into a set of streams, and downstream systems could react to the changes coming from those streams. The problem is that they also built a lot of other stuff around it, so if you're going to use it you kind of have to use all of it, and you're buying into MySQL, and operationally there's a lot of other machinery. LinkedIn did something similar — I think with MySQL and Postgres — and Zendesk did something similar as well.
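A downstream service consuming such an ordered change stream and folding it into its own local view might look roughly like this sketch; it uses Kafka as one concrete example of an ordered queue (the talk introduces Kafka shortly), and the topic name, group id, and in-memory map are illustrative stand-ins for whatever store the service really owns.

```java
import java.time.Duration;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

// Sketch: a shipping service maintaining its own materialized view of customer
// addresses by replaying an ordered change stream, instead of querying the
// customer-profile service on every request.
public class AddressViewUpdater {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "shipping-address-view");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        Map<String, String> addressByCustomer = new ConcurrentHashMap<>(); // the local "view"

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("customer-address-changed"));
            while (true) {
                // Events arrive a little after the fact -- "then", not "now" -- but
                // within a partition they arrive in order.
                for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofSeconds(1))) {
                    addressByCustomer.put(rec.key(), rec.value()); // upsert into the local view
                }
            }
        }
    }
}
```

The view is always slightly behind, but it never sees updates out of order, which is the trade the talk is describing when it says "make the staleness explicit."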
But what I was really interested in was whether we could do this in the open-source communities, in a more open-source way. Those companies did throw their code over the wall as open source, but can we build a community around this and make it much more modular, instead of "if you want to use this, use everything"? That's what we did with Debezium — debezium.io. Debezium is a very simple system that lets us build out different levels of consistency across our boundaries, across our services, and properly deal with the problems of having a database per service.

What Debezium is is a change data capture system. That's not new — CDC has been around for a long time — but it's an open-source change data capture system that takes the changes that happen in your database and turns them into a sequentially consistent queue, a stream, and does it in a modular way so we can support lots of different databases and different ways of running it. The default, canonical way of using Debezium today uses some specific technology that I'll talk about on the next slide, but if you didn't want to use that technology, you could do it a different way, your own way.

Debezium is a set of database connectors. We point a connector at a database, and what it does is read the database's transaction log. The database is already doing this — that's how the database implements its own replication: it keeps change-event logs, and that's how it arrives at the current state and how replication works. Debezium basically acts as a replication slave, a follower; it reads the database's transaction log and turns it into a concrete stream of events that it then publishes out to some sort of queue. We have support for MySQL today — that's been there for quite a while — we have recent support for Postgres, we have MongoDB, and I think Oracle and maybe the Microsoft one are next on our plate.

Specifically, the canonical way of doing it is that we've created these modular connectors — say, a MySQL connector — and you can deploy that connector in your Java apps, or anything really, if you just want to take the connector. But if you want to create a data pipeline, the canonical deployment is into a framework called Kafka Connect. How many people have heard of Kafka? Okay. How many have heard of Kafka Connect? Cool. Kafka Connect is a framework for building data systems that lets you ingest data from databases or files or whatever source into Kafka, in a highly available, reliable way, and also take data out — there are connectors for consuming from Kafka and putting the data into something like Hadoop or some other sink application. Debezium uses Kafka Connect as a source: we point Debezium at these databases and the framework takes care of pulling down the transaction logs, parsing them, and putting each record into Kafka, so each table ends up turning into a Kafka stream. Everyone with me so far?
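For a feel of what "pointing Debezium at a database" through Kafka Connect looks like, here is a sketch that registers a MySQL connector via Kafka Connect's REST interface. The host names, credentials, and values are illustrative, and the property names follow my reading of the Debezium MySQL connector documentation from that era, so double-check them against the current docs rather than treating this as exact.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch: registering a Debezium MySQL connector with Kafka Connect's REST API.
public class RegisterConnector {
    public static void main(String[] args) throws Exception {
        String connector = """
            {
              "name": "inventory-connector",
              "config": {
                "connector.class": "io.debezium.connector.mysql.MySqlConnector",
                "database.hostname": "mysql",
                "database.port": "3306",
                "database.user": "debezium",
                "database.password": "dbz",
                "database.server.id": "184054",
                "database.server.name": "dbserver1",
                "database.whitelist": "inventory",
                "database.history.kafka.bootstrap.servers": "kafka:9092",
                "database.history.kafka.topic": "schema-changes.inventory"
              }
            }""";

        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:8083/connectors"))   // Kafka Connect REST endpoint
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(connector))
            .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

Once the connector is registered, Kafka Connect does the rest: it snapshots the tables, tails the transaction log, and emits one topic per table, which is exactly what the demo below walks through interactively.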
So now we can revisit one of the examples we saw earlier — this approach helps in a lot of them. Say we just want to pre-cache, or cache, a lot of those hats for cats: we can use Debezium as a streaming platform so that whenever a change happens in that database, we invalidate the cache or update the cache. We know the stream is sequentially consistent — we'll get everything in the right order — and Debezium handles those updates for us. And actually — I'm not a product manager, so I can't say how this will end up being productized, or if it will — we do have a data federation product today that can help with some of the interim steps of getting to a full-fledged event-driven system, and I think it will use Debezium as part of its implementation for building these types of materialized views of services.

So what we're saying is: use domain-driven design to get boundaries around your services; focus on the transactionality and the "now" inside those services; and have those services communicate by sending events to each other across these boundaries, in a sequentially consistent or some other consistency model. In a full-fledged microservices environment, what you end up with is a set of transactional services whose data is published in some way — as events — onto a stream, which you can then materialize however you want. For the queries: Netflix talks about their API gateway and their API implementation, and what they're basically doing is calling all of these back-end services and doing lots and lots of joins in application code. Doing it this way instead, where all those joins are pre-computed and constantly being updated so that the queries against these views are much simpler — flat select statements — is, in some use cases, much more advantageous. So you end up with data services and transactional services, combined through this event-driven architecture. I do have demos of that whole thing, but I also want to point out that Debezium isn't just some science project; there are companies out there using it, some of them public, some not — even though it's just an open-source project, not a Red Hat product yet. WePay uses Debezium for exactly this purpose, and they wrote a blog post about it — go to the slides for the link — going into great detail about how they use Debezium with MySQL to build their microservices architecture and solve some of these data problems.

I've got five minutes, so let me try to do the demo, and then if there's time we'll do questions — or we can have questions outside; I'm sorry, there's just so much to talk about. Can you see this okay in the back? Maybe I can make it a little bigger. I'm going to show Debezium, but to show Debezium I'm also going to show Kafka, because Kafka is part of this — I'm using Kafka as the queue — and to use Kafka we need ZooKeeper. So we're going to start up these pieces. ZooKeeper came up; now we're going to run Kafka. We're running this on Docker, so hopefully it should just work — and Kafka came up. Now I'm going to create a MySQL database; cross your fingers, demos like to go sideways.
All right, cool. Now we'll create a client to the database using the MySQL command line — it's at the bottom. The Docker image has a pre-populated database; we're going to use the inventory database and show the tables. We just have four tables in here, with dummy data to illustrate the point. Let's select everything from the customers table and see what records we have.

Now we're going to start Kafka Connect and Debezium. Kafka is running, so we start Kafka Connect and point Debezium at the database we just started — we can see there's some data in it. Kafka Connect is up; we don't have any connectors yet, no Debezium connectors, so we're going to create one right now. Let's look at what a connector definition looks like: it's JSON. It points at the database and tells it to use the MySQL connector — you can see at the top the host, user, password, port, and all that — and we can give it whitelists or blacklists of databases and tables to filter on. So we tell Kafka Connect to create this MySQL Debezium connector — two minutes left — and we can see it's doing something: it's connecting to MySQL, reading the transaction logs, parsing them, and putting each change event into Kafka as an event. Now we should see that our connector is there — cool. I query for the connector, it comes back, everything looks fine.

We'll leave Kafka Connect running and navigate away from its logs — if I scroll up you can see it's still running — and log into Kafka and list the topics we have. For each one of those tables we have a Kafka topic: customers, orders, products, products on hand. If I subscribe to the customers topic, we see this gigantic blob of JSON — these events taken from the transaction log have been put into Kafka as JSON. If you look at the JSON, it's basically the schema of our data and the payload of the data — that's what's in these messages, and that's why there's so much of it — and furthermore it's the before schema and the after schema, the before value and the after value, so we see the full change events. If you want deltas you compute them yourself, but you can see what the change was and, more importantly, what the schema is, because we need to somehow support schema evolution in this too.

I think I'm about out of time, but if I start making changes — say we update a customer's first name from Anne to Anne Marie — watch quickly as the JSON scrolls, because we're going to see this update happen. It did change: here's the before record, where the name was Anne, and the after record, where it's been changed to Anne Marie, and the schema has been logged in there as well. We can look in the database again and see that, yep, it was changed, and our downstream systems now have this in Kafka — and Kafka is replicated and partitioned and all of that.
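Reading those change events back out is ordinary JSON handling. The sketch below uses Jackson and assumes the envelope layout described in the demo — a payload carrying the before state, the after state, and an operation code (c for create, u for update, d for delete) — with the handling logic itself being hypothetical.

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

// Sketch: unpacking a Debezium-style change event. Each record value carries a
// "schema" and a "payload"; the payload holds the row state before and after the
// change plus an operation code.
public class ChangeEventHandler {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    public void handle(String eventJson) throws Exception {
        JsonNode payload = MAPPER.readTree(eventJson).path("payload");
        JsonNode before = payload.path("before");   // row state before the change (null on insert)
        JsonNode after  = payload.path("after");    // row state after the change (null on delete)
        String op = payload.path("op").asText();

        switch (op) {
            case "c", "u" -> System.out.println("upsert row: " + after);
            case "d"      -> System.out.println("delete row keyed by: " + before);
            default       -> System.out.println("other event (e.g. snapshot read): " + op);
        }
    }
}
```

A consumer along these lines is what would sit behind the cache-invalidation and materialized-view services mentioned earlier.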
If we do a delete, we see something similar: there's a before that has a value and an after that is null — I can't see it very well up here, but that null after is what indicates a delete. Furthermore, we also send a message for this key — 1004, the record we deleted — to mark it as deleted, because Kafka has this feature called log compaction. A Kafka topic is normally just a sliding window of events — seven days, thirty days, whatever — and at the end of that window, anything that doesn't fit is dropped. But Kafka can also do log compaction, which says: for every unique key, keep the most recent version, so you have the entire data set and you don't start losing data. When we do deletes, we can tell Kafka that when it does its log compaction this record has been deleted. It's a tombstone message to Kafka that says: when you see this, get rid of any records with key 1004 as well, because they've been deleted from the data set.

There's a lot more stuff that I won't be able to get to, and I'm already two minutes over. Thank you guys so much for coming — I hope you enjoy the rest of the summit. [Applause]
Info
Channel: Red Hat Summit
Views: 154,431
Rating: 4.705483 out of 5
Keywords: Red Hat Summit
Id: MrV0DqTqpFU
Length: 46min 5sec (2765 seconds)
Published: Mon May 15 2017