Debezium - Capturing Data the Instant it Happens (with Gunnar Morling)

Captions
Databases are great at storing data: storing what happened, when it happened, who did it, that kind of thing. Traditionally they're not so great at telling you what's happening right now. If you've got breaking news, if your user wants to order a taxi right now, if your customer is trying to send you money right now but that process is failing - those are really useful things to know as they happen. But most databases aren't built with liveness in mind. They're generally built on a theory that says: first you store the data, later you can query it out. Very useful, but missing that piece of newness. And that's where today's topic comes in.

We're going to take a look at Debezium. It's a tool that can tap into a huge number of existing databases, capture that sense of what's happening to them right now, and turn it into live notifications, real-time analytics, or even just ferrying the data to other systems in a hurry. You can use it to go from Postgres to Kafka, from Oracle to Kinesis, or from MySQL to Google Pub/Sub - all kinds of options. And one of the things I think makes it particularly interesting is that it's minimally invasive: you can have an existing data model and an existing database - maybe even a legacy database you can't touch - and still use Debezium to tap into that stream of data and make it valuable.

So I think it's worth learning about, and to learn about it we've got an expert: Gunnar Morling. He's the former project lead for Debezium, and he's now a Debezium and Flink expert over at Decodable. He's going to take us through what Debezium can do for you, what it demands of you, and how the whole architecture fits together. So let's get stuck in. I'm your host, Chris Jenkins, this is Developer Voices, and today's voice is Gunnar Morling.

[Music]

Gunnar, how's it going?

Hey Chris, all is good, thank you so much for having me. Really looking forward to our conversation.

Me too, me too. Your name - well, the name of your project - was dropped in a recent podcast we had, on the side, talking about Debezium, so I thought: let's get you in for a proper deep dive into what the heck it is.

Okay, yeah, that sounds good. Let's do it.

You've probably got as good qualifications as anyone on the planet for this - you were the project lead for Debezium for a number of years, right?

Right. While I was at Red Hat I was the project lead for about five years or so, starting in 2017 all the way up to last year. I led the project and experienced quite a few interesting things during that time.

We'll get into all of that.

Maybe that sounds more mysterious than it is, but you know.

Sometimes programming isn't a glamorous job; we have to do everything we can to evoke some mystique.

Right, exactly. I could tell you about things... I mean, I cannot talk about everything, but you know. [Laughter]

Okay, so let's get more brass tacks about this. First up: what is Debezium, and why does it need to exist?

Well, what it is: it's a solution for change data capture. In the simplest terms, it's about getting notifications out of your database - Postgres or MySQL or SQL Server or quite a few more - whenever something has changed: something gets inserted, something gets updated, or something gets deleted. You want to be notified about this event and then make sense of that information. So you might want to update or invalidate a cache, or put the data into a search index. In a nutshell, it's about reacting to changes in the database and telling the outside world about them.
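To make that concrete, here is a minimal sketch of how a Debezium source connector is typically registered when it runs on Kafka Connect, posted as JSON to the Connect REST API. The property names follow Debezium's Postgres connector documentation; the hostname, credentials, database name and topic prefix are placeholder values for illustration only.

```json
{
  "name": "inventory-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres.example.internal",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "secret",
    "database.dbname": "inventory",
    "topic.prefix": "inventory",
    "plugin.name": "pgoutput"
  }
}
```

Once registered, a connector like this snapshots the tables it is configured to capture and then streams subsequent changes into Kafka topics named after those tables.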
Okay, first dumb question: why isn't that just a database trigger?

I mean, yes, triggers can actually be one way of implementing change data capture, and we can talk about the pros and cons. One thing to keep in mind is that a trigger sits in the write path of your transactions, so there's a performance implication. Whereas what Debezium does is what's called log-based change data capture: an asynchronous process which essentially tails the transaction log of the database. It's very reliable, very robust, it doesn't sit in the write path, and then of course by going through something like Kafka it gives you connectivity with all kinds of sink systems. And I guess my question back would be: how would you do this with a trigger? How would you send a REST request, or send a message to Kafka, from there? It would be a bit of a stretch.

Yeah, I asked the question in a provoking way...

And I totally fell for it.

You did, you did. Maybe I'll throw in: the write path is a big thing - it adds to how long it takes a transaction to complete. The other one is that triggers really work when you're staying within the database; when you need to get outside, that's a big issue.

Exactly right.

But I'm cheating - I can't throw you questions and then half answer them for you.

Okay, I'll try better next time.

We'll ramp it up as we go. So, why are people trying to capture change as it happens? Because that doesn't seem like a normal thing to do in a relational database model.

That's a good one. There are tons of motivations, and we could talk about use cases all day long, but the most common one, I would say, is just replication in the widest sense. Not necessarily if you want to replicate your data from one Postgres instance to another Postgres instance - then I guess you would just use the tools which come with your database - but maybe you want to take your data out of your operational database into a data warehouse like Snowflake, or a real-time analytics store like Apache Pinot, or a search index. You want to cross vendor boundaries. Or maybe you have a production Oracle database and you want to put that data into a Postgres instance for offline analysis. Replicating across all those vendor and system boundaries is definitely a super common use case: taking data from one database to another database, or to a data warehouse of some sort. So that's replication in the widest sense.

But then there are cache updates and cache invalidations. Maybe you would like to have a view of your data which sits close to the user, maybe with some sort of denormalized representation of your data model. You need to keep this in sync with your operational database, and reacting to those change events is the perfect way of keeping that cache in sync.

Do you know, that reminds me of a project I worked on that had exactly this issue, and we ended up doing SQL polling, which is not ideal. There was a mechanism in Postgres, I remember, that let you watch a table and get notifications, but if your connection to that session dropped, it lost your place.
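For contrast, this is roughly what the trigger-based approach discussed above can look like in Postgres, using a trigger that publishes each row change with pg_notify (the same watch-a-table mechanism that comes up next). It's a hedged sketch with invented table and channel names, and it illustrates the drawbacks mentioned: the trigger runs synchronously inside the write path of every transaction, and a client that isn't connected and listening at that moment simply misses the notification.

```sql
-- A minimal sketch of trigger-based change capture (assumed table/channel names).
CREATE OR REPLACE FUNCTION notify_customer_change() RETURNS trigger AS $$
BEGIN
  -- Serialise the new row and publish it on a channel; this runs synchronously
  -- as part of the writing transaction.
  PERFORM pg_notify('customer_changes', row_to_json(NEW)::text);
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER customer_change_trigger
AFTER INSERT OR UPDATE ON customers
FOR EACH ROW EXECUTE FUNCTION notify_customer_change();

-- A client subscribes with LISTEN, but only receives events while connected:
LISTEN customer_changes;
```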
Yeah, I think you're referring to the LISTEN/NOTIFY API.

LISTEN/NOTIFY - yes, I am.

Exactly. It kind of lets you do that, but as you say, if your client isn't running, you won't be notified about anything which happens during that time. Whereas with this log-based approach it's fully reliable, fully safe. If Debezium restarts after such a downtime, it will continue to read the transaction log from exactly the point in time where it left off. It stores its position in the transaction log - how far did I get - and if it gets restarted, it continues from there and you won't miss any change.

Right. So maybe we should quickly dive into that, to make sure everyone's on the same page. A modern database, every time it makes a change, writes that data to a change log - partly for recovery, but also to hook in standby databases, right?

Exactly. Those are exactly the two reasons for a database to have a transaction log. One is transaction recovery: if something crashes, or the server loses power while a transaction is running, it will be able to go back to a consistent state after restarting. And then, yes, you can also use those transaction logs to keep the replicas in a database cluster in sync. In that sense Debezium is acting like a kind of replication client: it gets hold of this replication stream of data from the transaction log, but keeps its own view of the data, you could say.

Does it do that by sneaking around the side and reading the same binary file, or is there some API that, say, Postgres or MySQL provides that you hook into?

It depends a lot between the different databases Debezium supports. Unfortunately there is no standardized API, no standardized way, so that we could implement Debezium once and be done. Instead it's a bespoke effort for each and every kind of database. In the case of MySQL and Postgres, yes, there are essentially remote protocols: you establish a connection to the database and it gives you a callback, it notifies you whenever there's a new event in the transaction log, so you can just be a remote client. For other databases - let's say SQL Server - the database gives you a view of the transaction log in the form of what they call CDC tables, change data capture tables, and in that case Debezium goes to those tables and queries them. For Cassandra, again, we need a component which runs on the actual Cassandra host, so we can go to the transaction log files on the file system and get hold of them. So it differs between all the connectors, but the format which Debezium exposes is one unified, abstract format, so users don't have to care about all those nitty-gritty details.

So you're in one of those situations: "this is horrible, every vendor has a different standard, we shall create one new standard" - and then do all the work.

Yes, exactly - that's exactly the challenge. And there's a bit of a long tail of connectors, of course. Every now and then people come to the Debezium community and say, "hey, we would like to have support for Db2 on the mainframe, can you make it happen?"
And then the project needs to weigh that up: is this a common thing, how feasible is it in the case of Db2 on the mainframe, and how would we even go and test it? Nobody in the Debezium community has a mainframe under their desk, so how do you even develop that kind of thing? That's the challenge. But then, on the upside, the project does this sort of unification effort. As I mentioned, the format which it exposes to users is abstract and generic, and I would say it's kind of a de facto standard which it has managed to establish.

It's funny how in our industry it's only the de facto standards that actually seem to work, and the ones where three large vendors get together and announce it - apart from SQL, maybe - all seem to flounder.

Right, yeah. There were some initiatives around having a CDC standard, but they didn't really go anywhere so far. With Debezium it kind of naturally happened: vendors would just announce their support for the Debezium change event format. ScyllaDB, for instance, developed the CDC connector for their database using the Debezium connector framework, and they also support the Debezium change event format. And then all kinds of consumers of those change events also support this format - something like Flink SQL, or Materialize, natively support the Debezium change event format. So this de facto standard thing just kind of happened.

De facto, I guess. Okay, give me an idea of what that standard looks like, then. If I do an update on a user's email address in Postgres, what's going to happen on the Debezium side?

In terms of the event format, the schema resembles the schema of your tables. Let's say you have this customer table with your email address column and five other columns - then your change events would have those six fields representing those columns. By default, that is; you can override, configure, and filter everything, but by default it would be that. And then, in the case of an update, you would have the previous state of the row and the new state of the row. So you would see: okay, email changed from chris@example.com to chris.jenkins@example.com - you see old and new value. Again, depending on how you have configured your transaction log - maybe you want to save some space there, so you don't have the old value - but that's the gist of it. That's the actual structure: it resembles your table structure. And then there's a large chunk of metadata which is also part of it: timestamps, what kind of event it is, the position in the transaction log - the offset. In the case of MySQL, for instance, it can also tell you what query triggered the change, so you can identify that.

So you could use that for analytics as well - how your database is being used.

Yes, exactly.

That's interesting, I didn't know that.
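Here is roughly what such an update event looks like on the wire. This is an abridged, hand-written illustration of the Debezium change event envelope - the real payload also carries a schema section and more source metadata, and exact fields vary by connector and version - but the before/after states, the source block, the operation code and the timestamp are the parts described above.

```json
{
  "before": { "id": 42, "first_name": "Chris", "email": "chris@example.com" },
  "after":  { "id": 42, "first_name": "Chris", "email": "chris.jenkins@example.com" },
  "source": {
    "connector": "postgresql",
    "db": "inventory",
    "schema": "public",
    "table": "customers",
    "lsn": 33887648,
    "txId": 565
  },
  "op": "u",
  "ts_ms": 1698838400123
}
```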
And in that format, if I've got ten columns in my table and I just update the email address, do I get the whole of the previous row and the whole of the new row? And do I get something saying it was just this one column that changed?

You would, by default, get the whole row, for most of the connectors. For MongoDB, and I think Cassandra as well, you actually just get the modified columns, but otherwise you would get at least the whole new row. For Postgres, for instance, people typically don't expose the entire old row, because it would just take too much space in the transaction log.

Okay, but would I also be told which of those fields in the new row changed, or do I have to infer that?

You would infer that, just by comparing old and new value.

Okay. Actually, I think it makes sense.

I feel it makes sense to have this as a default, because not every sink system supports what we would call partial updates. You have to work from the assumption that you can only write entire records on the sink side, at least as a safe default, and then of course, if you have a smart sink system, it will be able to apply partial updates. But in the grand scheme of things, giving you the entire event makes sense - also if you look at it from the point of view of having the data just in Kafka: you can go to the latest change event for a given record, and it will tell you, okay, this is the state of the world at this particular point in time for this record.

That raises a much bigger issue I'm going to get into in a minute, but you've just mentioned Kafka for the second time. Does using Debezium imply you'll be using Kafka?

That's a good question - I guess I should have mentioned that. It does not require it. Most of the time, people use Debezium with Apache Kafka as the messaging fabric or messaging layer, so they can connect it to all kinds of sink systems, and that's also, historically speaking, how Debezium started. At the core it's a family of Kafka Connect source connectors - they're about taking data from external systems, all those databases, into Kafka. But we realised over time, or we got the feedback, that people liked the functionality Debezium provided - all the CDC goodness - but not everybody is necessarily on Kafka. Maybe they're using something like AWS Kinesis or Google Cloud Pub/Sub as managed infrastructure, or maybe they use Apache Pulsar, or NATS, or Redis Streams, all kinds of things, and they still would like to use Debezium's change stream capabilities. This led to the introduction of what we call Debezium Server. This is a runtime - in your overall architecture you could compare it to Kafka Connect, it's the runtime for the connectors - but Debezium Server gives you connectivity with everything but Kafka, so you can use it to send events to all the things I mentioned and quite a few more. That's the second way of using Debezium.

And then there's even a third one, and this is about embedding it as a library into your Java application. If you're a developer on the JVM and you have very bespoke needs, you can use this Debezium engine component and essentially just register a callback method, which will be invoked whenever a change event comes in, and then you can do whatever you want to do. This is typically what integrators of Debezium do - Flink CDC is a great example: they take the Debezium embedded engine and expose it straight into a Flink pipeline or a Flink job, without having to go through something like Kafka.
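As a rough illustration of that third, embedded mode, here is a small Java sketch using the Debezium engine API (io.debezium.engine.DebeziumEngine). The connector properties and file paths are placeholder assumptions; the point is simply that you configure a connector, hand the engine a callback, and run it on your own executor.

```java
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import io.debezium.engine.ChangeEvent;
import io.debezium.engine.DebeziumEngine;
import io.debezium.engine.format.Json;

public class EmbeddedCdcExample {
    public static void main(String[] args) {
        // Connector configuration; values here are placeholders for illustration.
        Properties props = new Properties();
        props.setProperty("name", "embedded-engine");
        props.setProperty("connector.class", "io.debezium.connector.postgresql.PostgresConnector");
        props.setProperty("database.hostname", "localhost");
        props.setProperty("database.port", "5432");
        props.setProperty("database.user", "debezium");
        props.setProperty("database.password", "secret");
        props.setProperty("database.dbname", "inventory");
        props.setProperty("topic.prefix", "inventory");
        // The engine tracks its own offsets; a simple file-backed store for this sketch.
        props.setProperty("offset.storage", "org.apache.kafka.connect.storage.FileOffsetBackingStore");
        props.setProperty("offset.storage.file.filename", "/tmp/debezium-offsets.dat");

        // Build an engine that emits change events as JSON strings and invokes
        // the callback for every captured change.
        DebeziumEngine<ChangeEvent<String, String>> engine =
            DebeziumEngine.create(Json.class)
                .using(props)
                .notifying(event -> {
                    // React to each change, e.g. invalidate a cache entry.
                    System.out.println(event.key() + " -> " + event.value());
                })
                .build();

        ExecutorService executor = Executors.newSingleThreadExecutor();
        executor.execute(engine);   // DebeziumEngine implements Runnable

        // ... let it run; on shutdown, close the engine and the executor:
        // engine.close();
        // executor.shutdown();
    }
}
```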
Okay, so if I was in that situation where I'm trying to invalidate a Java cache based on changes in the database, would I probably just go that route?

Again, I guess it depends, but let's say you have an in-memory cache - then yes, I guess that would be the simplest way of doing it. Now the question is: is the application clustered, is it scaled out? Then do you have this invalidation callback in each of your nodes, or do you need to distribute the invalidation message yourself? But yes, that would be one way of doing it.

You can always tell when you're speaking to an expert programmer, because they start with "it depends".

I feel like I should prefix every second answer with that.

Yeah, we'll take that as read. Okay, so that seems like an awfully huge amount of work as a project, and one of the things I instantly wonder is: have you got a connector to Cassandra that has to be written in Java, and another connector for Postgres that has to be written in C? How many languages are involved?

It's mostly really Java. Just by virtue of being based on Kafka Connect, the connectors are implemented in Java. There is a little C component, indeed, for Postgres, because in Postgres there's this notion of logical decoding and logical decoding plugins, which essentially control the format in which you receive those change events from Postgres over the remote connection. For the longest time - and it's still the case - Debezium had its own logical decoding plugin, which emits those change events in the Google Protocol Buffers format, an efficient binary representation. That's the little bit of C code there is; it's called decoderbufs. By now I should also say that, as of Postgres 10, there's a default plugin, pgoutput, which comes with Postgres and is available on all the cloud providers, like RDS for instance, so now I would recommend people use the pgoutput logical decoding plugin. But yes, there's still this C-based component, decoderbufs, in Debezium.

Okay, so mostly Java with a tiny bit of C.

Exactly. Well, actually I should say there's also TypeScript, because there's the Debezium UI as well - a web-based UI where you can set things up - and that's a fair chunk of TypeScript.

Okay, what does the UI do?

Essentially it's about giving you a configuration and management experience. Instead of configuring everything with JSON, as you would with Kafka Connect, it gives you a rather intuitive UI and guides you through the process of setting up those connectors. One example where it's really useful: there's this notion of inclusion and exclusion filters, which describe which tables or schemas in your database you actually want to capture - because the transaction log contains everything, all the changes to all the tables, but maybe you're only interested in capturing changes from one table out of your ten, or from ten tables out of your hundred. The way you configure this is with filters, which are essentially regular expressions. And one common thing we got asked about again and again was: "Hey, I don't receive any changes from Debezium, what's going on? I know I'm making changes in my database, but nothing comes out." Very often the answer was that the filters were just wrong: you had excluded all the tables, so naturally the connector wouldn't emit any change events - everything was discarded. Now the UI gives you a preview: you specify your filter expressions and it tells you, based on this configuration, this is the set of tables you will actually capture. That's a huge usability improvement.
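Those filters are just connector properties - regular expressions matched against fully-qualified schema, table and column names. Here is a hedged example of what they might look like in a Postgres connector configuration (schema, table and column names invented for illustration). If the expressions match nothing, the connector runs happily but emits nothing, which is exactly the support question described above.

```json
{
  "schema.include.list": "public",
  "table.include.list": "public\\.customers,public\\.orders.*",
  "column.exclude.list": "public\\.customers\\.ssn"
}
```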
Okay, that makes me think: presumably the standard change log has all the changes to the internal tables too?

Yes. So what's the question?

I would guess that generally you don't want to hook into that - except maybe you want to hook into CREATE TABLE statements? I don't know, can you do that?

Most of the time we want to ignore any internal tables, and the Debezium connectors just do that by themselves. For Oracle, for instance, there are specific schemas which you'd want to just ignore, so they would never be sent out to the user. As for the DDL events you're asking about: it depends a little on the connector, what it does and what the database supports. Not all of them, unfortunately, give you an event if there was a DDL change like ALTER TABLE or CREATE TABLE. Postgres actually does tell us about that - there's a special kind of event - but yes, it depends a little on the connector.

Okay. I'm just wondering about setting all this up. You're saying that if I'm connecting to, for example, MySQL, what you're essentially doing is hooking into MySQL's change log API - the one it has for its replicas?

Yes - and there's actually a library we use for that. In the case of MySQL it's called the binlog client: the transaction log in MySQL is called the binlog, the binary log, and there's a binlog client library which Debezium uses.

I'm just thinking - classic; we've all fallen victim to not-invented-here syndrome. If I now know that the trick to Debezium is that it behaves like a special replica, and the master database doesn't know any different, why would I not roll my own and get my own preferred format for the data?

I mean, you totally could, if you don't have anything else to do. There are a few things to consider, though. First, it's very easy to underestimate the effort. There's a very rich type system in a database like Postgres, for instance, so you would have to implement support for all the potential data types there are. Then there's this entire notion of snapshotting, which we haven't discussed at all so far: if I go to my transaction logs, I won't have the transaction logs from last year - or maybe even last month - because, as you mentioned, the database uses them for transaction recovery and replication, so at some point it will discard the old chunks of the transaction log, otherwise it would just run out of disk space.

So I've got every change, but I haven't got any starting state to apply it to.

Exactly right. And let's say you would like to put this data into Snowflake so you can do analytics: you want to start with a complete view of your data, not just the data which happens to be changing right now. Typically you want to start with an entire backfill of what's there, and you would have to implement that by yourself.
Then there are tons of corner cases in terms of failovers and restarts, and making sure you never miss any events. Coming back to Postgres again: you need to make sure you acknowledge the part of the transaction log you have consumed, otherwise the database will hold onto it forever. There are corner cases there which you can get wrong. So you could sort it all out - it's software, it's all doable - but I really wouldn't recommend it. And the other thing, of course, is that you would miss out on this de facto change event format: you'd come up with your own event format, but then all the sink connectors which support the Debezium format wouldn't support yours, so you'd have to address that concern as well.

Yeah, I can see that. I'm trying to get a sense of the scope of a project like this, and the moment you mentioned implementing the whole type universe, I was right there.

And then there are so many things to consider to make this work efficiently - monitoring, logging - there are stories to tell about all of those things. Doing a simple proof of concept which prints out some change events from a database on the command line, that would be quite easy. But really packaging it into a production-worthy system with a complete feature set - that's a huge task. The team has been working on this for many years, and it's not a small team either.

How big is it?

Let me see. When I took over the project it was essentially two of us working on it: myself and another engineer at Red Hat. By the way, I should mention that Red Hat is the main sponsor of the project, so the core engineers are employed by Red Hat, but there are also other companies working on connectors under the Debezium umbrella - folks like Stripe, or Google, who did their connector for Cloud Spanner as part of the Debezium community, or Instaclustr, who work on the Cassandra connector nowadays. So there are quite a few companies behind it, but the core team, I would say, is like seven or eight engineers by now.

I'm curious - this is an aside, but since you mention it, I want to know: why Stripe? Why a payments service?

That's a good question. I don't know their use case, to be honest, but if I had to guess, it would be feeding data to their analytics store - that's definitely a common thing. I do know there are organizations - one CRM vendor, for instance - who run on SQL Server and have tons of CDC use cases. They made a huge contribution to Debezium so they could enable their deployment of hundreds of Debezium connectors. Just by the scale of that use case - the core team doesn't really have the capability to test with hundreds of databases - getting that feedback, and the contributions from those companies, is super valuable.

Yeah, I can see that. I can see their use case, and they're large enough for it to justify dedicating an entire person to supporting it.

Exactly. And then they have, you know, some weird edge case: "if we run Debezium on 500 databases and do this particular thing, then it doesn't work" - okay, let's fix it. There's no way in the world the core team could get on top of that by themselves.
Yeah - 500 machines is becoming tenable, but it's still a lot to ask of an open source project.

Exactly right.

Okay, in that case let's get into the idea of snapshotting, which you mentioned. I could see my first use case for Debezium being capturing things to send email notifications - for that I just need to know what's happened. But for the analytics case, I do need to worry about getting the whole world, which I can't get out of the change log. How do you solve that?

The way it essentially works, in the simplest case: on startup the connector gets hold of the transaction log and gets a reference to the current position in it - that's called the offset. Then it starts a snapshot transaction, to scan all the tables that are there - again, coming back to your filter configuration, it scans through all the tables you want to capture. You can also customize this: let's say you work with a notion of logically deleting, soft-deleting, your data - maybe you only want to export the non-deleted data into your analytics store, so you can exclude all those soft-deleted records. And then, once this snapshot process has completed, it starts to read the transaction log from the offset it memorized before. Essentially, this makes sure that nothing is lost in terms of changes which happened while you were doing the snapshot.

Okay, so you're kind of pinning the transaction log, going to another window, logically speaking, saying SELECT * FROM users, and then coming back. Can those two run asynchronously - it's not blocking anything?

Right, this isn't blocking anything. For instance, coming back to Postgres - I feel like I'm saying Postgres all the time, let me say Postgres one more time...

We're definitely guilty on Developer Voices of saying Postgres to mean "generic, good-quality relational database".

Yes, exactly - and they just get so many things right, also in this case. They have this notion of exported snapshots. Essentially, the way it works is you create what's called a replication slot: a handle onto the transaction log, a pointer to how far you have read it. You start this replication slot, and it gives you back its starting position, its starting offset. Optionally, you can also say, "I would like to export this offset as a snapshot", which means you can start another transaction at this particular offset - and you could even do multiple ones. So - and I'm not quite sure whether Debezium does this by now, showing a bit that I've stepped down from the project, but at least in theory it totally could - you could have multiple snapshot transactions all at the same offset, so you could read multiple tables, or even multiple chunks, concurrently, just to speed things up. And then, once you're done, you start to consume from this replication slot and you can be sure you haven't missed anything; any new changes can of course come in during that time. So yes, that's all async.

So you're doing almost a point-in-time query that you can then synchronize on both sides.

Exactly.
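In Postgres terms, the exported-snapshot mechanism looks roughly like the sketch below. Debezium itself drives this through the streaming replication protocol (where creating a replication slot can export a snapshot), but the plain-SQL equivalents give the idea: one session pins a consistent point and exports it, other sessions attach their own transactions to that very same snapshot, and the replication slot remembers the log position to stream from afterwards. Slot and snapshot names here are invented.

```sql
-- Session 1: create a logical replication slot; it records the log position
-- from which streaming will later resume.
SELECT * FROM pg_create_logical_replication_slot('debezium_slot', 'pgoutput');

-- Session 1: export the current snapshot so other transactions can share it.
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT pg_export_snapshot();   -- returns something like '00000003-000001BC-1'

-- Sessions 2, 3, ...: attach to that exact snapshot and read tables
-- concurrently, all seeing the same consistent state.
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SET TRANSACTION SNAPSHOT '00000003-000001BC-1';
SELECT * FROM customers;
COMMIT;

-- Afterwards, change events are streamed from the slot's recorded position,
-- so nothing that happened during the snapshot is lost.
```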
Okay. We should link to this in the show notes, but I saw you gave a talk at Current - which I watched, it was very good - about how that process is now being overhauled. Do you want to say a bit about that?

Oh yes, let's definitely talk about this notion of incremental snapshotting. Debezium always had this notion of, let's call it, initial snapshotting, as I roughly described it, and there were a few issues with that. For instance, you couldn't easily change those filter configurations. Let's say you've set up your table filters so that you capture 10 out of your 100 tables, and now you realize there's an 11th table you also want to capture. You would change the filter configuration, but in the traditional snapshotting approach there was no way to also snapshot that 11th table and then continue to stream the changes for all 11 tables - that just wasn't doable very well. You mentioned that the transaction log gets pinned during the initial snapshot, and that's exactly right: because we don't consume from the transaction log while the snapshot is running, it uses more and more disk space while this traditional snapshot is being executed.

Freeze that point in the transaction log while I SELECT * from ten tables.

Exactly, and that can get big - the snapshot can run for many hours if you have a large amount of data - so that's maybe not desirable. And then sometimes people just want to re-snapshot a specific table: maybe they've deleted their Kafka topic accidentally - shouldn't happen, but it does - or maybe they just want to backfill a new sink system they're standing up, so they'd like to re-snapshot a specific set of tables, and that wasn't doable either. All of this is now supported with the notion of incremental snapshotting. I'm not sure we can or should go into all the details, but essentially it interleaves the events you get from the snapshotting phase and the stream-reading phase - they happen concurrently - so all those problems are kind of solved. You can also trigger snapshots interactively: you can say, "I want to capture this table, customers", and it will go and do that while it continues to read from the transaction log.
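That interactive trigger is typically done through Debezium's signalling mechanism: you insert a row into a designated signal table in the source database, and the connector picks it up and starts an incremental snapshot of the named tables. A hedged sketch follows - the signal table name is whatever was configured (via signal.data.collection), and the exact payload fields should be checked against the Debezium documentation for the version in use.

```sql
-- Ask the connector to incrementally snapshot one table while it keeps streaming.
INSERT INTO debezium_signal (id, type, data)
VALUES (
  'ad-hoc-snapshot-1',
  'execute-snapshot',
  '{"data-collections": ["public.customers"], "type": "incremental"}'
);
```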
So it's still very active. You could think it was just "read that file, turn it into the format, and spread it out to many databases", but you're still actively improving the processing.

Oh yeah, absolutely. There's so much that still can be, and is being, done. New connectors are added all the time, this snapshotting process is being reworked, and I'm sure there are tons of things which can still be improved in terms of performance - really making sure everything happens in parallel, all that kind of stuff. And then, of course, traditionally speaking, Debezium just concerned itself with getting data out of a database into something like Kafka, or Kinesis. But you don't do this just for the sake of putting your data into Kafka - you want to do something with it; you need to take this data and put it elsewhere, as we discussed. It wasn't, and I would say still isn't, really the scope of the project, but people came to the Debezium community and said: okay, so how can we take this data and put it into another database? And this is why a JDBC sink connector was recently added. So now there is also a sink connector which allows you to take the data from Kafka and put it into all kinds of databases you can talk to via JDBC - so basically all of them, I guess. This is very nice in terms of usability, because things like the Debezium UI can give you a complete end-to-end experience, and those connectors work very closely with each other. Confluent has a JDBC sink connector, and there are other JDBC sink connectors, and you need to make sure they're configured the correct way so that they work together with your source connector; whereas here, as Debezium provides both source and sink connector, it all gets very seamless - it can reason about schema changes and apply them to the sink database. Having this sink story is a huge value to the project, and it by itself creates lots of new things to work on.

Yeah - even if it were exactly the same quality as all the others, it comes with the same assumptions on the write end as on the read end, and that's going to make it easier.

Exactly, exactly.
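For a sense of what that looks like, here is a hedged sketch of a Debezium JDBC sink connector configuration as you might post it to Kafka Connect. The property names follow the Debezium JDBC sink documentation, while the connection details and topic name are invented.

```json
{
  "name": "jdbc-sink",
  "config": {
    "connector.class": "io.debezium.connector.jdbc.JdbcSinkConnector",
    "topics": "inventory.public.customers",
    "connection.url": "jdbc:postgresql://analytics-db:5432/reporting",
    "connection.username": "sink_user",
    "connection.password": "secret",
    "insert.mode": "upsert",
    "primary.key.mode": "record_key",
    "schema.evolution": "basic"
  }
}
```

Because the sink understands Debezium's change event envelope, it can apply updates and deletes correctly and evolve the target schema, which is the "same assumptions on both ends" point made above.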
So that then gets into what you mentioned earlier: you might want to replicate Oracle to MySQL.

Exactly - it could be used for that.

I hear you holding your breath as you answer that. Is it a sane thing to do? If my task was to get Oracle into MySQL, would Debezium be a good choice?

Yeah, definitely - people do it. I used to work in environments where you just wanted a database with, say, the data from yesterday, so you could run some ad hoc queries, or develop your new functionality against it and look at execution plans and all that kind of stuff, not against the live database but against this other database. And maybe you don't want to use the super-expensive database for that purpose, so you use another one. I could see that, and people definitely do it.

Okay, cool. That sort of relates to one of the reasons I started thinking about doing an episode on Debezium. As I mentioned, we did an episode on real-time data streaming and why we care, with Thomas Camp, and we were considering: if you've got an existing batch-based system - stuff feeds into, let's say, MySQL through a website, a standard architecture - and you want to get into real-time eventing, I can see some advantages, but change is expensive and new projects are risky. What's the minimum step I could take? Would that be just turning your database back into a real-time stream of events using Debezium? Is that a sane first step into the real-time world?

I'm not quite sure I fully understand what you're asking - you have a batch-based architecture and you want...

Yeah, just imagine a traditional stack with MySQL at the heart of it. You're doing transactional updates to stuff, and you would like to start doing real-time notifications to Kinesis or something, but you don't want to overhaul the entire thing and say, "okay, let's have an 18-month project to build event systems" - which is actually going to take three years and then fail. Is Debezium your minimum crack into the world of real time?

Right - I think that's a fair assessment, and people very commonly do that. And actually, often there's this scenario where they don't want to touch the source code of the application; sometimes they cannot even touch the source code. I remember at a past job we had this WAR or JAR running on our application server and nobody had the source code any longer. It was running just fine, but you couldn't change it, because the source code was gone. So there's definitely that kind of world, and sometimes people just shy away from touching things. Of course, you need to be careful about it, because if you're in that sort of situation, your data model is typically also in a bit of a non-ideal state, and the question is whether you want to propagate that into your new world. Maybe you want to use this for migrating off a large monolithic application to microservices - that's definitely a common use case - but then maybe you want to shield those new microservices from all the weird legacy modelling. Let's say you have a database where column names can only be 30 characters long - such a database exists - so often you want some sort of transformation layer in between, and that's what people typically do with Kafka Connect for simple stuff, or with Kafka Streams or Apache Flink, so they can present a nice, clean view of the data. Maybe they want to limit things, rename things, or change the types - maybe everything is a string in your old database and you would like to expose proper data types.

Stuff happens, right? We've all seen it.

So having some sort of transformation between your old legacy world and the new world often makes sense, and that's the thing I would advise people to consider: don't just use CDC to expose the weird oddities - put some thought into providing properly crafted data contracts.

Data contracts is the thing, isn't it. You're a prolific blogger, by the way, and we will link to your blog.

Oh, that's amazing, thank you so much.

But I guess what you're saying is that Debezium is a sane way to turn a batch system into a series of real-time events, but then there's also the semantic journey: a series of JSON packets saying "insert" and "update" is not the same as an event system.

Yes, exactly. There's so much to say about that. First of all, there's the question of the granularity of events. What CDC gives you is table-level, or row-level, events. But maybe - say you model your application in terms of domain-driven design - you have something like aggregates, and in a relational database they would be persisted across multiple tables. A simple example: a purchase order with multiple order lines. In your relational database that would be stored in two tables, one for order headers and one for order lines, with a one-to-many relationship between the two. Now you get those table-level events, but what you would actually like is a single event which describes the entire purchase order. The question, of course, is how you achieve that, and again stream processing can help: you could use something like Flink SQL to join those row streams and expose a higher-level stream which then has a nested data structure.
There are also things like the outbox pattern, which can help with that, but they would require you to modify your application and actually emit those outbox events - so in a legacy kind of use case that's maybe not desirable.

Yeah, the pain's got to go somewhere. That does raise the question of how Debezium handles transactions. If I've got that order header and the individual order rows, am I going to get that as one transaction in the Debezium output, whatever you call it?

Great question - I love it, it's kind of my favourite thing. No: by default, what you get right now is those events one by one. Let's say you have one purchase order and it has three order lines; you would get those four events - typically, by the way, on separate topics in Kafka, because by default there's a correlation between the table name and the topic name. You also wouldn't have strong ordering guarantees across them, so it could happen that you receive one of the order-line events, then you consume the order-header event, and then you get the other two order-line events. So you get them one by one - but you can still correlate them, because each of the events contains the ID of the transaction it originates from. So for all four of your events you would see: okay, they originate from transaction 123. And what Debezium can also give you is another topic which tells you about the transactions which have been executed. On that separate topic you would receive an event saying transaction 123 has started, and then another event saying transaction 123 has committed - and, by the way, in this transaction there's one event for order headers and three events for order lines. With that information you could implement some sort of buffering logic, either in your sink system - you could put the data into a staging area and only propagate the events once you know you've received all four events you expect from that transaction - or, again, you could use something like Flink to implement this in a streaming fashion.

So there's a bit of legwork, almost detective work, to reassemble what those atoms of change mean semantically.

Yes, exactly. The problem is - and people sometimes ask about this: "hey, can we get a single event for an entire transaction from Debezium?" - the answer is that in theory it could be possible, but transactions can be arbitrarily large. You could do a bulk delete and remove 50 million records from your database in a single transaction, and there's just no way we could give you a single event which describes 50 million records - it would be way too large.

Also, if we just go back to the simple order example, there's no guarantee you've got one transaction per order - it could be batched up with lots of other things that happened in that transaction.

Yes, exactly right - you would have to reason about that as well.
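Those transaction markers look roughly like this when transaction metadata is enabled on a connector (provide.transaction.metadata=true). The shapes below are an abridged illustration of the BEGIN and END events on the transaction topic; exact field names can vary by Debezium version, and the IDs and table names are invented.

```json
{ "status": "BEGIN", "id": "571", "event_count": null, "data_collections": null }
{ "status": "END",   "id": "571", "event_count": 4,
  "data_collections": [
    { "data_collection": "public.order_headers", "event_count": 1 },
    { "data_collection": "public.order_lines",   "event_count": 3 }
  ]
}
```

And, for correlation, each data change event carries a transaction block along the lines of the following, so a consumer can tell which transaction it belongs to and where it sits within it.

```json
{ "transaction": { "id": "571", "total_order": 2, "data_collection_order": 1 } }
```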
Okay, why don't we add a slight aside, because I don't know that much about Flink, and I know it's now part of your day job at Decodable. Give me the overview of how you might recapture that order event in Flink.

That's a good question - how would it work? Well, I would essentially listen to this transaction topic, and based on that I would implement some sort of buffering logic. I can have state stores in Flink - and, first of all, I probably wouldn't be able to do it in Flink SQL right now, so I would have to do some Java programming - but I could implement buffering logic which compares those events and sees: okay, this event belongs to a specific transaction, I haven't received all of its events yet, so I need to put it into a state store. Then more events come in, and at some point you see: okay, I've got all of them, so now I can actually go and emit whatever the aggregated structure is.

Okay, so you're writing a bit of Java that has to recreate that logic of grouping them back up.

Exactly right. And of course, if we wanted to abstract things a little, we could think about having some specific query operator in Flink SQL which would say, I don't know, "SELECT TRANSACTIONALLY FROM" whatever - and then, let's say there was a managed platform like Decodable, that could be functionality it provides.

Okay, do I get a slice of the royalties for giving you that idea?

Well, this idea might have been around before.

Oh, okay, damn. I'll have to keep working on that, then.

But I can tell you, this is absolutely a very common need and scenario. Just to make things tangible: people are often in this situation when they want to put data into something like Elasticsearch, because - coming back to this purchase order example - you would take the entire data from the purchase order and all its lines and put it into a single document in an Elasticsearch index. You don't want to do query-time joining, if you even can - I don't know - so you would like to have the entire order and all its associated data in a single document there. Now, if you implement this as-is with Flink or Flink SQL, the data comes in on those two source streams, orders and order lines, and the join runs whenever a new event comes in. So essentially you would materialize this join when you have a single order line, and you would put that into Elasticsearch; the next line comes in, so you materialize it again, and you would have an order with two out of three order lines - a partially complete view of the world. Now let's say the user comes along at this point in time, goes to Elasticsearch and runs a query: depending on the specifics of timing and so on, they would get incomplete search results. Ideally, what those people want to do is only write to Elasticsearch once they know that this document is actually complete, the complete view of the data. This is a very common requirement, and using those transaction markers which we have in Debezium is a way of solving it. And, by the way, the Stripe folks spoke at Flink Forward about how they implemented exactly that - so it's already happening.

Okay, I'll have to link to that in the show notes - I'll dig that one up.

Yeah, definitely.

We keep coming back to the old chestnuts of transaction boundaries, normalization and denormalization.

Yes - same same, but different. They're always with us.
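For the curious, here is a hedged Java sketch of the kind of buffering Gunnar describes: a Flink KeyedProcessFunction keyed by transaction ID that buffers change events in state and only emits the batch once the transaction's END marker, with its expected event count, has arrived. The Envelope type is a hypothetical stand-in for however you deserialize the Debezium topics; a production version would also need timers for timeouts and state clean-up.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

/** Input envelope: either a data change event or a transaction END marker, keyed by txId. */
class Envelope {
    String txId;
    boolean isEndMarker;
    long expectedEvents;   // only meaningful on END markers
    String changeEvent;    // raw change event payload (only set on data events)
}

public class TransactionAggregator
        extends KeyedProcessFunction<String, Envelope, List<String>> {

    private transient ListState<String> buffered;
    private transient ValueState<Long> expected;

    @Override
    public void open(Configuration parameters) {
        buffered = getRuntimeContext().getListState(
            new ListStateDescriptor<>("buffered-events", String.class));
        expected = getRuntimeContext().getState(
            new ValueStateDescriptor<>("expected-count", Long.class));
    }

    @Override
    public void processElement(Envelope in, Context ctx, Collector<List<String>> out)
            throws Exception {
        if (in.isEndMarker) {
            expected.update(in.expectedEvents);
        } else {
            buffered.add(in.changeEvent);
        }

        // Emit once we know the expected count and have buffered that many events.
        Long expectedCount = expected.value();
        if (expectedCount != null) {
            List<String> events = new ArrayList<>();
            buffered.get().forEach(events::add);
            if (events.size() >= expectedCount) {
                out.collect(events);   // complete transaction; downstream can join/denormalize
                buffered.clear();
                expected.clear();
            }
        }
    }
}
```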
So I think we should probably end on this: what tends to be the simplest use case that people start with, and what's the most ambitious or crazy use case you've seen? Some real-world things to take this home.

The most simple, I would definitely say, is feeding data to OLAP stores or data warehouses - that's the most common thing people do. By the way, we didn't really touch on latency, and why it's so interesting to do all this in real time: I know of people who take the data from their operational MySQL clusters and put it into Google BigQuery so they can do analytics there, and they have an end-to-end latency of less than two seconds. It's really kind of instantaneous, and that opens up many interesting possibilities - being able to go to your data warehouse and have a live view of the data, so you can drive dashboards off it and all that kind of stuff. So I would say that's the most common, and also the simplest, use case.

In terms of the most complex - that's an interesting one. We have an interesting one at Decodable, where we actually use it for propagating changes from our control-plane database to our data planes. When people create new things in Decodable, like a new connection, we want to react to that so we can materialize the resources in our data plane.

So that's a bit more on the advanced side, because you're running a software-as-a-service on Kubernetes - you submit a request and it turns into running infrastructure.

Exactly - and it should be reliable and all that kind of stuff, and we can use Debezium for that. But then there's also this entire notion, which we touched on briefly, of migrating from a legacy application to microservices, or to a new world of applications. I think that's more advanced, because it has a strong element of stream processing: massaging your data, denormalizing it, putting it into new shapes. And sometimes people use something like Kafka Streams or Flink SQL to continuously update materialized views - maybe they want the revenue per category in specific time windows, updated in a continuous fashion. That's also a bit more on the advanced side of things.

Okay, I'm going to cheat - I'm going to break my own rules. I'm allowed to, because I know you did a talk about this as well, and you hinted at it, and we really shouldn't leave without discussing it: going from a legacy system to a microservices system. I can begin to see how Debezium fits in, but you did a whole talk about it. Give me a teaser on how Debezium fits into migrating to microservices.

Oh, I'm so glad you asked. There's an interesting pattern there which is called the Strangler Fig pattern. The Strangler Fig is a plant that wraps around and strangles its host tree, and at some point the old tree dies off - it doesn't sound very nice, I know, but that's kind of the picture for this migration approach. The idea is that you don't want to do a huge big-bang migration where, as you say, you work on it for three years, it takes another three years, and then it fails. Rather, you want to go gradually, step by step, and reduce the risk.
So the idea is: you take one chunk of your functionality - maybe you start with a single view in your web application, I don't know, the order history view - and that's the thing you want to serve from a new, shiny microservice, written in Clojure or something like that. You start to extract this functionality from the monolith into your new microservice. It's its own thing, and of course it should have its own state store, its own database, because you don't really want to share data stores across service boundaries. Then you use CDC to propagate all the writes - which at this point still go to the old monolith - over to the new database of the new microservice. And in front of everything you have a routing component, which essentially says: okay, this is a read request for the order history view, so I send it over to the microservice and let it be served from there, and everything else is still served from the monolith. So all the other reads, and all the writes for that part of the domain, still go to the monolith, but this one piece of read functionality is already served from the new service, and you use CDC to gradually extract data from the old database to the new one and keep them in sync.

And then you keep going. You extract more functionality, and at some point you say: okay, everything related to purchase orders should now be owned by this new Clojure service - or, I don't know, Elm, or whatever your favourite language is these days. Let's say Haskell. Does anybody build web services in Haskell? I don't know, but say you do. So you have this new Haskell service for purchase orders, and it now receives the reads and writes for that part of the domain. But it could still be that you have functionality in the old application context which needs to know about purchase orders, so now you could say: okay, I also propagate changes from purchase orders back to the old world, so it can reference the data there, and so on. So you do it kind of bidirectionally - but you just have to make sure that for each part of your domain, for each table essentially, or each aggregate, there's one system which is in charge. What you should avoid is having writes to, I don't know, customers or purchase orders on both sides of this architecture, because then you would end up propagating them back and forth in cycles, and it would get kind of weird.

Yeah, you end up with the reconciling-multiple-master-databases problem, which is always horrible.

Exactly, so you shouldn't do that - but otherwise, Haskell for your web services, go for it.

I'm going to have to do a whole episode on that, because it's very nice.

Oh yeah, I will watch. Maybe I'll do a coding walkthrough or something.

Okay, we'll link to the full talk, but that gives me the picture, and I like incremental migrations - big-bang projects are often a disaster.

Yes, absolutely, it's a nightmare. And with that approach you can go at your own pace. You can pause - maybe at some point you realize, okay, I'm happy having extracted just those three things into their own services, and the rest can remain in the old system. You can do that.
So it's really about minimizing risk. You don't want to change everything at once and then find that nothing works - it's about risk management.

And is that the best way to get into using Debezium in a system, would you say - just choose one small piece and pick it off?

I would say so, yeah. I'm always a big fan of taking something - be it Debezium or whatever technology - and applying it to one part of your problem space, seeing how it works, getting a feel for it. And then you realize how amazing it is, and then you use it for everything.

I think that's a perfect tagline to end the episode on. Gunnar, thanks very much for joining us.

Absolutely, it was a great pleasure, Chris - thank you so much for having me.

Thank you, Gunnar. Now, we dropped quite a few references in that discussion, so I'll compile them all together and you'll find the links in the show notes as usual. I'm going to add in another link there: we kind of touched on, in that discussion, why you can't just use SQL polling to get real-time data out of a database - you do have to look at the transaction log if you don't want to lose data. If you'd like some more detail on that, there's a really good talk by a former guest of ours, Francesco Tisiot, and he can tell you some horror stories from just relying on JDBC, so check that out if you want the full argument. As well as that backward reference, I'll give you a forward reference too: we need to have an episode on Apache Flink, and it is in the pipeline, so if you want to catch that, now is a great time to click subscribe and notify. And if you've enjoyed this episode, please take a moment to like it, rate it, maybe share it with a friend - you know where to find the buttons. So I'll leave you to it. Until next week, I've been your host, Chris Jenkins. This has been Developer Voices, with Gunnar Morling. Thanks for listening.

[Music]
Info
Channel: Developer Voices
Views: 3,345
Id: 88j7EEiyqzM
Length: 62min 46sec (3766 seconds)
Published: Wed Nov 01 2023