What are write-ahead logs and what are the gotchas?

Captions
I'm Jenny. In this series of videos we're going to do a Q&A on streaming data. The leaders of Estuary have been hard at work attending many podcasts around the data engineering community, sharing their valuable knowledge. You may not have caught all of the podcasts, so we're doing a YouTube series that takes the best of them and presents the knowledge to you in a Q&A format. If you have any questions about streaming, feel free to leave a comment and we'll try to get to it in future videos in this series. In this video, Johnny, the CTO and co-founder of Estuary, goes over what write-ahead logs are and the gotchas you should know about. Let's dive right into it.

So databases offer this really useful primitive of a transaction: I can make changes to a bunch of tables, and all of those changes either commit (become visible all at once, together) or, if the transaction is rolled back, not at all. It gives me a way of changing a bunch of related things at the same time. But databases are built on top of disks (spinning hard disks, or SSDs these days), which are anything but transactional. If you're writing data to a disk, it will freely tell you: "I wrote half of it, then someone pulled the plug from the wall and the other half got lost, sorry." So databases have to offer this transactional primitive on top of a piece of hardware that doesn't offer transactions, and the write-ahead log is the technique databases use to provide transactionality.

Briefly, a write-ahead log is where a database records the set of changes it's planning to make in an open transaction, or has already made in a committed transaction. It's journaling the changes being made to the database, as well as markers recording whether each transaction has committed.
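The journaling-plus-commit-marker idea can be sketched in a few lines of Python. This is a toy illustration of the technique, not any real database's on-disk format; the record shapes are made up for the example:

```python
# Minimal write-ahead-log sketch (illustrative only): changes are journaled
# with BEGIN/COMMIT markers, and recovery replays only those transactions
# whose COMMIT marker made it into the log before the crash.

def recover(log):
    """Replay a WAL: apply changes only for committed transactions."""
    pending = {}   # txid -> list of (key, value) changes not yet committed
    state = {}     # the recovered table state
    for record in log:
        op, txid = record[0], record[1]
        if op == "BEGIN":
            pending[txid] = []
        elif op == "SET":
            pending[txid].append((record[2], record[3]))
        elif op == "COMMIT":
            for key, value in pending.pop(txid):
                state[key] = value
        # A transaction with no COMMIT record (e.g. the plug was pulled
        # mid-write) simply stays in `pending` and is discarded.
    return state

# Transaction 1 committed; transaction 2 was cut off before its COMMIT.
log = [
    ("BEGIN", 1), ("SET", 1, "a", 10), ("SET", 1, "b", 20), ("COMMIT", 1),
    ("BEGIN", 2), ("SET", 2, "a", 99),
]
print(recover(log))  # {'a': 10, 'b': 20} -- tx 2's half-written change is ignored
```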
Then, if you pull the plug from the wall and the database crashes, what it's able to do on startup is read that write-ahead log and say: this transaction committed, so apply it; this transaction was rolled back, so ignore it; and so on. So this is really something databases created for themselves; you need it just to implement the database in the first place. But it ends up being really useful for watching the database and extracting what's happening inside it from an external application. Pretty much all major databases have now implemented features that take the raw, physical binary logs the database keeps in this write-ahead log and add logical replication streams on top of it, making it really nice and easy to get the changes out of the database.

The problem is that databases only keep write-ahead logs around for a limited amount of time. If you've got a multi-terabyte database, or even a multi-terabyte table, the database keeps a write-ahead log only for really recent changes and not much more than that. Once things are far enough in the past, it doesn't need the log anymore, and it would be a lot of data to keep around, so databases don't.

And the primary challenge if you're trying to implement change data capture is that you rarely care about just the ongoing changes to a table. You actually want a full sync of the entire contents of the table that then stays up to date in response to the ongoing changes happening within it. It's not enough to say "I want the changes starting from now and going forward"; that's actually quite easy to do. The hard part is saying "I want all of the history, and then I want it to stitch together exactly with the ongoing changes coming from replication." That's the hard part.
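The stitching problem can be illustrated with a toy sketch: take the full snapshot at a known log position, then apply only the change events that are newer than that position. The `lsn` field and event shape here are hypothetical, purely for illustration, not any specific database's or connector's schema:

```python
# Hedged sketch of stitching a full table snapshot together with the
# ongoing change stream. The snapshot is assumed to reflect the table
# exactly as of `snapshot_lsn`; older events are already baked in.

def stitch(snapshot, snapshot_lsn, changes):
    state = dict(snapshot)  # full sync of the table as of snapshot_lsn
    for event in changes:
        if event["lsn"] <= snapshot_lsn:
            continue  # already reflected in the snapshot
        if event["op"] == "delete":
            state.pop(event["key"], None)
        else:  # insert or update
            state[event["key"]] = event["value"]
    return state

snapshot = {"a": 1, "b": 2}   # taken at log position 100
changes = [
    {"lsn": 90,  "op": "update", "key": "a", "value": 0},  # pre-snapshot, skipped
    {"lsn": 110, "op": "update", "key": "a", "value": 5},
    {"lsn": 120, "op": "delete", "key": "b", "value": None},
    {"lsn": 130, "op": "insert", "key": "c", "value": 7},
]
print(stitch(snapshot, 100, changes))  # {'a': 5, 'c': 7}
```

The hard part in practice is obtaining a snapshot whose log position is actually known and consistent, which is exactly what the locking approach discussed next tries (expensively) to guarantee.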
One of the major issues we saw when we started looking at this a couple of years ago: if you want a logically consistent snapshot, the way Debezium still does this by default is that it starts a transaction that locks the database, or locks tables within the database. What that is effectively doing is forcing the database to keep around all of the write-ahead logs for the ongoing changes to that table. Then, while it holds the lock, it runs what is essentially a SELECT *, scanning out the entire contents of the table. The issue is that this can take a while. If you've got a multi-terabyte database, that SELECT * is potentially going to run for many days, and while that's happening your database is filling up its disk with write-ahead-log segments, which is a great recipe for running out of disk space on your operational production database.

So one of the core issues we really wanted to address was: how do we do correct backfills in an incremental way, without bringing down your production database while we're doing it, and while keeping the load on that database very low? That was one of the primary motivators that started us on this journey.

Another, for a variety of reasons: we've heard repeatedly from partners, vendors, and customers that the data you're getting out is not always logically consistent. By that I mean, for some particular key in your table, you'll see an update that happened before its insert, or you might see two inserts for a particular key, which doesn't really make sense. You can only insert a key, then update it, and then delete it; you can't insert a key twice, assuming there's a primary key on the table.
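That per-key ordering invariant (insert, then updates, then at most one delete, after which the key may be inserted again) can be checked mechanically. A minimal sketch, with a hypothetical `(op, key)` event shape made up for the example:

```python
# Checker for the per-key ordering invariant of a CDC stream:
# a key must be inserted before it is updated or deleted, and must not
# be inserted again while it is still live (i.e. not yet deleted).

def find_violations(events):
    """events: list of (op, key); returns a list of (index, reason)."""
    live = set()       # keys currently inserted and not deleted
    violations = []
    for i, (op, key) in enumerate(events):
        if op == "insert":
            if key in live:
                violations.append((i, f"double insert of {key!r}"))
            live.add(key)
        elif op in ("update", "delete"):
            if key not in live:
                violations.append((i, f"{op} of {key!r} before its insert"))
            if op == "delete":
                live.discard(key)
    return violations

events = [("update", "k1"), ("insert", "k1"), ("insert", "k1")]
print(find_violations(events))
# [(0, "update of 'k1' before its insert"), (2, "double insert of 'k1'")]
```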
Those are the two major issues we set out to address, that we really wanted to resolve in the work we were doing in this space.

In our next Q&A video, Dave, the CEO and co-founder of Estuary, will go over what tech stack he'd recommend for people with a technical background who aren't data engineers, and how to get started in streaming. Stay tuned, and remember: if you have any questions you'd like answered, feel free to drop them in the comments. See you next time.
Info
Channel: Estuary
Views: 224
Id: ReS-jb3qmi8
Length: 6min 44sec (404 seconds)
Published: Mon Jul 24 2023