The Materialize Incremental View Maintenance Engine | Materialize

Video Statistics and Information

Captions
I'm here from Materialize, Inc. I'm the chief scientist at Materialize, and we're building something I'm going to tell you about now, or at least my take on what we're making. There's some contact information up now, and there will be links and such as we go, but all of this information appears again at the end of the talk, so don't stress right now about writing anything down or wandering off onto your phone.

What we're building at Materialize is what we call a SQL view maintenance engine. That's the title of the talk; that's me; that's Materialize. Perhaps some of you are scratching your heads and asking, "what's a SQL view maintenance engine?" So I'm going to talk through that: how it comes to pass, why you might want such a thing, what sorts of problems it solves, and what opportunities it opens up that you might not have imagined before.

Many of you are probably familiar with relational databases, something like Postgres or MySQL or something from Oracle, or what have you. I've drawn one here as a generic relational database management system, and I've put it in red because these things are usually pretty hot: they're busy doing a lot of important business work for you. Typically they handle your business-critical reads, writes, and transactions. When customers show up and make purchases, you write that down; if your customers' contact information changes, you write that down; and you want all of these things to be consistent. In many cases this is your source of truth for all of your important business data.

At the same time, in addition to these interactive reads, writes, and transactions, the database is also managing another class of workloads. I've drawn these a little differently, with bigger arrows, because they're a bit meatier and chunkier: reporting, analytics, and monitoring tasks. They ask questions like "how many engagements did we have in the past 24 hours?" Maybe you're monitoring your system infrastructure, so you're wondering what your KPIs are, what latencies you're seeing when you hand numbers back to people. Or other analytics you might be looking at: when people buy a foo, what else do they buy in the same shopping basket? You want to figure these things out so that you either sell more and make more money, or learn a bit more about your customers. But this class of workload is, in this picture at least, a little less interactive: you're not interacting with the hot database so much as landing on it, asking a big query, and getting the answer back. In the modern era of data-warehousing-style systems, these queries land not so frequently and refresh not so frequently, and we're going to see whether we can change that.

Now, the current state of the world, or at least one of the current states of the world we hear about from a bunch of folks, is that if you want to address this problem of more and more analytic queries of more and more complexity, the current tools inside the database itself aren't great for it, so people move toward denormalization.
Many of you are familiar with this: you take all of your exciting records about interactions people have had with you, and instead of saying "here's a customer ID, go look up the customer later," you write everything you know about the customer right inline in those records. You make relatively big, wide tables of all of your interactions, so that when you come to do these analytics and reporting tasks you just aggregate up, running through all of this data. You count up, for each state or something like that, the amount of sales you've done there, and you're lucky you wrote down the state instead of just the customer ID, because otherwise that would have been a join, and those are hard. So a lot of people tweak their internal data architectures to look more like this denormalized form: a little less like a relational database and a bit more like something amenable to a stream processor. Which is fine; there's nothing fundamentally bad about that, but it causes some challenges.

The good thing that happens is that you're now able to cache a fair bit more. You can push this denormalized data out so that incoming queries land on the cached data, and instead of performing a complicated multi-way join in your relational database you can just read data right out of your cache and serve answers.

On the flip side, life becomes harder for everyone on the left side of the screen. The reads, writes, and transactions have a lot more work to do when data change. If a customer changes their contact information, say they move from one state to another, you potentially need to rewrite all of the places where you scribbled down the specifics about that customer. It would have been helpful, in that setting, to have kept basically a primary key from the customer to all the information we know about them, and not to have expanded everything out in place. In a lot of settings this leads to ossification: it's hard to change things. Someone shows up with a new, exciting bit of information they'd like to record about customers, maybe an engagement score they're trying out, and if they have to scribble that down in every single place the customer appears, that's a lot of work, and it makes it hard for that person to try out a new feature. Similarly, if someone comes in with a new query and says "I'd love to look at our data aggregated this way, which we hadn't realized we'd need before," then if the data isn't laid out correctly, if it hasn't been denormalized appropriately, maybe that's a non-starter, or it requires a lot of data engineering to put into place.

So this can be an awkward pickle: you want to support both the relational-database interaction modes and also larger aggregate queries with a prompt, interactive feel to them. That's what we're going to talk about: taking this diagram and spinning it around a little, in a way you might have imagined in the streaming space. We're going to let you keep thinking of the relational database as the source of truth; people still interact with it, still using their reads and their writes and their transactional updates. But rather than try to land analytic queries on the relational database, which isn't exactly designed for this, we're going to do something different.
We're going to think of the database as producing exhaust: in different language, change data capture, a changelog, or a binlog. Basically, it gives you a record of what has happened in the database, which records have been added to and removed from each of the various tables. We're going to feed that into a data processing engine, which for this talk is Materialize. I'm sure you're all familiar with other sorts of platforms; what I'm going to do for most of this talk is call out the exciting ways Materialize is different from the things you might be used to, and some of the cool things you can do with it that other frameworks can't quite do. I'm going to try to get you excited, tickle your brain, and get you thinking, "wow, I could do this, that, and the other thing I've never been able to do before."

So let's start. What I really thought I would start with is a live demo, and I'm not going to do that, because it was implied that might be an irresponsible thing to do. What I've done instead is capture a demo: just yesterday I walked through some examples in our system against a live, running installation, and I'm going to walk you through a little video of it.

Materialize presents in a few different ways. It presents as a database, and one way you can interact with it is through a SQL shell. We have a CLI here where you log in; it's a lot like psql or other such database CLI tools, an interactive setting where you can look around. You can look at, in this case, sources: these are the places we get data from. For example, the MySQL entries are a bunch of relations we've pulled in from a MySQL instance nearby, and they're showing a bunch of data; there's also some logging stuff in here. Don't get too excited about any of the specifics, but as you can see, one of the things you can do is just treat this as a SQL instance. You can count the number of records in each of these relations, in this case order_line, which happens to be one of the big relations in the TPC-C benchmark. That's a transaction-processing benchmark, and order_line tracks the parts of orders that have flowed through some big warehousing system. We count the records in it, some million-odd records (this isn't a particularly big installation), and it takes about half a second. There's no particular reason to think that's great, but you can do it.

So that's one way you can interact with the world. What Materialize is going to encourage you to do, instead of writing de novo SQL queries over and over again, is to draw out from you the concept of a materialized view, or view. It's going to ask you to say: hey, maybe instead of writing this query over and over, announce it as a query of interest. We do this in Materialize with SQL's CREATE VIEW language: we create a view, my_count, as exactly the same query as before.
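Concretely, the kind of shell session being described looks roughly like the following. The relation and view names (order_line, my_count) are the ones mentioned in the demo, but the statements are an illustrative sketch rather than a verbatim transcript, and the exact spelling of a maintained view (CREATE VIEW versus CREATE MATERIALIZED VIEW) depends on the Materialize version you run:

    -- list the relations Materialize knows how to ingest
    SHOW SOURCES;

    -- an ordinary ad hoc query: scans the data, roughly half a second in the demo
    SELECT COUNT(*) FROM order_line;

    -- announce the same query as a view of interest; Materialize maintains it from now on
    CREATE MATERIALIZED VIEW my_count AS
        SELECT COUNT(*) FROM order_line;

    -- reading the maintained answer takes tens of milliseconds
    SELECT * FROM my_count;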
That has the nice property that we're now maintaining it for you, and when you look at it, instead of taking half a second it takes tens of milliseconds or so. You can keep looking at it, and you'll see the numbers changing: the database actually is changing, undergoing a few hundred records a second of churn, and we're handing out nice, consistent answers. All of the answers Materialize produces are going to be exactly the correct answer for the database at some point in time, a point in time that moves forward.

Maybe this is interesting, maybe not. You're probably thinking, "okay, COUNT, that's pretty amazing; I could probably do COUNT too." So let's dial it up a little and do some more interesting examples. The next one is the first query from the workload we're using; this is a TPC-H variant. It's just a big SELECT that uses the same order_line relation, but it produces a bunch of sums, averages, and counts as well, using various SQL idioms, with some filtering thrown in. It's literally just SQL-92; nothing exotic about the query language. You can see a bunch of results up there; don't worry about reading their particulars. Instead of taking 0.8 seconds, as the raw query did, we create a view, and with the view created we'll see that diving in and getting the specific data out takes a lot less time: tens of, sorry, tens of milliseconds, about 50 or 60 milliseconds in this case, and you can see the data changing. What we've done, just so this is totally transparent, is pivot the computation from one that pulls data out of the system using the standard SQL evaluation technique to one that reacts to change. We'll talk about the architecture for that in a moment, once we've gotten through these examples. We've turned something that looks like a big database computation into something that looks more like a streaming computation: a reactive computation that updates in response to change, quite promptly.

Now, you're all very clever people, and averages, sums, and counts are again not the world's most obscene queries; you could probably figure out how to write one of these yourself if you needed to. So let me throw out something that's hopefully a bit more terrifying. Here's another query: a four-way join. You can see in the FROM clause there are four relations (customers, orders, order lines, new orders), and we're joining them together with about eight constraints on various primary keys. It takes, in this case, a second or so to compute the answer; the data sets are not massive. We do exactly the same thing: there's already a created view, my_join, up here that is the exact same query, and pulling fresh results out of this join (there are about three hundred results, so we might limit it down to a small number) again takes tens of milliseconds, even though the join is a four-way join that involves a bunch of data shuffling and moving things around. Once we've transformed the computation into a reactive dataflow, the ability to update it is quite brisk, even though the original computation might have involved reaching around to all sorts of different sources, moving data around, and shuffling it.
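The two views in this part of the demo have roughly the following shape. The column names and join conditions below are invented stand-ins (the live demo ran a TPC-H-style query over TPC-C tables), so read them as a sketch of the idea, not the actual benchmark queries:

    -- a TPC-H Q1-style aggregation: sums, averages, counts, and a filter; plain SQL-92
    CREATE MATERIALIZED VIEW my_aggregation AS
        SELECT
            ol_number,
            SUM(ol_quantity) AS sum_qty,
            SUM(ol_amount)   AS sum_amount,
            AVG(ol_quantity) AS avg_qty,
            AVG(ol_amount)   AS avg_amount,
            COUNT(*)         AS count_order
        FROM order_line
        WHERE ol_delivery_d > DATE '2007-01-02'
        GROUP BY ol_number;

    -- the "more terrifying" example: a four-way join constrained on keys
    CREATE MATERIALIZED VIEW my_join AS
        SELECT o.o_id, c.c_name, SUM(ol.ol_amount) AS revenue
        FROM customer c, orders o, order_line ol, new_order n
        WHERE c.c_id     = o.o_c_id
          AND ol.ol_o_id = o.o_id
          AND n.no_o_id  = o.o_id
        GROUP BY o.o_id, c.c_name;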
There are just a few more little blurbs to see. That was revenue ascending, which is the wrong way to look at large revenue numbers; of course the most exciting way is to look at the biggest ones. If we flip the ordering around (we haven't really changed the view my_join at all), we get interactive access to the view by descending revenue and see the big numbers.

There are some other cool things that might not be obvious. The joins and counts we've described so far have been over base relations, but we can write views that depend on other views. Here we're creating a view that selects the average revenue from the results of that join: the four-way join, reduced down into total revenue, then averaged across each of the distinct orders. When we pull data out of there, we'll see, again I hope, that it's tens of milliseconds, and you can refresh it to your heart's content. It changes, or it doesn't; in this particular case we've refreshed it and nothing has changed, and you've probably seen that before if you've used streaming systems: "wow, nothing changed," which is not super exciting. How fresh is this data? That's a legitimate question to ask, and I haven't actually told you yet; we'll see in just a moment. (I'll point out, by the way, that the third refresh did change, so things do actually change.) But it's legitimate to ask: by how much is this trailing our input?

One of the fun things about Materialize, one thing I find quite fun, is that we expose a lot of our internal system state through relations, so you can introspect on it and ask about the relationship between the times at which each of these views is current. One of the relations we publish reports dependency frontiers, and we're about to pull that data out. It tells us, for each of the views we're tracking, the lag between the output of the view and the input to the view. Let's read it out. Recall my_average, right in the middle: my_average depends on my_join, and it turns out it has 0 milliseconds of lag at the time of reporting, which means the results my_average would surface, if we queried it, are exactly the correct answer for my_join. my_join, in turn, depends on a few things (it was that four-way join) and it's lagging a few single-digit milliseconds behind its input sources. And, no joke, "exactly correct answer": this is strongly consistent; these are exactly the correct results as of circa three seconds ago.

So anyhow, this hopefully gives you a sense of some of the capabilities of Materialize: writing interesting, complicated SQL queries involving joins of many relations, and actually insisting on, and receiving, relatively low latency guarantees. I'm hesitant to say "real time" because everyone else is not hesitant to say "real time," but it's quite fast; faster, basically, than other systems. Okay, that's the demo; I think it worked out great, and I'm going to show you a few other things. This demo was picked because I imagined several of you might resonate with CLI-type tools, or at least not be as allergic to them as other people are.
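The view-over-a-view and the freshness check look roughly like this. The names are taken from the demo where possible; the introspection relation appears under a placeholder name, since the talk only describes it as a dependency-frontiers view and the actual catalog name varies by Materialize version:

    -- a view defined over another view: average revenue across the orders in my_join
    CREATE MATERIALIZED VIEW my_average AS
        SELECT AVG(revenue) AS avg_revenue FROM my_join;

    -- interactive reads against the join, largest revenue first
    SELECT * FROM my_join ORDER BY revenue DESC LIMIT 10;

    -- introspect freshness: how far each view's output trails its inputs
    -- (dependency_frontiers is a placeholder name for the logging relation)
    SELECT * FROM dependency_frontiers;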
There are other ways Materialize presents. This is one of the open-source BI tools out there, Metabase in particular. Because Materialize presents as something that looks a lot like a relational database (it looks like Postgres; we currently speak pgwire), you can attach these sorts of BI tools to Materialize. They're currently programmed to think "wow, a query is going to take a minute or two, I'd better not run it that often," and in Metabase we had to dive in and convince it to refresh once every second instead of once every minute, because we can handle that just fine. Imagine using the existing crop of BI tools, if you're familiar with them, in a much more interactive manner: something where you actually watch the data change second by second and scrub around in it, instead of hitting refresh once a minute. That's the look and feel we're going for.

With that in mind, what I thought I'd do with the remaining time is talk through a bit more of the architecture underneath the system. If you're currently thinking "wow, that's amazing, I wonder how it does that," I'll talk through that; if you're not thinking that, I'd ask you to watch the first part of the talk again. So, the architecture. We'll return to the earlier picture with the relational database; it shows up on the slide. I've drawn Materialize here as three layers; don't read anything into the ordering, the picture just works out better this way.

At the bottom, at the access point to Materialize, we have a bunch of clients. These are SQL clients, a lot like the CLI we saw in the demo, where someone connects and starts typing SQL, or in principle other sorts of interfaces: if you're using something like Metabase, it also connects as a client, asks queries, and builds up session state, prepared statements, and so on. All of these clients go into a coordinator. This is the central access point for Materialize, and it's where all of the clients can, indirectly, interact: they share state, so they can reuse the same views and names as other clients. If I created my_join and yelled out, "hey Craig" (by the way, Craig is also here from Materialize; I should have said that), and Craig says, "oh sweet, my_avg equals such-and-such," that works totally fine. We share all of these things, so if you have a collaborative data analysis team where some people build views and ship them over to other people, it works great. The coordinator mediates all of that; it also decides, if three of us want to call something my_join at the same time, who wins, and it provides the guarantees of consistency and durability and whatnot that an actual database would be expected to provide. Up top we have timely dataflow workers; if you're familiar with the work I do, this is where I show up. That's the data-parallel compute backplane, and it's what gets all of the work done: all of the incremental view maintenance and high-performance, high-throughput stuff happens back there.

When we look at how we interact with these layers: the clients speak SQL.
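Because Materialize speaks the Postgres wire protocol, a "client" here can be any pgwire-speaking tool, not just our CLI. As a sketch, with the address, user, and database name as assumptions about a local installation (6875 is the port Materialize has conventionally listened on):

    # connect an off-the-shelf Postgres client to a local Materialize instance
    psql -h localhost -p 6875 -U materialize materialize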
And I do mean SQL. This is not a SQL-inspired language; it's not SQL with a bunch of caveats where you can only do insertions; it's not SQL where you need to know what's a stream and what's a table or anything like that. It's SQL: you write a thing in SQL, we give you the correct answer, and we maintain the correct answer no matter how your data change. Insertions, deletions, we don't care. It's SQL-92; if you need WITH RECURSIVE, talk to us.

That goes into an intermediate representation, where we do some standard database analyses. The world is a little different in the streaming space, though: a lot of the analyses you would do in a database target performance characteristics other than the ones we worry about. We're very worried, for example, about the standing memory footprint of these queries, since we're going to install this dataflow; that's something a database doesn't necessarily worry about much, because its queries are ephemeral. So we look a little differently at these queries to make sure the system doesn't fall over. Up top, the timely dataflow system uses a framework called differential dataflow, again work I've done in the past, which is really good at incremental maintenance of relational, data-parallel computation: things like joins and aggregations and what have you.

These layers connect to the outside world schematically the way we saw before. There's a relational database that's pretty loaded, with a bunch of work going on: people are buying things from you, or visiting your website, or your logs are spitting out information, and all of that information flows into Materialize. We're currently using a bridge called Debezium that drops things into Kafka, but in principle there are lots of ways to get data into Materialize; it's a pretty simple format. The horizontal lines here are the data plane: the high-throughput, low-latency, and generally solid, strongly consistent compute infrastructure flowing this way. It's controlled by the control plane, the vertical dimension: clients come in, they talk to the coordinator, the coordinator sets up some dataflow, and then the data just move through it as promptly as possible, as fast as we can manage. On the right-hand side, if you want to write the results out to something like Kafka, that's great. The clients can also just interactively peek at the data and look at what's in it, but if you're doing high-throughput data transforms, sending the output back out to Kafka is totally sane. In principle we could also present as a read replica for a database; that's not something we're doing currently, but we have lots of options. If you're interested, definitely let us know, because we have some free cycles to think about what we want to do next.
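Getting data into and out of that data plane looks roughly like this. The broker address, topic names, and schema-registry URL are placeholders, and the CREATE SOURCE / CREATE SINK syntax shown reflects early Materialize releases; treat it as a sketch and check the documentation for the version you run:

    -- ingest a Debezium-formatted changelog that CDC has landed in Kafka
    CREATE SOURCE order_line
    FROM KAFKA BROKER 'kafka:9092' TOPIC 'mysql.tpcch.order_line'
    FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY 'http://schema-registry:8081'
    ENVELOPE DEBEZIUM;

    -- push a maintained view's changes back out to Kafka for downstream consumers
    CREATE SINK my_join_sink
    FROM my_join
    INTO KAFKA BROKER 'kafka:9092' TOPIC 'my-join-updates'
    FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY 'http://schema-registry:8081';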
Now let me sketch some of the internal architecture of the Materialize compute backplane that makes it different from a lot of the other systems you might be familiar with: different from, say, Kafka Streams, different from Flink, different from basically just about everything else. One of the big, important differences is that in timely dataflow we build up dataflows (I'll start drawing pictures of dataflows in a moment), and the dataflows are all sharded, in the sense that we might have multiple workers. Say there are four workers in this picture: each of them is executing the same dataflow, or fragments of exactly the same dataflow. We do not partition the work up between the workers. When we say "join," all of the workers help compute the join; when we say "reduce," all of the workers help compute the reduce. There's a good reason for that, and the example I'm going to build up will try to show you what that reason is.

Let's start out. You might create a source here, "orders"; I've just made this up, but in Materialize you'd announce a source, something like a Kafka topic that's meant to hold a changelog coming out of your relational database. We'll run off and grab that, start pulling in the data for you, and drop it into a named thing, "orders" here. Schematically, the green widgets represent materialized sharded indexes, one of the important concepts in Materialize. There are three words in "materialized sharded index," and they're all important.

They're materialized in the sense that we have actually collected the accumulated changes that have gone by, and we're sitting on them. If someone comes and asks us now, "hey, tell me about orders, what's the count of everything in orders?", we need to actually have that data around. If it washed past and we let it go, we don't have an answer anymore, and we feel really bad, because we're no longer giving you correct answers. So we absolutely materialize the stream that comes through: we accumulate all of the changes, pluses and minuses, and keep a representation of the current state of affairs for orders, and in fact a little bit of historical detail as well. They're sharded in the sense that the four workers in this picture are each maintaining roughly one fourth of the data. You can think of this as being sharded by something like a primary key: if orders have a primary key, we'll have distributed those keys across the four workers, and each worker handles roughly a fourth of the data. And they're indexes in the sense that the data are, in every case, indexed by some important quantity. A primary key is a great example: we'll index by primary key, so that if anyone needs access to this data by primary key in the future, a common thing in a join, we have it ready to go, with no additional data manipulation required.

We'll see a few more of these materialized sharded indexes; they're a pretty important concept, and I'll call out a few more reasons they matter. Now imagine someone shows up with another source, "customers," and we get asked to join them and reduce them; I've just imagined that "sales" comes out the other side. (It turns out putting sales and profit and things like that on the right-hand side of the picture makes people feel really good, as opposed to, say, defects.) Throwing down these additional bits of dataflow can be relatively cheap if we're already sitting on these materialized sharded indexes.
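A sketch of the dataflow being described, with an invented schema: orders and customers are sources announced as in the earlier CREATE SOURCE sketch, and the join-then-reduce result is what the slide labels "sales." Every worker participates in both the join and the reduce, reading from the materialized sharded indexes it already holds:

    -- join orders to customers, then reduce to per-part, per-region sales totals
    CREATE MATERIALIZED VIEW sales AS
        SELECT o.part_id, c.region, SUM(o.amount) AS total_sales
        FROM orders o
        JOIN customers c ON o.customer_id = c.customer_id
        GROUP BY o.part_id, c.region;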
Imagine a world (it doesn't make a lot of sense in this particular example, but imagine it) where orders and customers were both already sharded by their primary keys and we were joining on those keys. That doesn't quite make sense, because they're the primary keys of two different relations, but bear with me: we would not need to move any of the data associated with orders and customers around. They're already co-located appropriately on their workers, partitioned by key, and we're good to go. The operators here, things like join and reduce and whatnot, are all designed as reactive implementations: as changes flow in from the sources and into these indexes, the indexes present changes onward, and the operators produce the correct output changes. So if, say, five new orders show up, the join operator is smart enough to ask who those might match with; it looks up whether those five things hit anything over in customers and produces any new matches in the output. If a customer is retracted for some reason (I don't know why; maybe they're not so happy), the join needs to think about whether it should retract anything from its output. The reduces, similarly, update their aggregations and accumulations correctly at the same time.

Now let's see a cool example of these sharded dataflows being used excitingly. I've just put together another source, "parts," and let's say I now want to join it with sales. Sales is that materialized index over on the right there, and those are the same thing: not only conceptually the same thing, but literally the same in-memory index. If someone wants to start using sales, they just grab that pointer and start working with it, which is really cool. Sorry, I mean the Materialize system, in response to them typing the join, is able to just grab this pointer and start working with it, rather than reflowing the full contents of sales off to some other machine that would build up another index of it. So if we join with sales, and imagine that parts is relatively small (sales is many gigabytes and parts is a few thousand records), turning on this dataflow is a matter of milliseconds; it's just a matter of synchronizing between the workers. We don't actually have to flow sales anywhere, and the thousand records that come in from parts, a thousand changes from the initially empty collection, just result in lookups into sales, and you get your answers back, again in milliseconds or so. You're not re-grinding through large data sets you've already maintained in indexed form. This is perhaps one of the most important points of departure from other systems in this space that have adopted, to their credit, pure streaming architectures, where there is no persisted or maintained state. The downside of that choice is that they re-grind through these large volumes of data each time you need to use them.
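And the cheap-new-dataflow example, again with invented names: parts is a small new source (announced the same way as the others), and defining the view below does not re-shuffle the gigabytes already held in the sales index; it reuses that in-memory index, so the new view starts answering within milliseconds:

    -- parts is assumed to be a small source, a few thousand records
    -- the join below looks up parts rows against the existing sales index
    CREATE MATERIALIZED VIEW part_sales AS
        SELECT p.part_id, p.part_name, s.region, s.total_sales
        FROM parts p
        JOIN sales s ON s.part_id = p.part_id;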
So let me explicitly call out a few of the good things these indexes do, and then I'll wind down and we can swing into question-asking-and-answering mode. The main technical point I'd love to make is that these materialized sharded indexes improve several things. They improve throughput, in the sense that instead of keeping five copies of one indexed collection, we keep just one; that means we do a lot less work as we receive your data, so we can handle higher rates of incoming changes. They improve latency in several ways, and query latency is one I just called out: the amount of time between when you finish typing the query and press enter and when we're able to start giving you answers back is potentially a lot lower if we're already sitting on indexed representations of the data assets we need to answer the question. That takes things from potentially multiple seconds down to milliseconds, especially if you're interrogating large collections. If you go in and say, "I'm really interested in this known subset, the people who've complained and sent me emails in the past day, and I want to look up what their experience was," that's a relatively small set against a big set, and it comes back in milliseconds and continually updates, as opposed to spinning up a whole new dataflow and, well, you can tell me how long that takes.

A really important thing that I think not a lot of people look at is query scaling, meaning what happens as more and more queries show up, as your data analysis team grows. If it's just one person asking one word-count query, great, we can draw that on the board. If it's one person asking five queries, or twenty people each asking ten queries, what are the resources required? Do you actually need, in that case, 200 times as many resources as for one dataflow? One of the really nice things about Materialize is that if people are reusing the same data assets, they can keep asking more questions. As the number of distinct uses of the underlying data grows you need more resources, but if people are just reusing the same customers relation, the same source data, you don't necessarily need to invest in a lot more. There's a lot of really nice sharing that happens, which makes scaling queries out to larger and larger teams possible without blowing your cloud budget.

All right, take-home points, last slide, and then I'm going to stop. One of the goals of Materialize is to try to encourage people (this may not be the right audience, but still) to replace some of the microservice morass out there, which is a lot of people writing thousands of lines of Java trying to make sure they've covered the various cases of reading from this, that, and the other. To the extent that you can replace that with ten lines of something like SQL that just declaratively states "this is what I want; someone else will handle making it work out correctly," great. You don't have to stop doing the stuff you like, but you can focus more of your attention on the hard cases and let the easy cases be handled by a system that means to do this well and has put a bunch of engineering into it. So: outsource the complexity of the easy things that you can describe to us. And there's a new, cool thing you can do that I'd love to get people thinking about, which is interactive analysis against live data. In a lot of prior systems, either the data comes in relatively slow chunks (once every hour or so; your data warehouse only refreshes so fast), or, in streaming systems where the data move fast, your ability to launch a new query and get results back is gated by the time to build and deploy the dataflows. We're able to do both of these at millisecond timescales, so you can interactively explore data, learn about it, understand it, ask what question you need to ask next, and respond to the data you're seeing, interactively.
I bet a lot of you are familiar with DevOps-style problems: something is on fire in your computers, and how long does it take you to figure out what's going on? You start asking questions; you really want to know the current state of affairs; and as soon as you see it, you think of a different question to ask. There isn't just one standard set of questions you need. This is something we think is really opening up now, and we'd love to get our heads around all the exciting use cases: how much power does this give people, and what will they do with it?

So I'm going to wind down there; I apologize that I went a little long. There's some contact information, and a few places to check out online. The Materialize site is us, the organization; the underlying compute substrate is all publicly accessible and open source, and you can find it under the TimelyDataflow organization on GitHub. I'm Frank McSherry, chief scientist at Materialize. Feel free to contact me using my very professional email here, and if you'd prefer to be unprofessional, my Twitter handle is down there, which is a great way to bait me into responding to your concerns in a more public forum. I'm going to stop with that; I appreciate the time you've given me, and I believe we have some time for questions.

Q: On the very left side of the picture there's just one database. Can Materialize ingest data from multiple databases?

A: Yes, good question. If you're just mirroring one database the story is relatively easy; what happens with multiple data sources is a great technical question. Operationally, the underlying compute substrate, the differential dataflow substrate, handles this cool thing called multitemporal dataflow. If you think of the timestamps these systems use as commit IDs, like a GTID in MySQL, each giving a consistent view of its database, then once you have two of these marching forward that are unrelated, you need some story for how to talk about two different notions of time at the same time. As best I understand it, the academic underpinning here is multitemporal analysis: you have essentially two dimensions of time; they largely move together, but you need to think about them separately. The underlying substrate supports that. Materialize itself, at the moment, does not; our times are integers. But it's easy enough to think about changing that as people show up and say it's crucial for them. So the short answer is: not right now, but we'd love to know if it's important to you. We totally could do this given enough time, and if everyone in the room said "yes, that's what we need," we would absolutely go at it right away.

Q: Thank you, the talk was great. One question, maybe two questions, to be honest. The first: you mentioned that you stream, or get, a changelog, and that the changelog flows through Kafka to your system, to Materialize. Do you mean by "changelog" the write-ahead log, which is the technical term for the change stream in relational databases? And if so, how do you monitor it, how do you stream it, how do you connect to it to push it into Kafka? That's the first question.
And the second question: the demo was great, but I noticed that when you create a view it takes tens of milliseconds. Is that real, or is it just the parsing of the SQL, with the real materialization happening in the background?

A: Great, two questions. The first one has a very easy answer: there's an existing project called Debezium that is meant to be a bridge from several different existing databases into a Kafka changelog-style feed. They monitor the binlogs, or, depending on the particular database, the write-ahead log, and in some cases do a little bit of magic there, because not all of them present exactly the right information, but they present a stream of, for each record, the before and the after of that record.

For the view creation in the demo, you're absolutely right: that is setting up metadata, building the dataflow, and starting to stream data in. So if I were incredibly fast at typing, I could have created a view and then looked at it immediately, and you might have seen data that was a bit stale, because the view had not yet fully populated. This is what we use that last part of the demo for, the information about the lag on these views. At the moment of creation it would initially show quite a lot of lag; we often see something like 47 years of lag, I guess 49 years now, because it starts at zero, Unix time. But pretty quickly these things fill in, and absolutely, you'd want to dive back into that logging and reporting relation to make sure you're seeing the right thing.
Info
Channel: Data Council
Views: 4,016
Rating: 4.8805971 out of 5
Keywords: machine learning, computer vision, AI, big data, technology, engineering, software engineering, software development
Id: zWSdkGq1XWk
Length: 35min 40sec (2140 seconds)
Published: Tue Nov 26 2019