CockroachDB: Architecture of a Geo-Distributed SQL Database | Cockroach Labs

Video Statistics and Information

Captions
Over the past two days we've heard a lot about data science, machine learning, and analytical workloads. I'm going to flip the script here and close out with a talk about large-scale transactional workloads and the systems that make them possible. I saw Frank McSherry's talk earlier today about Materialize, and he had a little red square at the bottom of the screen that was supposed to be the RDBMS feeding his system. That little red square is what we're going to focus on here. Can I get a show of hands for people who have used a transactional database in the past? Awesome, so pretty much everyone. How many people have used a distributed transactional database? Still a lot of people—that's awesome. Today I'm going to take you on a whirlwind tour of the architecture of one such distributed OLTP database, CockroachDB, and my goal is for people to leave here today with a deeper understanding of the inner workings of these distributed SQL databases.

Here's the three-word pitch for CockroachDB: make data easy. This is actually the mission statement at Cockroach Labs, the company behind CockroachDB. Our thesis is that too much burden has been placed on app developers. Traditional SQL databases have failed to scale horizontally, survive failures, or replicate across the world, and in today's application landscape these are requirements for developers. NoSQL came around and offered horizontal scalability, but it gave up transactions and indexes—and in some cases even correctness—which in turn put the burden right back on app developers. So there's been an emergence of a new class of "NewSQL" databases that promise to bridge this gap: they offer scalability without giving up on consistency.

Before diving in, I'd like to introduce myself. I'm Nathan VanBenschoten, an engineer at Cockroach Labs. I've been working on CockroachDB for about four years, first as an open-source contributor and later as an employee of the company.

Here's our agenda for the talk: we're going to start at the bottom of CockroachDB's architecture and work our way up. At the very bottom of CockroachDB is a distributed, replicated, transactional key-value store. That's a lot of buzzwords, but we're going to go through each one. The first thing to realize about this key-value store is that it holds keys and values that are arbitrary byte strings; later we'll talk about the structure that higher levels in the system impose upon these keys and values. CockroachDB uses multi-version concurrency control (MVCC), and it uses a hybrid logical clock scheme to version these keys and values. That means values are never overwritten in place; they're instead written at a new version, which improves concurrency within the system, and a lot of databases nowadays use this approach.

The key-value store exposes a single monolithic logical key space, ordered by key. Here I'm showing just the keys for clarity, but in reality there are values associated with each key—in fact, multiple versions of the value associated with each key. The ordered, monolithic key space is logical, but it's physically realized by breaking the key space into 64-megabyte chunks which we call ranges. These ranges start empty, they grow as data is added, they split when they get too large, and they merge back together with their neighbors when they get too small. Conceptually, you can think of CockroachDB as starting with a single empty range and growing from there as users add data.
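To make the MVCC idea concrete, here is a minimal sketch of a versioned key-value map where writes add new versions instead of overwriting. The types and the simplified integer timestamp are invented for illustration; this is not CockroachDB's actual storage format or hybrid-logical-clock implementation.

```go
// Minimal sketch of an MVCC key-value store: values are never overwritten,
// each write adds a new version tagged with a timestamp.
package main

import (
	"fmt"
	"sort"
)

type version struct {
	ts    int64 // simplified stand-in for a hybrid logical clock timestamp
	value string
}

type mvccStore struct {
	data map[string][]version // versions kept sorted by ts, newest last
}

func newStore() *mvccStore { return &mvccStore{data: map[string][]version{}} }

// put writes a new version; existing versions are left untouched.
func (s *mvccStore) put(key, value string, ts int64) {
	s.data[key] = append(s.data[key], version{ts: ts, value: value})
	sort.Slice(s.data[key], func(i, j int) bool { return s.data[key][i].ts < s.data[key][j].ts })
}

// get returns the newest version visible at or below the read timestamp.
func (s *mvccStore) get(key string, ts int64) (string, bool) {
	vs := s.data[key]
	for i := len(vs) - 1; i >= 0; i-- {
		if vs[i].ts <= ts {
			return vs[i].value, true
		}
	}
	return "", false
}

func main() {
	s := newStore()
	s.put("dog/muddy", "v1", 10)
	s.put("dog/muddy", "v2", 20)
	fmt.Println(s.get("dog/muddy", 15)) // reads v1: the write at ts=20 is invisible at ts=15
}
```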
Data is placed into these ranges using order-preserving data distribution; we don't use a hashing scheme like consistent hashing. This is actually very important for layering SQL on top of the key-value store, because SQL demands a lot of efficient, ordered key-value scans within a range. The keys are also ordered on each physical machine using a per-node embedded key-value store—CockroachDB uses RocksDB for this purpose. In order to maintain the ordering between ranges, a range indexing structure is needed. We store this structure inside a set of well-known ranges within the system, so that anyone can locate the top of the structure, which ends up looking kind of like a B-tree. This provides a two-level indexing structure, similar to Bigtable, HBase, and Spanner. The fully ordered keys enable efficient scans over ranges. Here's a depiction of scanning a dogs table between the names "Muddy" and "Stella": you can see that we use the indexing structure to look up exactly which ranges store that span of keys, and then scan just over those keys.

Now that we've talked about how data is organized within this key-value store, we can talk about how reads and writes access the data. To do this we need to switch gears and talk about replication. For those unfamiliar, replication means keeping multiple copies of the same piece of data in sync across machines. CockroachDB uses Raft for replication. Raft is a distributed consensus protocol, similar to Paxos. So how does CockroachDB use Raft? Each range within the system is its own Raft group—and I'll say that one more time, because it's a source of confusion: every single range within CockroachDB runs its own Raft consensus protocol to keep its portion of the key space in sync across multiple replicas. We don't replicate at the level of nodes; we replicate at the level of these ranges, and by replicating at the level of ranges we get fine-grained control over data placement across the cluster.

The distributed consensus algorithm provides a very useful building block for our key-value store, namely atomic replication: commands that change state are proposed by a leader, distributed to followers, and only accepted once a quorum of replicas acknowledges the request. In Raft, a quorum means a strict majority. Here we have three replicas holding the same piece of data; when a write comes in, it's distributed to all three replicas, and only once two out of the three replicas have written the command to disk is it acknowledged to the client. This means that even if some of the nodes fail, the write will still be present in the system. This is an important part of CockroachDB's resiliency story, as it means we can lose machines and still stay online.
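A toy sketch of the quorum-write idea follows, assuming a fixed three-replica group and ignoring leader election, failures, and log-replication details; it is not the Raft implementation CockroachDB uses.

```go
// Toy sketch of quorum replication: a write is acknowledged to the client
// only after a strict majority of replicas have durably applied it.
package main

import "fmt"

type replica struct{ log []string }

func (r *replica) apply(cmd string) bool {
	r.log = append(r.log, cmd) // pretend this is a durable write to disk
	return true
}

// proposeWrite returns true once a strict majority of replicas ack the command.
func proposeWrite(replicas []*replica, cmd string) bool {
	quorum := len(replicas)/2 + 1
	acks := 0
	for _, r := range replicas {
		if r.apply(cmd) {
			acks++
		}
		if acks >= quorum {
			return true // safe to acknowledge the client
		}
	}
	return false
}

func main() {
	group := []*replica{{}, {}, {}} // one Raft group = one range, three replicas
	fmt.Println(proposeWrite(group, "put dog/sonny"))
}
```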
Quorum replication is reasonable for writes, but it's fairly expensive for reads, where no state is changing. Consider this example of reading a key from our dogs table: using a consensus read would mean going to all three replicas, waiting for a majority of them to respond, and piecing together the responses to determine the consistent state across the replicas. That would slow down reads and impose extra network traffic throughout the cluster. It would be nice if we could just go to a single replica, ask for its local state, and return that to the client—but we can't read from just any replica, because that replica might be behind and might not have the latest state for a given key. The replica that coordinates these writes, which I've called the leader so far but which we'll call the leaseholder from this point on, is always aware of which writes have been committed, so it can service reads without communicating with the other replicas in the system. We say that the leaseholder holds the range's lease, and these leases give CockroachDB consistent reads without consensus. In addition, the leaseholder implements some of the concurrency control mechanisms used by transactions, which we'll talk about later; these include key-range locking and deadlock detection.

So data in CockroachDB is stored in ranges, these ranges are replicated, and the replicas can be stored on any nodes in the cluster. Where do we put the replicas? This is the replica placement problem, and there are four signals that CockroachDB uses to determine where to place replicas. In the interest of time I'm going to skip over the first heuristic; it can be summed up as trying to balance space utilization across the cluster—we don't want any one node running out of space.

Let's start with diversity-based placement. For a distributed system, diversity improves availability when replicas are spread across failure domains. Failure domains are any unit of failure, such as a disk, a single machine, a rack, a data center, or even a region. For obvious reasons we wouldn't want to store two copies of the same piece of data on the same disk, because if that disk went down we'd lose both copies. So CockroachDB attempts to spread replicas across as many failure domains as possible, while adhering to administrator preferences for where data can and can't be placed.

The next heuristic for replica placement is load. You might imagine that spreading out space usage across the cluster would be enough to spread out load, but that's not really the case. For instance, I said earlier that reads are only serviced by leaseholders, so if we put all the leaseholders on the same node, then all of the reads would go to that one node and we'd have a huge imbalance in CPU utilization and network traffic. So the first mechanism here is to spread leaseholders out across the cluster. Unfortunately, load isn't evenly distributed across the key space either. For instance, in this slide the blue range has much higher load than the rest of the key space, and we see this in practice all over the place: customers often have reference tables that are accessed on every single query, and those tables end up being a lot hotter than the rest of the system. So CockroachDB tracks per-range load metrics and then balances load for those ranges across the cluster. In this case we see that the blue range ends up on its own node because it's accessed so frequently. In practice there are thousands of ranges per node, but this heuristic is still very important.

The final heuristic used to determine where to place replicas is latency. The distributed key-value store is actually a geo-distributed key-value store, and the "geo" part is very important here: these geographic latencies can be significant, from tens of milliseconds for small hops to hundreds of milliseconds for large hops. CockroachDB implements two mechanisms to place data where it's needed. The first is a manual mechanism, where a user can specify, for instance, that all the replicas for a range must reside in one locality, or that the leaseholders for a range must be in some locality—for instance, that the leaseholders for all ranges containing users' data must remain in California, because that's where the data is being accessed. The other mechanism is automatic: CockroachDB attempts to move leaseholders and replicas close to where they're being accessed from. This is called "follow the workload," and it allows CockroachDB to automatically move data to where it's being accessed and adapt to changes in workloads.
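A rough sketch of the diversity heuristic, assuming a hypothetical node type with region, data center, and rack labels; CockroachDB's real allocator weighs many more signals than this.

```go
// Rough sketch of diversity-based replica placement: prefer candidate nodes
// that share the fewest failure domains (region, data center, rack) with the
// replicas already chosen.
package main

import "fmt"

type node struct {
	name, region, datacenter, rack string
}

// overlap counts how many failure domains a candidate shares with existing replicas.
func overlap(candidate node, existing []node) int {
	score := 0
	for _, r := range existing {
		if candidate.region == r.region {
			score++
		}
		if candidate.datacenter == r.datacenter {
			score++
		}
		if candidate.rack == r.rack {
			score++
		}
	}
	return score
}

// pickReplica chooses the candidate with the least failure-domain overlap.
func pickReplica(candidates, existing []node) node {
	best := candidates[0]
	for _, c := range candidates[1:] {
		if overlap(c, existing) < overlap(best, existing) {
			best = c
		}
	}
	return best
}

func main() {
	existing := []node{{"n1", "us-east", "dc1", "rack1"}}
	candidates := []node{
		{"n2", "us-east", "dc1", "rack2"}, // same region and DC as n1
		{"n3", "us-west", "dc3", "rack7"}, // entirely different failure domains
	}
	fmt.Println(pickReplica(candidates, existing).name) // n3
}
```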
Moving up the stack, the next thing we'll talk about is transactions—I said this was a transactional key-value store. CockroachDB implements serializable isolation. The topic of isolation levels is confusing but fascinating; it's a very deep topic, and yet serializable isolation is fairly easy to think about and fairly intuitive: concurrent transactions behave as if they were executed in some serial order, with no interleaving of operations between them. Again, this goes back to making data easy. Transactions in CockroachDB can span arbitrary ranges of keys—they can touch the entire key space if they want to—and they can be accessed through a conversational interface. This conversational interface is important for SQL: applications often start a transaction, read some data, manipulate it in the application, and write it back, all within the same transaction.

Let's dive briefly into the implementation of transactions. Transactions provide a unit of atomicity: they're all or nothing, they either succeed or fail altogether. Transaction-level atomicity in CockroachDB is bootstrapped on top of the range-level atomicity provided by Raft. Each transaction has an associated transaction record, and updates to the transaction record are stored on a single range, so we're able to indirect through this transaction record to provide atomicity for the entire transaction.

I'll walk through a simple example. Here we have an INSERT statement that's inserting two rows into a SQL table backed by a four-node cluster. The INSERT statement is inserting two dogs into our dogs table, and the SQL execution layer turns this into a series of key-value operations—in this case, writing the keys "Sonny" and "Ozzy" inside a single transaction. The first thing a transaction does when it begins life is create its transaction record, which it stores on the first range the transaction writes to. The transaction record really holds only one piece of information: the status of the transaction. It begins life in a pending status, and later moves into either a committed status or an aborted status. Each write the transaction performs while it's running is a provisional write called an intent. Intents are tagged with the transaction's ID, and if another transaction encounters one of these intents, it has to hop over to the transaction record to determine whether that intent should be visible or not. So we see that both the transaction record and the intent on "Sonny" are replicated within this first range, and because we're using Raft, we wait for a majority of replicas to acknowledge the change before continuing on. The SQL layer then moves on to writing the second key, which might be on a different range than the first. It goes to the leaseholder for the second range—the range in blue—and writes an intent on "Ozzy"; again, the leaseholder proposes this to its followers and the followers acknowledge it. Finally, the SQL gateway—the node processing the SQL—commits the transaction, which involves going back to the transaction record and marking it as committed. This operation is again replicated for durability, and once it's complete the transaction is considered committed, so the SQL gateway can acknowledge the client. Finally, a background process cleans up the intents so that reads of them don't need to visit the transaction record. Note that even if this cleanup hasn't happened yet, readers that encounter these intents will go to the transaction record, find that the transaction is committed, and treat the intents as if they were committed values.
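A toy sketch of intent visibility along the lines described above: a reader that encounters an intent consults the transaction record's status. The types and the fallback behavior for pending transactions are simplified assumptions, not CockroachDB's actual code.

```go
// Toy sketch of how a reader might treat a write intent: the intent points at
// its transaction record, and the record's status decides whether the
// provisional value is visible.
package main

import "fmt"

type txnStatus int

const (
	pending txnStatus = iota
	committed
	aborted
)

type txnRecord struct{ status txnStatus }

type intent struct {
	txn   *txnRecord // in reality this is an indirection through the txn's ID
	value string
}

// readIntent decides what a reader sees when it encounters an intent.
func readIntent(in intent, lastCommitted string) string {
	switch in.txn.status {
	case committed:
		return in.value // treat the intent as a committed value
	case aborted:
		return lastCommitted // ignore the intent entirely
	default:
		// A real system would push or wait on the pending transaction;
		// here we just fall back to the last committed value.
		return lastCommitted
	}
}

func main() {
	rec := &txnRecord{status: pending}
	in := intent{txn: rec, value: "Ozzy"}
	fmt.Println(readIntent(in, "<none>")) // <none>: txn still pending
	rec.status = committed
	fmt.Println(readIntent(in, "<none>")) // Ozzy: txn committed
}
```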
I've talked a lot about keys and values, but CockroachDB is a geo-distributed SQL database, so now it's time to talk about SQL. CockroachDB implements the SQL you're familiar with—standard SQL, from schemas to indexes to transactions to foreign keys. The dialect of SQL we implement is exactly Postgres's, and the wire protocol we speak is exactly Postgres's, so your client application thinks it's talking to a Postgres database when it's talking to CockroachDB. The goal here is for users to be able to pick up Active Record or Hibernate or whatever tools they're used to using and plug them right in without missing a beat.

I'm sure most people here are familiar with SQL. SQL is declarative: you describe the results you want, not the steps for how to compute those results. This is important, as it provides a tremendous amount of flexibility in the implementation of SQL, but it's also a large burden for the SQL database implementers. SQL is tabular, and tables are composed of rows and columns. The columns are defined in a schema and have types like FLOAT or STRING. Tables have optional indexes that provide efficient lookups into the table, and they have foreign keys to enforce referential integrity.

But wait a minute—I said SQL data has columns and types, and yet we've talked only about untyped key-value pairs. This seems rather distant, but it turns out it's not too distant: we can map SQL directly onto key-value pairs without all that much work, and we'll see how. Let's start with a basic table. In CockroachDB, as in most SQL databases, every table has a primary key, which is just a special index that's required to be unique and required to store all of the columns in the table. Each row in the primary key maps directly to a single key-value pair, and each row in a secondary index also maps to a single key-value pair. For each index, the indexed columns are encoded into the key of the KV pair and the non-indexed columns are encoded into the value of the KV pair. You can probably imagine a scheme for encoding the value—we could have used protobufs or Avro or any other encoding scheme; in practice we chose to do our own encoding for performance, but it doesn't really matter. What matters much more is the encoding of the keys: we need an encoding such that the encoded keys are ordered the same way as the logical columns they encode. With our inventory table here, we need the encoded id column to be ordered the same way we logically order integers. People here might be able to come up with an encoding that satisfies that requirement, and I'm just going to gloss over the fact that it's possible to do this with arbitrary data types and arbitrary tuples. We didn't invent this idea, but we use it heavily in this mapping from SQL tables to key-value pairs.
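Here is a minimal sketch of an order-preserving key encoding for a table with an integer primary key; the table and index IDs are hypothetical, and CockroachDB's real encoding supports many more types and is more compact.

```go
// Minimal sketch of an order-preserving key encoding: the encoded bytes sort
// (with bytes.Compare) in the same order as the logical values they encode.
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// encodeKey builds /table/index/id as bytes whose lexicographic order matches
// the numeric order of id. Flipping the sign bit makes signed ints sort
// correctly; big-endian makes byte order match numeric order.
func encodeKey(tableID, indexID uint32, id int64) []byte {
	buf := make([]byte, 0, 16)
	buf = binary.BigEndian.AppendUint32(buf, tableID)
	buf = binary.BigEndian.AppendUint32(buf, indexID)
	buf = binary.BigEndian.AppendUint64(buf, uint64(id)^(1<<63))
	return buf
}

func main() {
	a := encodeKey(53, 1, -5) // hypothetical table ID 53, primary index 1
	b := encodeKey(53, 1, 10)
	// bytes.Compare agrees with -5 < 10, so ordered scans over encoded keys
	// return rows in primary-key order.
	fmt.Println(bytes.Compare(a, b) < 0) // true
}
```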
CockroachDB supports multiple tables and indexes by prefixing each key with the table and the index, and this is what that looks like for our inventory table. But SQL is a lot more than just tabular data storage: SQL is a powerful query language, and since CockroachDB is a SQL database we need to implement that query language. I'm going to go through a quick overview of how basic SQL query execution works before diving into distributed query execution and finishing up with a mention of query optimization.

SQL has a relatively small number of basic relational operators—projection, selection, aggregation, joins, and just a few more—and these are expressed in SQL via constructs like SELECT, WHERE, GROUP BY, and JOIN. These relational operators are composed into relational expressions, and a query tree (or query plan) for a SQL query is just a tree of these relational expressions pieced together. Executing a query plan just runs these operations to completion.

I'll give a brief example of how SQL query execution works with this statement here, where we're reading from an inventory table and filtering on the name column. Under the hood, we start with a scan—the source of the data—and pass it through a filter operator, filtering based on the name of each row. The resulting filtered rows are projected down to just the columns that are desired, in this case just the name column, and that result is returned to the client. It's all fairly straightforward, but here it's also fairly inefficient: people might have picked up on the fact that we're performing a full table scan, scanning over the entire inventory table before we do any filtering at all. Without more information, that's really all we can do. But what if we had an index directly on the name column? Then we could scan a contiguous set of keys and perform the filtering right there in the scan—we could push the filter into the scan itself. We'd still have to project out the name column, but that's a fast operation, and then we'd just return the results to the client. The decision to use such an index is called a transformation, and it's performed by the SQL optimizer, which I'll dive into shortly.

At a high level these relational operators are straightforward, but there are still two fairly large challenges to deal with. The first is correctness: user-defined schemas mean there's a lot of generality involved in SQL, which results in a lot of bookkeeping, and making that bookkeeping fast is tricky. That leads into the second challenge: we need to make this fast, because users expect SQL execution to be performant. There are three tactics we use to make SQL performant. The first is fairly straightforward: we try to write tight, well-written code that doesn't allocate per row and doesn't waste a lot of CPU. The next tactic is something most SQL systems do: we specialize operators for specific scenarios. For instance, a general-purpose GROUP BY operator needs to handle inputs that are unordered, and it typically does this using a hash table; but if the input is ordered, we can often skip the hash table and just stream the results back up.
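A small sketch contrasting a hash-based GROUP BY with a streaming one that assumes sorted input; this illustrates the specialization idea only and is not CockroachDB's actual aggregation code.

```go
// Illustrative sketch of operator specialization: a general GROUP BY needs a
// hash table, but when the input is already sorted on the grouping column we
// can stream groups out as we go, keeping only the current group in memory.
package main

import "fmt"

type row struct {
	key string
	val int
}

// hashGroupSum handles arbitrary input order by accumulating into a map.
func hashGroupSum(rows []row) map[string]int {
	sums := map[string]int{}
	for _, r := range rows {
		sums[r.key] += r.val
	}
	return sums
}

// streamGroupSum assumes rows arrive sorted by key and emits each group as
// soon as the key changes.
func streamGroupSum(rows []row, emit func(string, int)) {
	cur, sum, started := "", 0, false
	for _, r := range rows {
		if started && r.key != cur {
			emit(cur, sum)
			sum = 0
		}
		cur, started = r.key, true
		sum += r.val
	}
	if started {
		emit(cur, sum)
	}
}

func main() {
	sorted := []row{{"a", 1}, {"a", 2}, {"b", 5}}
	streamGroupSum(sorted, func(k string, s int) { fmt.Println(k, s) })
	fmt.Println(hashGroupSum(sorted))
}
```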
Lastly, being a distributed system means we can distribute SQL execution itself. We're a little short on time, so I'm going to skip straight to this last point. With a geo-distributed database, network latency and throughput become very important considerations, and with distributed SQL execution we're able to push fragments of computation as close to the data as possible to avoid them slowing down queries. Let's look at an example of a SQL query that's aggregating rows using a GROUP BY operator. If the table is distributed across three data centers, the query will instantiate three processors to scan the table. These processors sit right next to the data they're scanning, and their output is fed into a local GROUP BY operator, still within the same data center. Finally, the output of each per-data-center GROUP BY is fed back to a single final stage, and that's what's returned to the client. This reduces network usage over wide-area network links and also offers an opportunity for parallelism, which is beneficial even when you're running a system like CockroachDB within a single data center.

Moving on to our last topic—and I think we are short on time—I said that SQL queries are declarative and that it's up to the SQL database to decide how to execute them. That ends up being the job of the SQL optimizer. There can be a lot of logically equivalent plans for a query, and it's up to the optimizer to choose which one to run. There are four stages to running a query: we start by parsing the query and turning it into an abstract syntax tree; next we prepare the query by folding constants, checking types, and doing a lot of other preparation work; then we use a cost-based optimizer to search through the different transformations that are possible on this query to see whether better query plans can be made; and finally we pick the lowest-cost plan to run.

That was an overview, but when it comes to SQL optimization there are actually two types of query transformations: cost-independent transformations and cost-dependent transformations. I'll start with the first. Cost-independent transformations are transformations that always make sense to apply to a query—again, we're searching for the best query plan, and we always want to apply these. An example is constant folding: we want to fold constants ahead of time rather than doing that work per row while the query runs. In CockroachDB these transformations are all defined in a custom DSL, which lets us match against patterns within a query plan and then perform the transformation right in line. There's also another set of transformations that aren't universally good: some of them produce faster query plans and some don't. Internally CockroachDB calls these exploration rules—they explore the set of logically equivalent query plans. So how do we decide whether to apply one of these rules if they're not always good? We actually always apply them, and then we store both the pre- and post-application versions of the query plan. If you're thinking that could get expensive fast, you're right, but a memo structure makes it feasible, and then we use table statistics to estimate the actual cost of each of these query plans.
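A tiny sketch of a cost-independent rule, constant folding over an invented expression tree; CockroachDB's real rules are written in its optimizer DSL rather than hand-coded like this.

```go
// Tiny sketch of a cost-independent transformation: constant folding over an
// expression tree. A rule like this always makes sense to apply, so the
// optimizer runs it unconditionally before costing plans.
package main

import "fmt"

type expr interface{ fold() expr }

type constant struct{ v int }
type column struct{ name string }
type add struct{ left, right expr }

func (c constant) fold() expr { return c }
func (c column) fold() expr   { return c }

// fold rewrites `const + const` into a single constant; other shapes are
// left alone (after folding their children).
func (a add) fold() expr {
	l, r := a.left.fold(), a.right.fold()
	if lc, ok := l.(constant); ok {
		if rc, ok := r.(constant); ok {
			return constant{v: lc.v + rc.v}
		}
	}
	return add{left: l, right: r}
}

func main() {
	// WHERE x + (2 + 3)  ->  WHERE x + 5
	e := add{left: column{"x"}, right: add{left: constant{2}, right: constant{3}}}
	fmt.Printf("%+v\n", e.fold())
}
```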
Here's one last example of what this looks like. We have a query that's scanning over a table, filtering on a column x, and then ordering by a column y. The question is: which query plan do we choose to execute this query? The initial AST gives us a plan that just scans, filters, and then sorts. This is pretty basic, but it would work. If we have an index on column x, we can push the filter into the scan, as we saw earlier, and this would speed up execution. And if we have an index on y, we can use the ordering in that index to avoid the sort, which again could speed up execution. But it's unclear which of these last two query plans is better—what really matters is how many rows pass between each stage of the query plan. For instance, if filtering brought us down to just 10 rows, the scan would be very fast and sorting 10 rows is very quick, so we'd want to choose the index on x. But if filtering left us with half of the rows, we wouldn't save much by using the index on x; however, we would avoid the entire sort by using the index on y, so we'd actually want to choose that one. We can't know ahead of time which plan will be better, but we can use table statistics to estimate which will be better and then pick the cheaper plan.

Something CockroachDB does because it's distributed, which most systems don't have to deal with, is take locality into consideration in its cost model. Just as we used table statistics to decide which plan to run, CockroachDB uses the locality of the data it's scanning to decide which query plan to run. Here I've defined a table called postal_codes with an index on id, its primary key, plus two duplicate indexes, so there are three copies of the data, one in each of three regions around the world. This is a typical reference table that foreign keys might point at. The optimizer is able to use the locality of the different indexes to decide which index to run queries against, to avoid wide-area network hops. This is a simple example, but the mechanism is general and we get a lot of use out of it.

So let's review. Internally, CockroachDB contains a distributed, replicated, transactional key-value store. The KV store exposes a monolithic key space that's physically broken into 64-megabyte chunks called ranges. Replicas of these chunks can be placed on any node, and a set of replica placement heuristics determines where to place them. Tabular SQL data is mapped onto this key-value store. SQL execution is a lot of bookkeeping along with a heavy dose of performance work, and distributed SQL execution moves partial SQL execution as close to where the data resides as possible. SQL optimization attempts to find the best execution plan for a logical query, and locality-aware SQL optimization recognizes the importance of latency in determining the best execution plan. And that's really it—at this point you've walked through the whole architecture of a distributed SQL database, and you're all certified to go build one. Congratulations, and thank you.
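A back-of-the-envelope sketch of picking between the two index plans using estimated row counts and selectivity; the cost formulas here are made up purely to illustrate the trade-off and are far simpler than a real optimizer's model.

```go
// Sketch of cost-based plan selection: estimate each candidate plan's cost
// from table statistics (row count, filter selectivity) and pick the cheapest.
package main

import "fmt"

type stats struct {
	rowCount    float64 // estimated rows in the table
	selectivity float64 // estimated fraction of rows matching the filter on x
}

type plan struct {
	name string
	cost float64
}

func candidatePlans(s stats) []plan {
	matched := s.rowCount * s.selectivity
	return []plan{
		// Index on x: scan only matching rows, then sort them by y.
		{"filter-via-index-on-x, then sort", matched + matched*2},
		// Index on y: read all rows in y order (no sort), filter on the fly.
		{"scan-index-on-y, filter inline", s.rowCount},
	}
}

func cheapest(plans []plan) plan {
	best := plans[0]
	for _, p := range plans[1:] {
		if p.cost < best.cost {
			best = p
		}
	}
	return best
}

func main() {
	// Highly selective filter: the index on x wins.
	fmt.Println(cheapest(candidatePlans(stats{rowCount: 100000, selectivity: 0.0001})).name)
	// Filter keeps half the rows: avoiding the sort via the index on y wins.
	fmt.Println(cheapest(candidatePlans(stats{rowCount: 100000, selectivity: 0.5})).name)
}
```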
Q: Thank you for this deep dive into database architecture, it was great. Could you comment on how CockroachDB fits into the big data ecosystem? What are the potential alternatives you might be replacing because of the benefits you offer, versus other tools that may be more complementary—for example, analytical databases or SQL engines like Presto? How would those interact with CockroachDB?

A: That's a great question. CockroachDB is definitely focused on OLTP use cases, so there's still a big need for an OLAP system to sit alongside one of these OLTP databases. We work with tools like Snowflake to stream changes from CockroachDB straight into analytical databases. There is work to make some analytics run well within CockroachDB, but it's not designed to be that tool.

Q: How does CockroachDB deal with gateway failure during a transaction?

A: Typically, deployments of CockroachDB have a load balancer out in front, so SQL connections are balanced across different nodes in the cluster. If your gateway fails while you're executing a transaction, that transaction will be killed and won't complete, but the client application will get an error and can just retry on a different gateway.

Q: You mentioned that CockroachDB collects a memo of the possible query execution plans and iterates on it to figure out how long each should take. How does that memo feed into the query executor on top of statistics? Based on the example you provided, it looked like the statistics were more about the number of records in a range and things like that.

A: I think I mentioned statistics twice in this talk. One was about load on a range, but we also keep per-column and per-table statistics, which work just like in a traditional cost-based optimizer. Within the memo we prune plans based on the expected cardinality at each point in the query plan.

Q: I see, so you don't actually track actual query execution time and feed that back into the statistics?

A: Right—at the moment that doesn't feed back into the statistics.
Info
Channel: Data Council
Views: 10,926
Rating: 5 out of 5
Keywords: machine learning, computer vision, AI, big data, technology, engineering, software engineering, software development
Id: OJySfiMKXLs
Length: 37min 41sec (2261 seconds)
Published: Mon Nov 25 2019