System design interview: Database Scaling

Video Statistics and Information

Captions
In system design interviews, and in system design generally, databases and scaling data are probably the most difficult problems you will face. In my opinion, being able to scale databases and design architecture around them that is durable, fast, reliable, and isolated is key to understanding system design interviews, succeeding in them, and succeeding when you design these systems in the wild. This is John Codes, and in this video we're doing a deep dive into the systems of scaling databases.

First and foremost, I want to talk about ACID. And no — get your mind out of the gutter, this is a family show, we are not talking about drugs. We're talking about the ACID acronym in database scaling, which stands for atomicity, consistency, isolation, and durability. You're going to hear about this a lot, and people like talking about it in interviews, so it's important that you know it — database systems themselves will even tell you which of these properties they prioritize. It's a bit lecture-y and computer science-y, but it comes up in interviews, so we'll blow through it quickly, and I'll explain each point and which one I believe is the most important.

Atomicity is all or nothing: every statement in a write should succeed, but if any part of it fails, none of it should go through. That way we never leave the database in a partial, incomplete state.

Consistency should almost go without saying: whatever we do, we start in a valid state and leave the database in a valid state — valid to valid, whether it's a read or a write. This one can be tricky, and a lot of database systems prioritize it because they want to return correct data at all times, never data that may be false or incomplete.

Isolation means that mutations happening on the database concurrently should produce the same result as if they had happened separately, sequentially. Queries are managed in isolation so they don't affect each other.

And finally — spoiler alert, this is the one I believe is the most important — durability. Data should be reliable, database systems should be reliable, and we should never end up in a situation where a database crashes and suddenly we've lost a bunch of user data, the world is over, and our business is done. A lot of businesses prioritize this because it looks really bad when a company loses user data in some production catastrophe. Durability is the number one property, especially in system design interviews, that you should target.

OK, to start off, let's create an arbitrary example: a server that reads from and writes to a database. This is very high level — the server could be a web application or anything else — but in this video we're diving into the concepts and strategies of scaling databases, so we'll mostly focus on the database itself.

The first, simplest thing we can do to make this more optimized and scalable is vertical scaling of the database: bumping up the memory, the RAM, the CPU — making the machine the database runs on beefier, and therefore a little bit better.
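The all-or-nothing atomicity property described above can be sketched with SQLite's transaction support. This is a minimal illustration, not anything from the video: the accounts table, the amounts, and the simulated failure are all invented.

```python
import sqlite3

# In-memory database with a made-up "accounts" table for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money atomically: either both updates commit, or neither does."""
    try:
        with conn:  # opens a transaction; rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            # Simulate a failure partway through the multi-statement write.
            if amount > 100:
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
        return True
    except ValueError:
        return False

transfer(conn, "alice", "bob", 200)  # fails; the debit is rolled back
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # alice still has 100 — no partial write survived
```

Because the failing statement aborts the whole transaction, the database never shows a state where the money left one account but never arrived in the other.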
The next thing we can do is some arbitrary query optimization. That could be a whole host of things: making the queries the application issues a little better and faster, or swapping out the client library that connects to the database for a more optimized one. There's a lot you can do to optimize queries, connections, and client libraries — but note that it all depends on the type of database you're using. The strategies you'd use to optimize queries for a MySQL database may not be the same for a Postgres database. So, just like we talked about in the previous video, know the tech. Do yourself a favor before your interviews and brush up on the topics: look at NoSQL databases, which are generally non-relational; look at SQL, relational databases; and look at the whole host of other systems with their own special features — CockroachDB, for example, is a fully distributed, built-to-be-reliable database that doesn't really fit into one category. Know the tech so you can talk about these systems and design around their features.

The next thing we can do is add read/write connections to any number of additional databases — call it plus-n databases, which also means plus-n connections.
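As a concrete taste of query optimization, here's how adding an index changes the plan a database chooses. SQLite stands in for whatever engine you're actually using, and the table is made up; the exact plan strings vary by engine and version, which is the point — optimization is database-specific.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE animals (id INTEGER PRIMARY KEY, species TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO animals (species, name) VALUES (?, ?)",
    [("zebra", f"z{i}") for i in range(1000)] + [("lion", "leo")],
)

def plan(conn, query):
    """Return SQLite's query plan as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
    return " ".join(row[-1] for row in rows)  # last column is the plan detail

before = plan(conn, "SELECT * FROM animals WHERE species = 'lion'")
conn.execute("CREATE INDEX idx_species ON animals(species)")
after = plan(conn, "SELECT * FROM animals WHERE species = 'lion'")

print(before)  # a full-table SCAN
print(after)   # a SEARCH using idx_species
```

The same one-line `CREATE INDEX` turns a scan of every row into a direct index lookup — the kind of cheap win to look for before reaching for bigger architectural changes.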
What we've done now is essentially cluster our reads and writes across multiple databases — but that's also multiple connections, so on the server side you need to dedupe your data. If we're reading and writing the same data in multiple places, the application logic has to deduplicate it. We get some reliability on the database side, but not much on the application side, and it gets pretty gross when you have to dedupe that much data. There has to be a better way.

That leads us to the first real, scalable option for most databases: what I like to call the manager-worker model. Imagine the first database as our manager, and the manager knows about, say, two other databases — the workers. The manager gets read from and written to, and it duplicates that data down to the workers, which gives us durability and reliability. There are lots of ways to arrange this. One variation: the server talks to a load balancer, the load balancer distributes load across many worker databases, and those workers report back up to a manager database. The manager is the consistent, reliable layer; the workers sit behind the load balancer and get written to and read from. Say some data comes in to one worker: it gets written there, propagated up to the manager, and then the manager's job is to propagate it back down to the other worker, so that both workers eventually hold the same data and stay consistent.
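A minimal sketch of the manager-worker idea, with plain dicts standing in for whole databases. The class names and the forwarding helper are invented for illustration; real replication involves logs, acknowledgements, and failure handling that this toy ignores.

```python
class Node:
    """A database node holding a simple key-value store (illustrative only)."""
    def __init__(self, name):
        self.name = name
        self.data = {}

class Manager(Node):
    """The manager accepts writes and replicates them to every worker."""
    def __init__(self, name, workers):
        super().__init__(name)
        self.workers = workers

    def write(self, key, value):
        self.data[key] = value
        for worker in self.workers:  # propagate the write down to all workers
            worker.data[key] = value

def write_via_worker(worker, manager, key, value):
    """A write landing on a worker is forwarded up to the manager, which
    then replicates it back down to the other workers."""
    worker.data[key] = value
    manager.write(key, value)

workers = [Node("w1"), Node("w2")]
manager = Manager("m", workers)
manager.write("zebra", {"legs": 4})

# Every node now holds the same data.
print(all(w.data == manager.data for w in workers))  # True
```

Whether writes land on the manager directly or bounce up through a worker, the manager is the single place responsible for fanning the data out, which is what makes the layer consistent.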
Again, this is what I like to call the manager-worker model, and there are lots of ways to build it — but it gives us an abstraction we can expand and build on top of.

The next one is a circular manager-worker model. We might have a load balancer that can distribute to any of our nodes, and every single node can act as either a worker or a manager, so it ends up looking like a giant circle. Say we have four database nodes — four, just for fun — and the load balancer happens to pick one of them for a read or a write. That node then creates eventual consistency across all the databases, much as if you had written to a manager: the write bounces around the circle — this node talks to the next, that one talks to the next — until eventually it makes it back around and a node says, "I already have that data; the circle is complete, and the data is consistent across all the databases."

Notice that I'm describing several reads and writes between the databases, both in this circular model and in the earlier manager-worker model. What this really is is eventual consistency versus latency. Those extra reads and writes take time — the data has to travel all the way around the circle to multiple databases — and the data won't be fully consistent across all instances for a little while. For the most part that latency is OK, as long as it isn't crazy, and as long as we're creating eventual consistency.
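The trip around the circle can be sketched like this — a toy in-memory ring, not a real replication protocol, with names invented for illustration. The stopping condition is exactly the "oh, I already have that data" moment from the description.

```python
class RingNode:
    """A node in a circular manager-worker ring (illustrative sketch)."""
    def __init__(self, name):
        self.name = name
        self.data = {}
        self.next = None  # the neighbor this node forwards writes to

    def replicate(self, key, value, hops=0):
        """Apply a write and pass it around the ring until it comes back."""
        if self.data.get(key) == value:
            return hops  # completed the circle: every node has the data
        self.data[key] = value
        return self.next.replicate(key, value, hops + 1)

# Build a four-node ring, as in the diagram.
nodes = [RingNode(f"db{i}") for i in range(4)]
for i, node in enumerate(nodes):
    node.next = nodes[(i + 1) % 4]

# A write lands on whichever node the load balancer happened to pick...
hops = nodes[2].replicate("zebra", "striped")
# ...and after one full trip around the ring, all nodes agree.
print(hops, all(n.data.get("zebra") == "striped" for n in nodes))  # 4 True
```

The hop count is the latency cost made visible: four nodes means four forwarding steps before the data is consistent everywhere, which is the eventual-consistency-versus-latency trade-off in miniature.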
Most companies are OK with this trade-off: accept some latency in exchange for the ability to scale and to keep durable, reliable data across the layer.

One thing worth noting here: somewhere in this stack — maybe behind the load balancer, maybe in the application itself — you can add a cache, or memory cache as it's commonly called. These are products; the most common one these days is probably Redis. What this layer gives you is fast in-memory caching, so you don't have to go all the way to the database to complete a query that's already been answered. Using the zoo as an example again: say I make a GET for "zebra". The read hits the load balancer, isn't in the caching layer yet, goes down to the database, and comes back up — and on the way back, the reading layer puts the result into the memory cache (you have to implement that: when something is returned, it gets placed into the cache) before fulfilling the GET request. Five seconds later the same request comes in, we check the memory cache first, and since the value is already there from the earlier database read, it's returned super fast. You can configure these caches in all kinds of ways — keeping entries for days or weeks out — and again, it's a money thing: caches are not cheap to run, especially third-party ones in the public cloud. But this is an option you can use to make read latency better: add a memory caching layer.
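The read path just described — check the cache, fall back to the database, populate the cache on the way back — is commonly called the cache-aside pattern. A minimal sketch, with a plain dict standing in for Redis and a second dict simulating the database; the TTL and the counter are invented for illustration.

```python
import time

# A dict stands in for a Redis-style memory cache; `db` simulates the
# database, and stats["db_reads"] counts trips all the way to the database.
cache = {}
db = {"zebra": {"legs": 4}, "lion": {"legs": 4}}
stats = {"db_reads": 0}
TTL = 60.0  # seconds a cached entry stays fresh (arbitrary for the sketch)

def get(key):
    """Cache-aside read: check the cache first, fall back to the database,
    then populate the cache so the next caller gets the fast path."""
    entry = cache.get(key)
    if entry is not None and time.monotonic() - entry[1] < TTL:
        return entry[0]                      # fast path: cache hit
    stats["db_reads"] += 1
    value = db[key]                          # slow path: read the database
    cache[key] = (value, time.monotonic())   # place the result in the cache
    return value

get("zebra")              # first GET: cache miss, hits the database
get("zebra")              # second GET: served straight from the cache
print(stats["db_reads"])  # 1
```

Two reads, one database trip — and the TTL is where the cost/freshness knob lives: a longer TTL means fewer database reads but staler data.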
So we're scaling up, things are happening, we have a consistent layer in our circular manager-worker model. One other thing to consider is adding partitions. A partition is just some segment, some slice, of the data — and the tricky part is deciding which slices to separate from each other to create a better scaling option. What do I mean? Back to the zoo: one partition might be all the animals, another all the employees, and so on — you see what I'm getting at. I could slice the partitions up by table and keep it super simple: animals in one partition, employees in another. But these can get complicated: if the application runs a lot of queries across the relationship between employees and animals, it might not make sense to partition them apart, because then you're creating more load for your databases to join those tables across partitions. Let's stick with animals and employees for now. I might have one database system — drawn as a box, because it could be its own whole scaling thing — where all animal data lives, and another, also its own scaling system, where all employee data lives. So you can create partitions not just within a single database, but across completely separate database systems. Again, this can get really complicated really quickly.
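The table-level partitioning just described boils down to a routing decision: which database system owns this entity type? A tiny sketch, with dicts standing in for the two database systems and names invented for illustration:

```python
# Two separate (simulated) database systems, partitioned by entity type:
# all animal data lives in one, all employee data in the other.
partitions = {
    "animals": {},
    "employees": {},
}

def route(entity_type):
    """Pick the partition (database system) responsible for this entity type."""
    try:
        return partitions[entity_type]
    except KeyError:
        raise ValueError(f"no partition for {entity_type!r}")

route("animals")["zebra"] = {"species": "zebra"}
route("employees")["ada"] = {"role": "keeper"}

print(sorted(partitions["animals"]))    # ['zebra']
print(sorted(partitions["employees"]))  # ['ada']
```

The routing itself is trivial; the hard part, as the text notes, is choosing boundaries so that queries rarely need data from both sides at once.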
Say there's a relationship between animals and employees: how are you going to join those tables, those partitions, together? You'd have to build some kind of read/write layer across those distributed database systems so each could return its data individually. That might even happen at the application layer — but then you're doing database joins and all that stuff inside your application. Blah. Gross. You don't want to do that. There are database systems that make this easier — etcd (which is built on the Raft consensus algorithm) and RethinkDB come to mind — with existing algorithms and features for exactly this kind of thing. Again: know the tech.

Along the same lines, the next thing to consider is sharding — and as these techniques get more scalable, they get more complicated; sharding can be extremely complicated. Essentially, with sharding you have a shard key — say "123" — and for that key you will always, always get back the same subset of the data. That's sharding in a nutshell, and it's almost like a hash map: in a hash map, one key always returns the same value; with sharding, a key always returns the same subset of data within the database. Usually this looks like a load balancer — maybe something like NGINX — that accepts the sharding key and routes to the appropriate database to retrieve your data, where each shard might again be its own individually scaling database system. This can get insane: you might have fifty or a hundred of these instances — n plus whatever, really — and you'd need that much sharding if you wanted to scale to hundreds of millions or even billions of users while keeping your queries fast.
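The "same key always lands on the same subset" idea can be sketched with a stable hash. The shard count and keys are arbitrary, and the dicts stand in for whole shard databases; real systems add consistent hashing, rebalancing, and a routing tier, none of which appear here. (A stable hash like SHA-256 is used instead of Python's builtin `hash`, which varies between runs.)

```python
import hashlib

NUM_SHARDS = 4
shards = [dict() for _ in range(NUM_SHARDS)]  # stand-ins for shard databases

def shard_for(key):
    """Map a shard key to a shard. The same key always lands on the same
    shard — like a hash map whose buckets are entire databases."""
    digest = hashlib.sha256(str(key).encode()).digest()
    return shards[int.from_bytes(digest[:8], "big") % NUM_SHARDS]

def put(key, value):
    shard_for(key)[key] = value

def get(key):
    return shard_for(key).get(key)

put("user:123", {"name": "Ada"})
print(get("user:123"))                       # {'name': 'Ada'}
# Exactly one shard holds the key; repeated lookups hit the same shard.
print(sum("user:123" in s for s in shards))  # 1
```

Because each query touches only one shard holding roughly 1/NUM_SHARDS of the data, individual lookups stay fast even as the total dataset grows — which is exactly the scaling property described above.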
Your sharding key is essentially a locked key that gives you one subset of the data — maybe only a hundredth of all of it — so each individual instance can still be pretty fast. I will say this is probably the most complicated thing you could try to implement, but it is the most scalable.

OK, the next one to talk about is data centers. Everything I've covered so far stacks pretty cleanly, in the sense that any of these database systems can be implemented as an individual system inside the next, higher-order thing. It gets really nitty-gritty when you're scaling this amount of data, so it's fine to draw simple abstractions — "here is my database scaling system" — and assume that thing can scale individually inside the broader system we're talking about. A data center is a whole thing in itself in a single region, much like AWS regions: a whole database scaling system, a caching layer, maybe even the application logic, all inside one data center. Say this one is us-east-2, and I'm a user in the United States reasonably assured that all my requests will be routed there: then all my data, and the whole data system, lives there. We can split up data centers this way — this one could be the EU, there may be one in Asia somewhere, et cetera — and each is its own thing whose data we don't have to hold and replicate everywhere, because as a us-east-2 user I just go to my data center, and it holds a smaller subset of the data that can be replicated, sharded, or managed on its own.

This creates some really interesting, tricky data-region problems of its own. The EU has different laws than the United States does, and so does Asia: one example is that data in China has to be hosted and managed in China, so you wouldn't be able to serve requests to users in China with data stored in the United States. That creates genuinely difficult, hard-to-solve problems around how you manage data at such a scale across data centers with different data retention policies and storage laws — it can get really complicated really quickly. One thing people will often do is simply default to the most restrictive law across all regions and all data centers: that way you never end up being lenient in one region, strict in another, and getting something crisscrossed in the wires. It's unfortunate, but doing the most restrictive thing is common, and it's a consideration to keep in the back of your mind when you start scaling at a global level.

Another method we can talk about is having separate layers for reading and writing: one database here for writing, another for reading, each again its own database scaling system, and the two kept eventually consistent among themselves. What this gives you is the ability to choose different database systems for each side — say Elasticsearch on the read side and Cassandra on the write side — so you can take advantage of the individual tech of database systems that are better at writing over here and better at reading over there, scaling them separately with different read/write layers and different read/write databases.
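The separate read and write layers can be sketched as follows. The two dicts stand in for, say, a write-optimized store like Cassandra and a read-optimized store like Elasticsearch, and the explicit `sync()` stands in for whatever replication pipeline keeps them eventually consistent; all names here are invented for illustration.

```python
class SplitStore:
    """Separate write and read layers: writes land in one backend, reads are
    served from another, with a sync step standing in for replication."""
    def __init__(self):
        self.write_db = {}  # e.g. a write-optimized store (Cassandra-like)
        self.read_db = {}   # e.g. a read-optimized store (Elasticsearch-like)

    def write(self, key, value):
        self.write_db[key] = value

    def sync(self):
        """Propagate writes to the read layer (the eventual-consistency step)."""
        self.read_db.update(self.write_db)

    def read(self, key):
        return self.read_db.get(key)

store = SplitStore()
store.write("zebra", "striped")
print(store.read("zebra"))  # None — the read layer hasn't caught up yet
store.sync()
print(store.read("zebra"))  # 'striped'
```

The `None` before `sync()` makes the trade-off concrete: you gain the freedom to pick the best technology for each side, but reads can briefly lag behind writes.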
And finally, the last one I want to talk about is SaaS: you could just give somebody money to do this for you. Amazon Web Services, for example, has its own reliable, scaling database offerings. Again, this all comes down to dollar signs — I'll tell you what, you can scale something really, really well on AWS, but you're going to pay an arm and a leg for the public cloud. Just a consideration.

And that really is database scaling in a nutshell. I didn't cover every algorithm, strategy, or concept, but I hope this gives you a good idea of how you might go from just one database and scale it all the way up to worldwide, millions-of-users scale. This has been an overview of database scaling for system design interviews. Thank you to all my current subscribers — like, comment, subscribe, you know how to do it — and again, thank you for watching. Be well, everybody. Peace.
Info
Channel: John Codes
Views: 650
Keywords: system design, database, software interview, scalability, computer science, databases, sharding, interview preparation, database partition, horizontal scaling, vertical scaling, database scaling, scaling databases, scaling dtaabase, web development database scaling, backend software engineer, backend software engineer scaling interview questions, system design interview database scaling, system design interview, software engineer scaling, software engineer interview, john codes
Id: qjsuOUXxD64
Length: 22min 13sec (1333 seconds)
Published: Sat Jul 10 2021