Database Sharding and Partitioning

Captions
Sharding and partitioning come in very handy when we want to scale our systems. These concepts operate at the database level, and they help us improve the overall throughput and availability of the system. In this video we take a detailed look at how a database is scaled and evolved through different stages, what sharding and partitioning are, the difference between them, at which stage we should introduce this complexity, and a few advantages and disadvantages of adopting them.

But before we move forward, I want to talk to you about a cohort-based course on system design that I have been running since March 2021. If you are looking to learn system design from first principles, this course is for you. Because it is a cohort-based course, it will not just be me rambling out a semi-optimized solution thinking it is the most amazing solution out there; instead, it is a collaborative environment where every single person in the cohort can pitch in their ideas, and we evolve the system around that. Every problem statement comes with a brainstorming session where we all brainstorm and evolve the system together, so everyone understands the trade-offs made along the way. Instead of just saying "we'll use a particular queue," we have the justification for why we use that queue, why that particular database, why SQL and not NoSQL, how we are leveraging throughput, and how we are ensuring that the system scales. That is the highlight of this course. It has been taken by more than 500 engineers to date, spanning nine countries and seven cohorts, with people from top companies.

The outline is very exciting. Week one is the core foundation of the course, where we design an online/offline indicator and then design our own Medium. Then we go deep into databases, where we look at database logging in depth, see some very good examples of it in action, and see how we ensure that the system scales through that. Week three is all about going distributed, where we design our own load balancer; I walk you through the actual code of a load balancer and show how TCP connections are managed and how simple it is to build one. Week four is all about social networks. Week five is all about building your own storage engines, so that you build the intuition for how you would design one if you ever had to. Week six is about building high-throughput systems. Week seven is about information retrieval systems and related designs, where we design our own message brokers like SQS and a distributed task scheduler. We conclude with week eight, where we talk about the super clever algorithms that have made those systems possible. I have also attached, verbatim, a video from my first cohort where we designed and scaled Instagram notifications; I highly encourage you to check it out. And now, back to the video.

So what are sharding and partitioning? A textbook definition says that sharding is a method of distributing the data across multiple machines, while partitioning is all about splitting the data. But before we get into what sharding and partitioning actually are and how we see them in the real world, let's talk about how a database is actually scaled, because sharding and partitioning as concepts are very tightly coupled with the database.
So what exactly is a database? You have an EC2 server; on that server you install MySQL and start the process. That process is your database. The process exposes a port, and any client that wants to talk to the database talks via that port, for example 3306. A request comes in on the port, the port is coupled with the process, the query goes to the MySQL process, the process computes the query and sends back the result. In diagrams we represent a database with the usual fancy cylinder symbol, but in reality this is exactly what your database is. The MySQL server uses the local disk of the virtual server to persist the data durably.
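To make the "a database is just a process listening on a port" point concrete, here is a minimal sketch of a client talking to that process on port 3306. It assumes the PyMySQL client library; the host, credentials, and users table are hypothetical placeholders, not anything from the video.

```python
# Minimal sketch: the "database" is just the mysqld process listening on a port.
# Assumes PyMySQL is installed; host, credentials, and table are hypothetical.
import pymysql

conn = pymysql.connect(
    host="10.0.1.12",   # the EC2 instance running the MySQL server process
    port=3306,          # the port the process exposes
    user="app",
    password="secret",
    database="app_db",
)

with conn.cursor() as cur:
    cur.execute("SELECT id, name FROM users WHERE id = %s", (42,))
    print(cur.fetchone())  # the process computes the query and returns the result

conn.close()
```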
Now say you put this database server into production to serve real traffic, and you are getting 100 writes per second, which is pretty decent traffic, and you are very happy about it. You started with a small server because you didn't really have money, and why would you need a big server when a small one can do the job? You have a bunch of API servers that talk to this database to get things done. At 100 writes per second, you are extremely happy.

Then someone famous tweets about your product and it sees a spurt of growth: from 100 writes per second it goes up to 200 writes per second. As the writes increase, you see a few metrics degrade: CPU utilization is consistently around 80 to 90 percent, memory utilization jumps, and query execution time increases. So you think, "now that I'm getting more traffic, let me scale up my instance." This is where vertical scaling comes into play: you start small, but as you see more consistent traffic you scale up your database, which means you give it more CPU, more RAM, more disk. In the real world this means you increase the CPU, memory, and disk of the EC2 server on which the database is hosted. There is no change to the MySQL process itself; it was running before and continues to run after, but the server it runs on now has more capacity. Because you scaled up you are bearing more cost, but that's okay, because you are gaining business: from 100 writes per second you are now supporting 200.

But your system is not only doing writes; it is also doing reads, for use cases like fetching a user's profile, reading related information, complex joins, and so on. Earlier, one database instance was more than enough to support both reads and writes, but now you have a large number of reads. You are handling the writes by scaling up the database, but to give it a helping hand you also create something called a read replica. The read replica has an exact copy of the data, so your API servers can redirect all read requests to that instance, while the original instance is freed up to do critical reads and all the writes. This is the classic leader-follower (master-follower) architecture: all writes go to the leader, reads go to the read replica, the data on both nodes is the same, and whenever a write happens on the leader it is propagated to the follower. So this is how you handled 200 writes per second: by vertically scaling your machine and adding a read replica to support the large number of read requests coming in.

Now let's say someone really, really famous tweets about your product and it goes viral. You were prepared to handle 200 writes per second, but suddenly you are getting 1000. So you scale up again: more CPU, more RAM, whatever it takes, because the product is going viral and your business is taking off. Now you are supporting 1000 writes per second, with read replicas and all, and you are happy that the system is working fine. But the traffic is consistent, a lot of people are using your product, and the traction now requires you to handle 1500 writes per second. Up until now you scaled vertically, so you think: can I not just add more? You go to the AWS console and find you are already at the maximum instance size; you cannot grow beyond this. Vertical scaling has a hardware limit — you cannot go beyond a certain configuration because of physical limitations of the hardware. So you are capped at 1000 writes per second, but your business is growing, and you are not going to turn business away just because your database cannot keep up.

This is where you think about growing horizontally. A truly scalable system is horizontally scalable, so that no matter how much traffic or how many users come in, you always have a plan to scale out. With vertical scaling we scale up; with horizontal scaling we scale out. You know that one instance of this size can support 1000 writes per second, but you need 1500, so you split the data. You add one more server: the first server handles fifty percent of the data and the second server handles the other fifty percent. Assuming a uniform distribution of traffic across the data, fifty percent of the requests go to one server and fifty percent to the other. So when you are getting 1500 requests per second, they are split equally across both servers, each handling 750 writes per second.
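Here is a rough sketch of what that 50/50 split can look like in application code, assuming keys hash roughly uniformly; the shard hostnames and the MD5-based scheme are illustrative choices, not a prescription from the video.

```python
# Sketch of the 50/50 split, assuming keys hash roughly uniformly across two shards.
# Shard endpoints and the hashing scheme are illustrative only.
import hashlib

SHARDS = ["shard-0.db.internal", "shard-1.db.internal"]  # hypothetical hosts

def shard_for(key: str) -> str:
    # Stable hash so the same key always lands on the same shard.
    digest = hashlib.md5(key.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# With roughly uniform keys, ~750 of every 1500 writes/sec land on each shard,
# comfortably under the ~1000 writes/sec ceiling of a single node.
print(shard_for("user:1001"), shard_for("user:1002"))
```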
Your physical limit per node was 1000 writes per second and you are at 750, so each node handles its share very easily, and in total you are handling 1500 writes per second. As a coherent, unified system you are handling 1500 writes per second even though the physical limit of a single node is only 1000. You are able to get higher throughput than what a single machine could offer, which is why, at massive scale, companies go for horizontal scaling: it gives them higher read and write throughput. This is the classic advantage of a horizontally scaled database, and this is exactly where the concepts of sharding and partitioning come in handy. Here we have partitioned the data across two shards so that fifty percent of the traffic goes to each, and we are able to handle higher throughput.

So what did we just do? We added one more data node. The data node we add is called a shard: we added a shard to our database, and we partitioned the data. Partitioning is all about splitting the data. So the database server is sharded, while the data is partitioned. People use the two terms interchangeably, but a shard is at the database level and a partition is at the data level: we shard a database and we partition the data. This is a crude oversimplification, but it captures the distinction.

Let me give you a simple example. Say I have 100 GB of data, and for some reason one node cannot handle 100 GB. So I partition the data using some partitioning strategy (we will look at strategies in a future video); say I split the 100 GB into five smaller chunks of 30 GB, 10 GB, 30 GB, 20 GB, and 10 GB. All of these partitions have to be mutually exclusive: data present in one partition should not be present in any other. This way, instead of one node handling 100 GB of data, a node only has to handle, say, 30 GB. That is partitioning of the data.

Now, nothing stops you from keeping all five partitions on one database server. Partitioning operates only at the data level; you can have all five partitions on one database server, or you can choose to have multiple database servers, and that choice is the number of shards you have. If you have two shards, you can split the five partitions across those two shards, or you can keep all five partitions on one shard. It just gives you a nice isolation of traffic.

A simple diagrammatic representation: the 100 GB data set is split across five partitions, A through E, and there are two shards. Partitions A and C go on shard one, while partitions B, D, and E go on shard two. Whenever a request comes in, we need to know which partition the data lies in — A, B, C, D, or E — and depending on which partition holds that data, we forward the request to the corresponding shard. Say a request comes in and we know the data is part of partition B; partition B lies on shard two, so we forward the request to the database server that is shard two, and within it the query hits that particular partition to get the data.
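A minimal sketch of that layout as a partition-to-shard map, assuming a lookup-based router; the partition function, hostnames, and key format are hypothetical stand-ins.

```python
# Sketch of the partition-to-shard layout from the example above.
# How a key maps to a partition (partition_of) is use-case specific and only
# stubbed here; hostnames and key format are hypothetical.
PARTITION_TO_SHARD = {
    "A": "shard-1", "C": "shard-1",
    "B": "shard-2", "D": "shard-2", "E": "shard-2",
}

def partition_of(key: str) -> str:
    # Placeholder: real systems derive this from a range, hash, or lookup scheme.
    return "B"

def route(key: str) -> str:
    # Moving a hot partition to another shard is, logically, just updating this
    # map (plus physically copying the data over).
    return PARTITION_TO_SHARD[partition_of(key)]

print(route("order:981"))  # -> "shard-2", where partition B lives
```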
So partitioning is all about splitting your gigantic data set into smaller chunks so that you only have to operate on a small chunk. And if you ever want to move a partition, you can. Say partition A is very hot — a lot of requests are hitting it, and because of that, partition C's performance is getting affected. Because it is a partition, you can physically take it and move it to shard two. This is load balancing across partitions: a partition, as a logical entity, gives you the superpower of quickly moving partitions across shards, and that is the advantage of partitioning the data. To reiterate: shards are physical servers, while partitioning happens on the data.

So how do we partition the data? There is no single global rule, but there are two categories of partitioning: horizontal partitioning and vertical partitioning. Horizontal partitioning operates at the row or document level, while vertical partitioning operates at the column or table level. We will take a detailed look at partitioning strategies in a future video, but the way we split 100 GB into five partitions cannot be random: there has to be a deterministic way of doing it, and it depends entirely on the load you have, the use case, and the access pattern. Randomly allocating rows to partitions would have a massive impact on performance; it has to be done in a smart way, driven by your use case, your access pattern, and your overall load.
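As a toy illustration of the two categories (not a recipe from the video), here is a sketch of horizontal versus vertical partitioning on a made-up users table; the rows and the split rules are purely illustrative.

```python
# Toy illustration only: the "users" rows and split rules are made up.
users = [
    {"id": 1, "name": "a", "bio": "...", "last_login": "2022-04-01"},
    {"id": 2, "name": "b", "bio": "...", "last_login": "2022-04-02"},
]

# Horizontal partitioning: split by row -- each partition holds full rows
# for a subset of ids.
horizontal = {
    "p0": [u for u in users if u["id"] % 2 == 0],
    "p1": [u for u in users if u["id"] % 2 == 1],
}

# Vertical partitioning: split by column -- small, hot columns in one
# partition, bulky or rarely read columns in another, joined back by id.
vertical = {
    "users_core":    [{"id": u["id"], "name": u["name"]} for u in users],
    "users_profile": [{"id": u["id"], "bio": u["bio"], "last_login": u["last_login"]} for u in users],
}
```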
Now let's take a holistic view of what partitioning and sharding look like together. This is a diagram I always use to explain it: sharded yes/no against partitioned yes/no, so we consider all four cases.

Not sharded, not partitioned: this is what you do locally, the day-zero architecture. You spin up one MySQL server on one machine, create a database within it, and start using it — one EC2 server, one MySQL process, and all your data inside it.

Not sharded, but partitioned: you split your data into logical partitions but keep all of them on the same server — multiple tables, or multiple logical partitions of the data, in one database instance. A classic example: on a single MySQL server you can create two databases by firing the CREATE DATABASE command, say one for an airline check-in system and one for a ticket booking system. That creates two logical databases inside one MySQL server, and those two databases are your partitions.

Sharded, but not partitioned: you have two database servers, but the data is not split, so both servers hold the same copy of the data. This is the classic read-replica case: the entire data set is replicated, there is no partitioning, and you typically use it to handle a large number of reads.

Sharded and partitioned: you have two database servers and two partitions of the data, and each partition resides on one server. This handles both large reads and large writes — for example, one partition was getting a large number of writes, so you partitioned the data and kept the partitions on two different servers to get higher write throughput.

Those are the four cases of sharding and partitioning, and this is how you can easily visualize them. Now on to the final part: why take all this effort, and what are the advantages and disadvantages of sharding? You would typically shard to handle more reads and writes: by adding one more database server you can roughly halve the load on each server and handle twice the traffic if you want to, so you are prepping yourself for the large volume of reads and writes coming your way. Sharding also increases your overall storage capacity: everything has a limit, including the disk of a database server. Say the limit is 100 terabytes per server, which is very high, but you have 200 terabytes of data — you would need another database server just to store it. And sharding gives you higher availability: if one database server goes down, another can take over; in the read-replica case, if one server goes down you can easily use the other to continue serving traffic.
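For the read-replica case above, here is a minimal sketch of how an application might split reads and writes between a leader and a follower; the endpoints and the crude SELECT-based classification are assumptions for illustration, and a real setup also has to account for replication lag.

```python
# Minimal sketch of the read-replica case: writes go to the primary, reads go
# to the replica. Hosts are hypothetical; replication lag is ignored here.
PRIMARY = "primary.db.internal"
REPLICA = "replica.db.internal"

def endpoint_for(sql: str) -> str:
    # Crude classification: anything that is not a SELECT is treated as a write.
    is_read = sql.lstrip().lower().startswith("select")
    return REPLICA if is_read else PRIMARY

print(endpoint_for("SELECT * FROM users WHERE id = 42"))      # -> replica
print(endpoint_for("INSERT INTO users (name) VALUES ('x')"))  # -> primary
```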
The sky is not always blue, though; sharding has real disadvantages. It is operationally very complex once you have multiple databases to manage. In the read-replica case you have to ensure that the replication lag stays minimal and that data is continuously replicated. When you have both partitioned and sharded, you need to deterministically know which shard a partition lies on so you can route requests to it, and if one database server or one partition gets hot, you have to rebalance and move that partition to another shard. It becomes operationally very complex, but at scale you have to do it; there is no way around it.

The biggest disadvantage of sharding is that cross-shard queries are very expensive. Say you have partitioned the data across two shards, and now you want to join two tables, one living on each shard. You cannot do that efficiently: cross-shard queries are not really possible in a straightforward way, they are expensive, they hurt throughput, and they increase latency (a rough sketch of what such a query ends up looking like in application code follows below). So whenever you shard, always design it such that all of your queries can be answered from a single shard and you never have to run cross-shard queries. If you do have to, brace yourself: it is going to be super expensive in time, in compute, in everything.

That's it for this video. I hope you now know the difference between sharding and partitioning, where to use them, how databases evolve, and how we approach situations like this in the real world. These concepts are not just theoretical; I have given you the practical use of all four cases. If you liked this video, give it a thumbs up; if you like the channel, subscribe — I post three in-depth engineering videos every week — and I'll see you in the next one. Thanks.
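As referenced in the cross-shard discussion above, here is a rough sketch of what a cross-shard join ends up looking like when done as scatter-gather in the application; the shard contents are made up purely for illustration.

```python
# Sketch of why cross-shard queries hurt: a join across two shards has to be
# done as scatter-gather in the application. Shard contents are illustrative.
shard_1 = {"users":  [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]}
shard_2 = {"orders": [{"id": 10, "user_id": 1}, {"id": 11, "user_id": 2}]}

def cross_shard_join():
    # 1. Query each shard separately (extra round trips, extra data transfer).
    users = shard_1["users"]
    orders = shard_2["orders"]
    # 2. Join in application memory -- work the database would normally do.
    by_user = {u["id"]: u for u in users}
    return [{"order": o["id"], "name": by_user[o["user_id"]]["name"]} for o in orders]

print(cross_shard_join())
```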
Info
Channel: Arpit Bhayani
Views: 69,519
Keywords: Arpit Bhayani, Software Engineering, System Design, Interview Preparation, Handling Scale, Asli Engineering, Architecture, Sharding, Partitioning, Difference between sharding and partitioning, sharding vs partitioning, sharding explained, partitioning explained, real-world example of sharding and partitioning, database sharding, data partitioning, horizontal partitioning, vertical partitioning, how to scale a database, handle high throughput, advantages of sharding
Id: wXvljefXyEo
Length: 23min 53sec (1433 seconds)
Published: Mon Apr 25 2022