When should you shard your database?

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
what is going on guys my name is russain and in this video i want to discuss a question that some of you asked me when do i shard my database and short answer for those of you are impatient last thing you want to do is the short your database there are so many other things that you can do before complicating your life with sharding so with that said let's just jump into it guys if you new here welcome my name is Visayan in this channel we discuss all sort of back in engineering so if you're interested to become better software engineer consider subscribing had that be like and so I get notified every time I upload a new video with that said let's just jump into shorting when do you shut your database guys shorting the idea of shorting a database is I know it sounds that the term sounds very sexy arrives of I or my short my database on it makes you sound intelligent but if you think about what you're doing with charting it is a very complex architecture and complex operation to implement so in order to talk about sharda this you have so many other things you can optimize it's like why are you doing it to begin with right so let's start over you have a beautiful database and I'm now talking about mostly relational databases because that's my specialty I very rarely work with no no sequel databases but they work similarly but here's the thing in in relational databases you have a single instance database right and you probably building a web application so you need a web server or an API framework such as nodejs django all that jazz go application your custom application that essentially speaks HTTP protocol that talks to the browser's or some rest clients or any GR job is easy but what is the thing Graff cure thank you yeah so a few anything that's basically she TTP and you make request and that request goes to the web server and the web server makes queries to the database and response gets your back responses mmm kind of change the representation to something that the client understand could be JSON XML anything and then ship it back to the coin that's the rest representation alright change the representation all right so that looks good what's the problem with this you get a lot of requests a lot of people are reading the same that's like I'm getting give me all this employees table or give me I'm reading this profile I'm reading this for I'm reading this tweet I'm reading this thing so you're hitting that database a lot and you're reading a lot is it ok that you're reading from a table right at that table is large it's getting large because you were inserting a lot of stuff in it as 1 million 2 million 3 million 4 million 2040 hundred million rows in it and you might say oh saying doesn't matter because I have an index on it that index is gonna be smaller right and I'm gonna head the index because I'm smart I am going to index the fields and I'm only querying based on those fields that are indexed so my queries are fast but then Dex can give you only so far because the larger the table gets the larger the index gets and almost you doing and full index scan which can also be slow so the solution is not to just split everything in to multiple servers you're doing it you're over-engineering too early and that's what the problem with most people are just over-engineering too early you can still do so much stuff right well you have this this is there a problem now right you have to undo I what the problem first in order to solve it you cannot just solve something that you don't know what the problem is right identify the problem so let's say this is my problem my reads are slow I didn't say anything about rights laws just raw reads because writing is always fast you're in writing to the end of the table and even if you're using a b-tree data structure index yeah if you're using an Asst this D that could be harmful for this is D because you're rebalancing that tree sometimes if you're writing a lot and that could cause some shifts and the desk editors now that is bad as read I'm LS m is optimized for me but not right but that's another story right so I'm reading a lot Miley is getting slower and slower and slower it used to take one millisecond I was like 20 and now it's 100 milliseconds what do you do tables large what do you do with it there's something called most databases of course called partitioning and I'm gonna talk about just the horizontal partitioning so imagine this this is your table and I talk about horizontal and vertical partition I'm gonna reference the video here go check it out but in a nutshell you have a huge table hundreds and millions of rows you just slice it in half horizontally that's that's the horizontal you slice it in the middle and you say your this piece from this ID to this ID go to this stable right and we're gonna call this a partition and this ID to this ID goes to this table I'm gonna call a partition to and this step this ID to this idea got a partition free so now you were sort of having one table you have three different tables that the problem the whole logical table but when you query we can find out based on something called a partition key which usually is this is the primary key of the table doesn't have to be though but based on that you can know which table to hit now you have smaller tables you have ten smaller tables and guess what with a little bit of knowledge from the client we can know exactly which small table or partition to hit and now I'm working with a smaller index and my queries will be fast instead of having this huge index now I have a smaller index and I know exactly what quit so my coils are fast just like that by just enabled partitioning and most databases Postgres for example that's that blindly for you if you enable it and the client doesn't have the even neat the knowledge to know that oh I need to hit this table table partition underscore what no it's just happen for you previously in all the releases you had to have this knowledge oh this is this is stable under stable listing remember guys I'm still in the same server there's one instance one database instance but with a little bit of a tweak of a feature of my database I fixed the problem right you don't read charting we're gonna talk about shorting in a minute what doesn't really mean but you come back you say Husain no that's not good because yes I am my singular reads are fast I'm selecting star from this table and and yeah it's very fast now because I partition and all that stuff but my database server cannot handle because all these TCP connection to a single server just to read stuff it's overwhelming my server and it's the file descriptors are blowing our of proportions and I cannot handle that anymore right so what do you do well I'm not even introducing the idea of caching you can obviously introduce a layer of Redis cash and all that stuff but it comes with its own problems because you will end up with inconsistency and you have to invalidate the cache all that stuff but you can play with that and without introducing that extra layer I have some tricks up my sleeve to fix that problem right and most of the stuff you guys know it probably but replication why is nobody talking about this you have a master backup so and you some some databases have this by default this this features where you have one master and create another server and make this server a backup of that master and what it will do is periodically the master will push the edits write the rights to that to the replicas to the backup so you can create as many backup as you want and multiple servers with partitions and do all that stuff now you segregated multiple servers but you have one right to do all right you go to that dank server you don't want to write to multiple servers that will confuse a lot of things and we're going to talk about it in a minute because reading is easier than writing reading is doesn't really cause conflicts from writing if I'm writing to the same row yeah in the same different servers bad things happen how do i reconcile these changes right it's just you can solve the but it becomes a very challenging problem too soft so you want to avoid complex problem and why would you do it I'm gonna write to myself because my rights are not as bad I don't I barely write anything but I read a lot so you can just multiple servers yeah one one master server and just push all that stuff to that replicas all the backups and now your clients or the web servers but a reverse proxy actually proxy engine x-h anything you want and then start final oh you want to read go to this database oh you know worried you get this the backup go to this back up and now you're balancing the reads but the writes still goes to this database that is the master and some people who come say so know who's and you don't know me because I write a lot and I need to write a lot and my single server that does the right no longer is sufficient it's gonna die because I had write so much to it I really doubt it first because if you're writing a lot you must be writing a lot to saturate a single server I'm assuming you have a big server that does that stuff right well let's say you do you're writing a lot you still have some tricks before the final thing which is charting you can do multiple replication a two-way replication but have two masters on different regions in in the US this is very common this is a u.s. East and this is a u.s. West people from Florida writes to this server anything that has to do customers with was the East just deal with this anything has to do with customer the West deal with this and most of the time they don't overlap right it's really so if you have this segregation use it to your advantage and then synchronization of this can happen very easily because you're barely gonna have any conflict and if you do you're gonna resolve it and if this is too complex you finally go and do sharding which is the following sharding is the idea of having a single key see after all of this all exhausted right come down and do charting right if if all of the stuff are exhausted then go do charting if none of them then probably you don't have to do Shawn because yawning is the most complicating thing to accomplish and let's let's talk about what it's become easier these days but let's talk about charting so what is charting charting is the idea of splitting like partition like horizontal partitioning but pull the cables that are partitioned not in a single in a different table in the same instance same server know both those partitions or shards in a different server all together okay yeah with partitioning you can still do transaction by the way guys the database supports transactions with shorting forwards charting with sharding my my tongue has been swallowed well with sharding you cannot do transactions forget about transactions bye-bye roll back all this beautiful commit stuff acid atomicity bye bye no shouting for you you cannot do all that because what do you do if you want to write to multiple shots this is different servers right and you cannot maintain the asset transactions I know some people don't like as a transaction with favor the others do with the eventual consistency but the old-school kind of a guy all right so so what do you do is I am about to update customer number hundred and that's the real number hundred you say ok the client you're the coin which is your episode whoop server in this case it's ok customer number 100 is between hundred between 1 and 10,000 so yeah as server number one let me establish a TCP connection with several number one and execute the right or the rate right and and you received another request but this request has to do with customer number ten thousand and seventeen or 2017 is between 10,000 and 20,000 and based on my char Dhingra arranged this is server to subject TCP connection to the server to and executed and come back a result and yep you might say yeah I'm gonna use pooling because I don't want to establish TCP connection Anderson and I absolutely agree with you you have to use pooling you don't have to do this cold start of TCPS hopefully one database platform we're gonna use quick and all this problems will go away and they call start and all that stuff but yeah charting is very complex dad logic that we just talked about is in your client application that's hard right it's very hard by the way I made a video about shorting guys and actually show you how it's done check it out here but what do you do guys right this is do you do you really want to do this with your life and sacrifice roll backs and comments and transactions and all this stuff just to do a stinking sharding just to do what scaling you just want to scale the rights I agree it's beneficial but sometimes it's an overkill you get a really neat thing do you really need to shard right all right that being said YouTube's run into this problem as you might as you might think because YouTube is millions millions of videos which with what billions of for hours and hours uploaded right so thousands and thousands of hours are uploaded to this platform and comments and and I watched listened to a podcast with the lead of YouTube talking about how they softly they have a mice my sequel database on the back and by the way they manage all that stuff and the videos are I believe in I forgot where it is maybe it's three I might be wrong there before that was before Google acquired it and all that stuff right so they moved everything to YouTube to Google infrastructure now that's a but they for a long time they had a single my sequel server and they managed to do fine until that rights became absolutely disastrous so they started doing what we talked about and until they couldn't handle so they have to do sharding which is called where we talk about application-level showing the application is aware of the shard which is this piece of the partition there and when you know when you are aware that's problem because you create coupling and coupling is the worst thing you want in software because you want software to be as isolated as possible you want software to have a social anxiety social distancing and these days through social distancing for software keep yourself apart they shouldn't know stuff about each other that's just a general idea about suffering coupling is bad about software come back now what they did is like they had this application level short for long term it's it's ok this is a ridiculous we cannot maintain this because if they want to change some logic and everything breaks right they want it forget about restarting the idea of free shorting is like oh ok I want to change the range from ten thousand because ten thousand is really low let's make it a hundred thousand right each short has a hundred thousand rows nope forget about that nobody can do that because that will break every single client you got to go to that client and yeah so it's complicated so they created this technology called the test and I'm going to reference it here and the test is essentially a middleware that you send your query on top of my sequel in its I believe it's open source and it's kubernetes reddy's and all that jazz and you put it on your my sequel architecture and it's gonna do the magic shorting for you and that is awesome because you send a beautiful sequel query and they parse that query and they understand what shard to hit and you don't need anything better than that if you absolutely because nothing's for free if you want if you have the maintenance cost of implementing the test go ahead and do it any technology you implement there's a cost for it if you're ready go ahead but ask yourself do you really need it sometimes you do not need it that was an awful British accent impression alright guys that's it for me today that was an answer 20 minutes video answering when to shard what do you guys think am i full of BS or do you agree I am welcoming all the comments guys I really learn a lot from from you guys I learned a lot from you because we each develop and and advance when we listen to each other I'm gonna learn a few I might have said something that's absolutely wrong here and someone and and it happened countless of times like hey Husain that's BS with the moment you said a cat minute 7 that's incorrect and there here's why they tell me here's why and here's that proof for it and I absolutely credited them I heart that comment and I sometimes pin it because I admit not perfect we're all learning we're gonna learn a lot and keep learning guys all right guys stay safe out there and see you in the next one good bye
Info
Channel: Hussein Nasser
Views: 31,466
Rating: 4.9021072 out of 5
Keywords: database engineering, database sharding, when to shard the databaes, when to use sharding
Id: iHNovZUZM3A
Channel Id: undefined
Length: 21min 20sec (1280 seconds)
Published: Tue Apr 28 2020
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.