How indexes work in Distributed Databases, their trade-offs, and challenges

Captions
Indexes make database lookups faster, and we typically create indexes on secondary attributes. Things become really interesting when your database is sharded and partitioned. When you have a large volume of data and one node is not able to handle the load, you shard the database: you partition the data and place it across multiple data nodes.

Let's take a practical example to understand how indexing works in a distributed data store. Say you have a blogs database holding a large number of blogs; you're building a Medium-like application in which tons and tons of blogs are published. Given the large volume of data, one node will not be able to handle the load, which means we have to create multiple partitions of the data and place them across multiple shards. To do that, we need to pick a partition key on the basis of which we split the data. Say we pick the author ID as the partition key. Given a blog object and its author ID, we determine which of the three nodes should hold it using, say, a hash function: we pass the author ID through the hash function, it tells us which shard that key resides on, and we go to that shard and place the data. This is classic hash-based partitioning; you could also go for range-based partitioning or consistent hashing, pick your favorite implementation.

Things become really interesting when we look for something specific. Say I want to fire this query: given a user ID, get me all the blogs of that particular user. The flow is really easy, because my partition key is the author ID.
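The hash-based routing just described can be sketched in a few lines. This is an illustrative sketch, not any real database's implementation; all names (`shard_for`, `put_blog`, and so on) are hypothetical:

```python
# Illustrative sketch: hash-based partitioning of blogs by author ID.
# Shards are modeled as in-memory dicts of blog_id -> blog.
from hashlib import md5

NUM_SHARDS = 3

def shard_for(author_id: str) -> int:
    """Map an author ID to a shard using a stable hash."""
    digest = md5(author_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def put_blog(shards: list, blog: dict) -> None:
    """Place the blog on the shard owned by its author (the partition key)."""
    shard = shards[shard_for(blog["author_id"])]
    shard[blog["blog_id"]] = blog

def blogs_by_author(shards: list, author_id: str) -> list:
    """Partition-key query: only ONE shard needs to be consulted."""
    shard = shards[shard_for(author_id)]
    return [b for b in shard.values() if b["author_id"] == author_id]
```

Note that MD5 is used here only as a stable hash: Python's built-in `hash()` is randomized per process, so it would not route keys consistently across restarts.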
Given a user ID, which is the author ID, I pass it through the hash function; whatever shard it spits out, I go to that node and fire the query: SELECT * FROM blogs WHERE author_id = ?. I get all the blogs for that user and send them back. A pretty straightforward query that works like a charm, and it worked really well because we were querying on the partition key itself.

Now let's take another example. Say every blog has a category, a topic the blog belongs to: MySQL, NGINX, Go, Python, whatnot. What we want to query now is: get all the blogs that belong to a particular category, say MySQL.

Let me take a concrete example: I have two data shards holding four blog items, distributed two by two. User U1 wrote a blog with ID 1 belonging to the MySQL category, with some title and body; because U1 passed through the hash function spits out one, it goes to shard 1. U1 wrote another blog, with ID 9, on the Go topic; it also resides on shard 1, again because the author is U1. Now say U3 wrote two blogs, one on NGINX and one on MySQL; they go to shard 2, because U3 passed through the hash function spits out two.

Given this layout, consider the naive solution to get all the blogs tagged with a particular category, say MySQL. We can clearly see that the blogs tagged MySQL are not present on one node; they are present on both shard 1 and shard 2. So when the request hits the database proxy, it needs to fan out the request to all the nodes, fire the query on each node, merge the responses, and send the merged result back to the user.
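That scatter-gather path can be sketched as follows, again assuming in-memory dict shards and hypothetical helper names:

```python
# Illustrative sketch of the naive scatter-gather (fan-out) query path;
# shards are modeled as in-memory dicts of blog_id -> blog.
from concurrent.futures import ThreadPoolExecutor

def query_shard(shard: dict, category: str) -> list:
    """Run the category filter on a single shard."""
    return [b for b in shard.values() if b["category"] == category]

def blogs_by_category_fanout(shards: list, category: str) -> list:
    """Fan out to every shard in parallel and merge the partial results.
    Overall latency is gated by the slowest shard, and a dead shard
    stalls the whole query until it times out."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = pool.map(lambda shard: query_shard(shard, category), shards)
        return [blog for partial in partials for blog in partial]
```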
Now, for every query like this, "get all the blogs tagged with a particular category", I have to fan out my request across all the database shards, combine the results, and send them back to the user. Even while explaining it, this felt really slow; it is really slow, and there are a bunch of risks involved.

Risk number one: what if one of the shards is overburdened? The request went to both shards in parallel, but one shard is slow, which means that although one shard responded quickly, you have to wait for the second shard to respond before you can emit the response to the user. So if one slow shard affects the user experience. Worse, what if one shard is dead? Either you wait until the timeout happens, or you send an incomplete result; that is another risk. Third, given that you might be paginating, you are firing the query on both shards and getting the results, so a huge amount of data is transferred, only to be filtered out or paginated away before it is sent to the user. So this is also expensive.

Given all this, there has to be a better way to index the data, which is where we get introduced to the concept of global secondary indexes. Given that our query is "for a particular category, give me all the blogs that belong to it", what we need to do is maintain a separate global index for the secondary attribute, category. Most databases abstract this for you; DynamoDB, for example, calls it a Global Secondary Index, and it is like a secondary table that you have. That's the whole idea: you create a global secondary index, which can itself be sharded internally, that's not a problem, but it is partitioned by the category attribute. So all the posts that belong to, say, MySQL pass through the hash function and spit out the same value, meaning all MySQL posts come to shard 3 and all Go posts go to shard 4, for example.

Here I have drawn these as separate shards, but they are not necessarily a separate set of machines; the index could co-reside with your existing data nodes. It totally depends on the implementation. For simplicity I've drawn it as a separate cluster, but you might not need that, because the database abstracts these things out for you: it can decide to co-locate the index on the data nodes, store it as a separate B+ tree on disk, or however it wants. The idea is that the logical separation is very clear between where your global secondary index resides and where your data resides. It is a logical separation, not necessarily a physical separation.

Okay, now, given that we have the data stored this way, if we are looking for "given a category, give me all the blogs that belong to it", I would directly fire the query against this global secondary index, because this data is already partitioned by the category I am looking for. So if I want all the blogs that belong to the MySQL category, I can just fire the query SELECT * FROM blog_category_gsi WHERE category = 'mysql'. When the request goes out, I know which index shard MySQL is present on by passing the category through the hash function; I go there, fire the query, get the blog IDs, fetch the data, and respond back. This is how simple it becomes. So what we did is: from the data shards that we had, we created a global secondary index on an attribute we wanted to query, and on this index we are firing the query SELECT * FROM blog_category_gsi WHERE category = 'mysql'.
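A key-only GSI lookup can be sketched as below. This assumes the index stores (data shard, blog ID) references, and all names are illustrative:

```python
# Illustrative sketch of a key-only global secondary index (GSI).
# The GSI maps category -> [(data_shard_id, blog_id), ...] and is itself
# partitioned by category, so a category query hits exactly ONE index shard.
from hashlib import md5

def index_shard_for(category: str, num_index_shards: int) -> int:
    """Route a category to its index shard with a stable hash."""
    return int(md5(category.encode()).hexdigest(), 16) % num_index_shards

def gsi_put(gsi_shards: list, category: str, data_shard_id: int, blog_id: int) -> None:
    """Record a (data shard, blog ID) reference under the category key."""
    shard = gsi_shards[index_shard_for(category, len(gsi_shards))]
    shard.setdefault(category, []).append((data_shard_id, blog_id))

def blogs_by_category_gsi(gsi_shards: list, data_shards: list, category: str) -> list:
    """One index-shard lookup, then targeted fetches from the data shards;
    no fan-out across every shard."""
    shard = gsi_shards[index_shard_for(category, len(gsi_shards))]
    refs = shard.get(category, [])
    return [data_shards[sid][bid] for sid, bid in refs]
```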
Again, I wrote a SQL query here, but it depends on the database and what it exposes; I just made it up for illustration. Now, if you look carefully at this query: because the global secondary index is partitioned by category, the query only needs to go to one index instance, get the blog IDs, then go to the data shard, read the actual objects, and send them back to the user. You don't need to query multiple shards for this. You literally fire one query, get the IDs, go to the place that holds the blog details, combine the result, and send it back. It makes your life really easy and your query really efficient.

Now, there are multiple implementations here. First, your global secondary index can store just a reference to the primary key: if I'm creating a global secondary index on category, I can choose to store the category and the row ID, that is, the blog ID. Or, second, I can choose to store the entire blog object there. As always when you have multiple choices, you evaluate both, and a database can choose to implement either one.

If we store just the primary key in the index, then when a request comes to the DB proxy, the proxy first goes to the index shards and gets the blog IDs, then goes to the corresponding data shards for the objects you want, gets the blog details, and sends them back to the user. There is no unnecessary fetching of data from the data shards; you fetch only the data you require, and that's really efficient. Alternatively, if we choose to store all the attributes in the global secondary index for a row, the entire document is repartitioned and indexed there, which means the index holds not just the blog ID but the entire blog object. In that case, the request goes to the index, you get the data immediately, and you send it back to the user; there is no need to look up the data shards, because the entire document resides in the global secondary index itself. Both of these options are available in DynamoDB, for example; you set it as a configuration when creating a global secondary index.

Here we see a classic trade-off. If you store just the primary-key reference, you do one lookup on the index shards and then, depending on the primary keys you matched, go to the data shards, read the corresponding documents, and send them back; that is multiple lookups. If you store all the attributes in the GSI, meaning the entire document lives in the GSI, you bloat up the index size but you save a lookup. It's a classic space-versus-time trade-off that you decide on here.

Another challenge that comes in: you need to keep the GSI in sync when the main data is manipulated. Any update that changes a particular document means you have to update the index as well, and this needs to be done synchronously, because most databases offer strong consistency with indexes. That becomes another problem: if you have a large number of GSIs, your updates and writes take a hit and become really expensive, because now you have to update not just the main data shard but all your index shards too. This is why global secondary indexes are expensive to manage and maintain, and why a lot of databases actually limit the maximum number of GSIs you can create; they don't allow an arbitrary number, and typically five to seven is the sweet spot. But you could build a database tomorrow that allows more GSIs than that, for example by making the index updates eventually consistent.
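The write amplification just described can be sketched as below; the GSI structure and attribute names are hypothetical:

```python
# Illustrative sketch of the write path when GSIs are updated synchronously:
# every base-table write also updates each index, so write cost grows
# linearly with the number of GSIs.

def write_blog(data_shards: list, gsis: list, blog: dict) -> None:
    """Apply the write to the base shard, then to every GSI in the same
    operation (this is what makes many GSIs expensive for writes)."""
    data_shards[blog["shard_id"]][blog["blog_id"]] = blog  # base write
    for gsi in gsis:  # one extra write per GSI
        key = blog[gsi["partition_attr"]]
        gsi["entries"].setdefault(key, set()).add(blog["blog_id"])
```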
So that is global secondary indexes. DynamoDB has a very famous implementation, but pick any distributed database in the world and it will have a flavor of GSIs somewhere in its internal implementation, because co-locating the index data in one place is what makes these queries efficient. It's a very standard practice out there.

Now, what is the opposite of a global secondary index? A local secondary index. What if the query we want is: give me all the blogs for a particular category from a particular author, and this is the only type of query we will ever have? Hypothetically, assume we would never fire "give me all the blogs for a particular category" alone; we would always be querying "for a particular category and a particular user, give me all the blogs". In that case, given that our query always contains the partition key, we can create a local secondary index rather than a global secondary index. If your partition key is always going to be part of your query, you do not need a global secondary index; you can create a local secondary index, and this index is localized to a particular shard. Given that shard 1 holds all the documents of user U1, that is, all of U1's blogs, you can build a local index over them, say as a local B+ tree, and that index is good enough to answer the query "for a particular user and a particular category, give me all the blogs" from the single node itself. No need to fan out the request and gather responses; that is the advantage you get. So, depending on your query pattern and the query clauses you will be firing, you need to decide whether you need a local secondary index or a global secondary index.
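A minimal sketch of a local secondary index, assuming each shard keeps its own in-memory map keyed by (author, category); the class and method names are illustrative:

```python
# Illustrative sketch of a local secondary index (LSI): each shard keeps
# its own (author_id, category) -> blog IDs map, so any query that
# includes the partition key is answered entirely from one shard.
from collections import defaultdict

class Shard:
    def __init__(self):
        self.rows = {}               # blog_id -> blog (the base data)
        self.lsi = defaultdict(set)  # (author_id, category) -> blog IDs

    def put(self, blog: dict) -> None:
        """Base write and local index update happen on the same node,
        which is why strong consistency is easy for an LSI."""
        self.rows[blog["blog_id"]] = blog
        self.lsi[(blog["author_id"], blog["category"])].add(blog["blog_id"])

    def blogs_for(self, author_id: str, category: str) -> list:
        """Single-node lookup; no fan-out needed."""
        return [self.rows[bid] for bid in self.lsi[(author_id, category)]]
```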
By default, any and every distributed database in the world has a flavor of this, either explicitly exposed to us, the consumers of the database, or implicitly implemented inside it. So depending on the database you are picking, go through the documentation and figure it out. And if you look carefully at the local secondary index: because it is local, it is easy to ensure strong consistency, since your writes go to the same instance where the index is placed, so you can have a very strongly consistent implementation here, and the response will always come from a single node, with no need to fan out. But you are limited to the local shard. If you have a local secondary index on category, you will never be able to fire an efficient query like "given a category, give me all the blogs"; you will always be firing "given a category and a user, give me all the blogs". That is its limitation. But if your query is always going to include the partition key, a local secondary index gives you a really good boost over creating a global secondary index.

So this is what I wanted to cover about indexes. It is a very fundamental concept in any distributed database in the world; they are either explicitly exposing it or implicitly managing it. Either way, pick any database and go through its internals; it's a fascinating domain. And try to implement this yourself; it's a really easy piece to implement, so if you find the time, go ahead and prototype it, it's quite fun to be honest. If you're interested in going deeper into how an index is created using a B+ tree, I already have a video on it; I've linked it in the i-card and in the description below, so feel free to check that out. And that's all I wanted to cover in this one. I hope you found it interesting, I hope you found it amazing. That's it for this one; I'll see you in the next one. Thanks!
Info
Channel: Arpit Bhayani
Views: 19,002
Keywords: Arpit Bhayani, Computer Science, Software Engineering, System Design, Interview Preparation, Handling Scale, Asli Engineering, Architecture, Real-world System Design, Indexes in Distributed Database, Local Secondary Index, Global Secondary Index, DynamoDB Internals, How indexes work, Database Indexes, Distributed Indexes
Id: eQ3eNd5WbH8
Length: 16min 20sec (980 seconds)
Published: Fri Mar 01 2024