S3 system design | cloud storage system design | Distributed cloud storage system design

Captions
Hello everyone, my name is Narendra, and in this session let's understand how to design a distributed storage system, for example AWS S3 or Azure Blob Storage. Distributed storage is a place in the cloud where you usually create an account, add credit card information, and then start uploading your files. The file size could be small or it could be huge, it doesn't really matter; you keep on uploading files into the system and the system scales automatically. These systems are called scalable or elastic systems. You pay only for the amount of storage you have consumed, or for the number of times you have accessed a specific object, file, or blob you uploaded. If we want to design a similar system, how do we do that? Let's understand exactly that.

First we have to define the goals of the system. The first goal is durability. For example, AWS S3 says they give a durability of 99.999999999%, that is eleven nines. What that means is the system is highly durable: when you write a file into it, you will most likely never lose that file; it will always be present in the system and you can always access it back. Have you ever wondered how they can claim eleven nines of durability? They calculate it using Markov chains; it's also called reliability analysis using Markov chains. I'm not going to go into that here, but you can read about it, it's interesting (a toy version is sketched right after this list of goals).

The second goal is availability: we want our system to be always available. S3 says 99.99% availability, so we want our system to be highly available as well.

The third goal is multi-tenancy. Many people can create accounts and start uploading their files into the system, and we don't want to deploy the system separately for each person or account. The same system should serve every customer, each customer should see only their own files in their respective console, and one system handles all the customers; that's what multi-tenancy means. We also want to support virtual-hosting-style access to the files and data. For example, on S3 you can access your bucket as bucket-name.s3.amazonaws.com, or the other way around as s3.amazonaws.com/bucket-name/path/to/file. We have to provide that feature as well.

The fourth goal is that the system should be scalable. As I said, the customer shouldn't have to worry about scaling the system at all; the system should scale automatically so customers can keep on uploading however many files they want.

The fifth goal is region-specific buckets. These kinds of systems usually give you the ability to create a bucket in a specific region, upload your files there and read them back, and also the ability to replicate those files into multiple other regions and read from there. A couple of systems also give you the ability to write or update files from any region, but such systems will usually not guarantee consistency of writes.

And the sixth goal is a secure layer of access: all of our APIs should work over HTTPS, and the content should be securely stored in our storage system.
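As promised under the durability goal, here is a minimal sketch of the Markov-chain idea, assuming a three-replica object, made-up daily failure and repair probabilities, and a discrete-time chain; it only illustrates the approach, not any provider's actual model:

```python
import numpy as np

p_fail = 1e-4    # assumed daily probability that one replica's disk dies
p_repair = 0.5   # assumed daily probability that a lost replica is rebuilt

# States: number of surviving replicas [3, 2, 1, 0]; 0 = object lost (absorbing).
P = np.array([
    [1 - 3 * p_fail, 3 * p_fail, 0.0, 0.0],
    [p_repair, (1 - p_repair) * (1 - 2 * p_fail), (1 - p_repair) * 2 * p_fail, 0.0],
    [0.0, p_repair, (1 - p_repair) * (1 - p_fail), (1 - p_repair) * p_fail],
    [0.0, 0.0, 0.0, 1.0],
])

state = np.array([1.0, 0.0, 0.0, 0.0])   # start with all three replicas healthy
for _ in range(365):                      # one year of daily steps
    state = state @ P

print(f"P(object lost within a year) ~ {state[3]:.2e}")
print(f"durability ~ {1 - state[3]:.12f}")
```

The published numbers come from much richer chains that model correlated failures, repair bandwidth, and whole disk populations; this is only the skeleton of the calculation.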
Now let's understand how to build this as a distributed storage system. Instead of jumping straight to the complex design, let's take a bottom-up approach: start with a simple design and keep scaling it up into the complex design that satisfies all of the system design goals we just discussed.

The very simplest design is one server with a couple of APIs exposed and a couple of TBs of storage in it. Say we have about 20 TB of hard disk in the server and a couple of APIs: one to upload a file and one to list the files in a directory. If a person wants to create his bucket, he calls the API to create a bucket, which basically creates a folder on the hard disk, and then he can start uploading files into it. This works smoothly. The problem with this approach is that these APIs can handle only so much traffic, because there is always an I/O limit, and the number of requests any single server can handle is also limited. You also can't keep writing a lot of files in parallel, because the hard disk will choke. So this system is not really scalable.

The next thing we can do is simply add one more server. In this next version of the system we scale horizontally: we have two servers, both with 20 TB of capacity and the same APIs, so now we have double the traffic-handling capacity and double the storage, 40 TB in total. The problem is that the user now has to decide which server he is writing into. We will obviously be using some load-balancing technique, so when the user makes a request to the load balancer, sometimes the request might end up on the first server and sometimes on the second. If you use sticky sessions, things might work, but it is not always reliable; what if your code is making the call to access a file? If you upload a file, it might get created on one server, but when you read it the request goes to the other server, where the file is missing. So this is not a reliable approach at all.

What we can do to solve this problem is to separate the API part and the data storage part. In this version of the system I have separated out the API servers and the storage servers completely. The API servers are commodity machines: we might have a hard disk on them, but a very minimal one; they are mostly CPU- and RAM-intensive machines dedicated to accepting the incoming requests to read, write, and list buckets. All of those operations are handled by these servers, similar to the ones before, but separated out. And we have one more layer, called the metadata layer or metadata service. This could be another service or server which also has a database, because we need to track which file is on which server.

What happens now is that when a file is uploaded, the upload request might go to any API server; it doesn't matter which one. The API server first talks to the metadata service and asks which storage server it should write the file to. The metadata service creates an entry for the specific file we are trying to upload and marks it as being on server 1 or server 2.
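Here is a minimal sketch of that metadata service, with hypothetical server names and a naive round-robin placement; the point is just that any API server can ask where a file was placed:

```python
import itertools

class MetadataService:
    def __init__(self, storage_servers):
        self.location = {}                            # file key -> storage server
        self._rr = itertools.cycle(storage_servers)   # naive placement strategy

    def assign_server(self, key):
        """Called on upload: pick a server and remember the placement."""
        server = next(self._rr)
        self.location[key] = server
        return server

    def lookup(self, key):
        """Called on read: any API server learns where the file lives."""
        return self.location.get(key)

meta = MetadataService(["storage-1", "storage-2"])
target = meta.assign_server("bucket/cat.jpg")    # API server A handles the upload
print(target)                                    # e.g. storage-1
print(meta.lookup("bucket/cat.jpg"))             # API server B can still find it
```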
So the metadata service tells the API server: OK, you can go ahead and write it to, say, storage server 1. The API server goes to data storage server 1, writes the file there, and returns a 200 or whatever the appropriate status code is. The next time that same file is read, the request might not go to the same API server; maybe it lands on another one. Before that API server reads anything, it goes to the metadata service and asks for the location of the file, and the metadata service answers "data store 1", because we made an entry that this file was written into data store 1. Now we know exactly where that file is present, we make a call to that respective server, and we get the file back. So everything is working well.

The problems with this system: what happens if the metadata service goes down? We will have to scale that as well. And what happens if a data storage server goes down? We will have to keep replicating it to a couple of servers. Now, how do we replicate, and who is going to do the replication? There are many questions here. One way is to hand the replication responsibility over to the data storage servers themselves; but then who takes care of it when that replication breaks? The simple way is to let the metadata service manage that too. It talks to the storage servers continuously, keeps checking their health, and also keeps checking how much storage is available on each. Based on that, it tells the API servers where to write and what the situation of the servers is: if server 1 is down, it tells them to go to server 1's replica instead of server 1, or to server 2's replica instead of server 2. It also takes on the responsibility of replicating the copies, or instructs the data storage server itself to keep one more copy of a recently written file on another storage server. That way the system keeps working.
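A tiny sketch of that health-aware lookup, under the assumption that each server has a single replica with a predictable, hypothetical name:

```python
def resolve(location, healthy, key):
    """Return the server to read from, falling back to the replica."""
    primary = location[key]                  # e.g. "storage-1"
    if healthy.get(primary, False):
        return primary
    return primary + "-replica"              # assumed replica naming scheme

location = {"bucket/cat.jpg": "storage-1"}
healthy = {"storage-1": False}               # heartbeat said the primary is down
print(resolve(location, healthy, "bucket/cat.jpg"))  # -> storage-1-replica
```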
If you look at this on a high level, we now have separate API servers, and a separate service which takes care of replication, the health of the storage servers, and tracking where the files are. This is the very high-level system design for distributed storage. How can we scale it? Easily, by adding more storage servers; as and when we add servers, we update the metadata service so it knows about them, and maybe add more servers as replicas too. We also need to make sure a server and its replica are not in the same data center or region; one should be placed in a different region. Say the bucket was meant to be created in the Europe region: then these are the primary servers, in Europe, and those are the secondary servers, in Asia. Why do we have to replicate into a different region? Because if something happens to the data center in Europe, we lose everything; so we keep replicating the data into some other region where it is safe. And maybe we shouldn't stop at two copies: we can keep one more replica server for every server, maybe in Australia or somewhere, and keep copying all of this data to those servers as well.

Any time we write a file, we can replicate it to these other servers synchronously or asynchronously. If we choose synchronous, it means we absolutely don't want to lose the file; if we choose asynchronous, there are situations where we might lose it. Let's go slowly: a write operation comes in, the file is written to the primary server, and we respond with 200, "the write is successful", even before we asynchronously replicate it to the other servers. What happens if that primary server now goes down and all of its data vanishes? We lose the file completely, and we can't achieve eleven nines of durability. That's why, for writes, it's good to replicate synchronously: we don't tell the client we have created the file unless we have successfully replicated it to all of these regions. Those are the strategies.
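To make the trade-off concrete, here is a small sketch of the two write paths; `Store` is just a stand-in for a storage server, and all names are hypothetical:

```python
def write_sync(primary, replicas, key, data):
    primary.put(key, data)
    for r in replicas:                     # block until every region confirms
        r.put(key, data)
    return "200 OK"                        # ack only after all copies exist

def write_async(primary, replicas, key, data, pending):
    primary.put(key, data)
    pending.append((replicas, key, data))  # replicate later, in the background
    return "200 OK"                        # ack immediately; if the primary dies
                                           # before the queue drains, data is lost

class Store(dict):
    put = dict.__setitem__                 # stand-in for a storage server's write

primary, eu, asia, pending = Store(), Store(), Store(), []
print(write_sync(primary, [eu, asia], "bucket/a.txt", b"hello"))
print(write_async(primary, [eu, asia], "bucket/b.txt", b"world", pending))
```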
Now that you have the basic idea of how this design came into existence, let's understand it in depth. Here is the system design for distributed data storage. Before going through the diagram, it's better to understand some terminology about data centers. First, what does a cluster look like? On the left and right sides of the diagram you can see racks, and in each rack there are a lot of servers; each row is a different kind of server, and you can slide servers in and out whenever you want, with the cables connecting into the back of the rack. These servers don't look like the desktops we usually use. One thing to note is that each server can have redundant power supply and network for high availability: if power line one goes down, the server is still connected to power line two and uses that secondary line to stay up and running; likewise, if the first network line goes down, it uses the second network line to keep connectivity with any of the other servers inside or outside the data center. You also need to understand the terminology of regions and availability zones; I'm sure you know this from AWS. For easy understanding, think of regions as different continents, and availability zones as different data centers on the same continent, where those data centers are not placed together in one location but spread out, so even if one data center goes down you still have another two data centers on the same continent. Those are the basics.

Now let's understand the system design diagram. Except for DNS and the cluster manager, everything you see below them is a component inside one cluster, and the same cluster will exist in two other regions. If we want multiple replication, say three copies, we need to run the same kind of cluster in two other regions; this representation is only for one region.

The first thing to understand is the cluster manager. It is at the top of the hierarchy and it manages all the clusters. Here are its important functionalities. The first is account creation: when a user creates a new account or a bucket with the cloud service provider, the create-account call basically goes to the cluster manager first. The cluster manager creates the bucket, and then we have to provide a virtual-hosting-style hostname, something like bucket-name.s3.location.amazonaws.com. So the cluster manager constructs that name, the bucket name plus the actual domain name, and updates the entry in DNS with the IP of this cluster's load balancer. The load balancer is the entry point for all requests to this distributed storage: whatever the API, whether you are uploading a file, reading a file, or listing all the files in a bucket, you hit this load balancer to perform the operation. So the subdomain must be mapped to the entry point, the load balancer, and the DNS is updated so that whenever we visit bucket-name.s3.whatever-hostname, it resolves to this cluster.
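A hypothetical sketch of that create-bucket flow; the hostname convention, domain, and names are illustrative, not the real AWS API:

```python
dns = {}        # stand-in for the real DNS zone
buckets = {}    # stand-in for the cluster manager's account database

def create_bucket(name, region, lb_ip):
    """Register the bucket and point its virtual-hosting hostname at the LB."""
    buckets[name] = {"region": region}
    hostname = f"{name}.s3.{region}.example.com"   # assumed naming convention
    dns[hostname] = lb_ip                          # all requests enter via the LB
    return hostname

host = create_bucket("my-photos", "eu-west-1", "203.0.113.10")
print(host, "->", dns[host])
```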
In this design, one specific region accepts the writes (and also has the capability to serve reads), and the other two regions hold data replicated from this write-enabled cluster; those two clusters can also be used for reading only.

The cluster manager also handles disaster recovery. If for some reason this whole cluster is down, the cluster manager knows we are already replicating two more copies into the other two clusters with a similar setup, so it immediately changes the DNS entry: instead of pointing to this load balancer, it points to the same kind of cluster in another region. All requests then keep going to the other cluster until this system recovers or is fixed. Hypothetically it shouldn't come to that, but in the worst case, if a power outage, storm, or flood happens and the whole data center is flooded, someone has to physically go and fix it, and until then your service shouldn't be down; it should stay highly available. The cluster manager itself is replicated and distributed as well: anywhere I have marked an asterisk in the diagram, that component itself is replicated (I didn't want to complicate the diagram by drawing multiple copies of it). Its database is also replicated, so it never goes down; it manages all of these clusters and redirects requests from one cluster to another, which then becomes the write-accepting cluster. That's the overall picture.

The next functionality is resource tracking. The cluster manager keeps looking at the resources in the cluster and how much is consumed: how much of the total storage is utilized, how much space is left, and whether we need to add more servers to expand. The storage has to behave elastically, so it keeps computing this and keeps showing the admins when we need to add more servers; that's its responsibility. The next one: it holds the policies, authentication rules, authorization rules, and so on, because one of the important requirements in distributed storage is authorization policies; an admin should be able to do everything, some users should only have access to a specific bucket or folder, and such policies are all handled here, saved in the database. And finally, it manages the clusters, as I mentioned: when disaster recovery has to happen, or anything goes down, this is the component that takes the action.

Before understanding each individual component, it's better to understand the very high-level working of the system. Suppose a user has a file to be uploaded to this distributed storage. How does that work? The user makes a call to the system using the subdomain, bucket-name.s3.whatever; the DNS resolves it to the load balancer, the request lands on the load balancer, and the load balancer picks one of the API servers based on some strategy: the load, round robin, random, whatever. The request lands on the API server along with the file. The API server has a couple more responsibilities: it looks up the authorization policies and the API key this user is using from the database and validates whether this operation is permitted for this key or user. If the file upload is allowed, it has to find the server which will process this request.

This design is based on layering: the layer drawn in green is called the partition layer, and the layer in purple is the streaming layer. This is essentially how Azure builds it; I had to read a lot of Azure-related documents to understand it. So what happens is that the API server looks into the partition map table. For any given file, this table tells you which partition server should handle that particular request. How does it find that out? When the file comes into the server, it obviously has to assign a unique ID to the file, so the API server assigns a UUID, which is always unique. Once you hash that UUID, you get some number, and you have to find the server which can handle it based on a range partition.
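A minimal sketch of that hash-then-range lookup, with made-up ranges and server names:

```python
import bisect, hashlib, uuid

# Partition map table: range upper bounds and the server owning each range.
upper_bounds = [100, 200, 300]
servers = ["ps-0-100", "ps-100-200", "ps-200-300"]

def partition_for(file_id: str) -> str:
    digest = hashlib.sha256(file_id.encode()).hexdigest()
    n = int(digest, 16) % 300                  # map the hash into the key space
    i = bisect.bisect_right(upper_bounds, n)   # first range whose bound exceeds n
    return servers[i]

fid = str(uuid.uuid4())                        # UUID assigned by the API server
print(fid, "->", partition_for(fid))
```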
Suppose the ranges assigned to the partition servers are 0 to 100, 100 to 200, 200 to 300, and so on; there could be thousands of partition servers in any given cluster. The partition map table contains the ranges and the corresponding partition servers' IP addresses. So once I have the UUID, I hash it, get the number, look into the partition map table, and find out which partition server will handle this particular file upload. Say I found the responsible one among all those partition servers: the API server hands the file over to that partition server.

Once the partition server receives the file, it talks to the streaming layer. The streaming layer is like one big file system, built using a lot of small, disk-heavy servers; one file server here would have about 20 to 30 terabytes of storage in it. The whole layer looks like a distributed file system. The partition layer doesn't know where exactly the file's bytes live, and it doesn't care; it just talks to its stream. Basically every partition server has one stream, and a stream is composed of many file servers; you can think of it as a linked list of file servers. Say every file server has about 30 TB of space. As and when they fill up, they stay in the chain: two of them might already be full, while the newest one maybe has only 10 TB filled and 20 TB left, which means the partition server can still use that file server to write another 20 TB of data. Once that is filled, a new empty file server is added at the top of the stream, and that information is recorded for the partition server as well. The combination of all of these file servers makes up a stream; always think of it like that. This whole group of file servers is assigned to this partition server: the server at the top always has empty space, and the rest are completely full, just sitting there.
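Here is a small sketch of a stream behaving as described, including the roll-over to a fresh file server once the head fills up (that's the sealing discussed next); capacities are illustrative TB figures:

```python
class FileServer:
    def __init__(self, name, capacity_tb=30):
        self.name, self.capacity, self.used = name, capacity_tb, 0

    @property
    def free(self):
        return self.capacity - self.used

class Stream:
    def __init__(self, name):
        self.name = name
        self.servers = [FileServer(f"{name}-fs1")]   # head = writable server

    def append(self, size_tb):
        head = self.servers[-1]
        if head.free < size_tb:                      # head is full: it stays
            self.servers.append(                     # sealed, add a fresh one
                FileServer(f"{self.name}-fs{len(self.servers) + 1}"))
            head = self.servers[-1]
        head.used += size_tb                         # append-only write
        return head.name

s = Stream("ps-123")              # stream named after its partition server
for _ in range(4):
    print(s.append(10))           # fills fs1 in three writes, then rolls to fs2
```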
Now the partition server takes the file and writes it into that file server, and its job is done. Once the file is written there, the job of the streaming layer is to replicate it on its own; the partition server doesn't care about that, it just writes into the streaming layer. What is the advantage of having these layers? You can parallelize and scale them. If you are getting more and more requests, you can't really parallelize writes to a single disk, because the disk sits in one place; but with multiple disks you can make it parallel. The way to parallelize is to have multiple partitions: you can have thousands of partition servers and assign a new range to each. You can use consistent hashing, or range-based partitioning, or whatever you want; the idea is that when a file comes in, you route it based on some strategy. I'm using the UUID strategy to partition; maybe you could use the account name itself, for example all account names starting with a given letter are handled by one partition server. We just need to distribute the incoming requests across the partition servers based on some strategy. I'm using range-based partitioning, and that mapping is all stored in the partition map table. So that's the overall idea of how this works.

Now let's understand these layers in depth, starting with the streaming layer. What the streaming layer receives is the file, or a chunk of the file; if the partition server thinks the file is too big, it can chunk it and send pieces of the file. So what it gets is a file, or think of it as a blob of data. The streaming layer's responsibility is to take that blob, store it on the hard disk, and also do the replication. I have listed all the responsibilities of this layer.

The first responsibility is append-only writes: the data it gets from the partition layer should be written to the hard disk in append-only fashion; you shouldn't be updating some random place. The reason is the kind of hard disks we use in these file servers: spinning disks, not SSDs, because SSDs are too costly. If you want to offer the service in a cost-effective way, you use the cheapest disks available, which are spinning disks, and spinning disks give high performance only if you keep appending data to the file or to the disk; you shouldn't do random writes. So always make sure the data we get from the partition layer is appended, never written randomly. That's the first responsibility.

As I already mentioned, each partition server has its stream; think of it as a list of file servers where only the first server has space left and the rest are completely sealed, meaning those servers are full and there is no space to write there. That's the concept of sealing, and sealing the file servers is the stream manager's responsibility. Suppose a file server has a total size of 30 TB and is left with only 10 TB; once 10 TB more of file data is written into it, the space left in that file server is zero. The stream manager's responsibility is to keep monitoring all the active writable file servers in this layer: it monitors their health and it monitors how much space is left. As soon as it sees that a file server has no space, it seals it off, basically marking the file server as full in its database, and adds one more server to the top of the stream. That way, when further writes come to this partition server, there is a new file server with empty space to write into. That mapping is also updated, not in the partition map table but in the stream manager's own database, so it knows the latest file server available in this particular stream.
Every time the partition server wants to write, it asks the stream manager: what is the latest disk where I can write this file in this particular stream? Think of streams as named after their partition server: if the partition server is named 123, its stream is also named something like 123. When the head of the stream is full, the stream manager adds a new file server with empty space, so the partition server can keep writing, always in append-only fashion.

How does the stream manager know which file servers are empty? That's where the cluster manager comes into the picture. The cluster manager keeps an eye on the resources available. We don't want a lot of servers sitting idle with empty space, but we always have to keep some buffer. Suppose we have ten file servers of 30 TB each: that adds up to about 300 terabytes, and we should keep roughly another 30 percent of storage as buffer. We can't just keep purchasing file servers and stacking them in our data centers; that's not cost-effective, so we place orders only as needed, and since not every customer will show up at once and utilize everything, a 30 percent buffer is reasonable. Those empty servers are available to the stream manager: it knows where they are, it just keeps their IP addresses in its table, and writes keep going to those machines.

The next responsibility of the stream manager is garbage collection. Suppose a user wants to delete some file: he makes an API call to delete it. From the partition map table we know which partition server is supposed to handle that request, and from the stream manager's database we know which file server in that stream holds that specific file, so we go to that file server and delete the file. Now there is empty space available; but do we tell the partition server "there is space available here, keep writing into it"? No, because the first rule says append-only, meaning we shouldn't do random writes into the hard disk. If space opens up, let it be; we recover it later. That process is garbage collection: the stream manager looks through the sealed file servers, where you're not allowed to write anymore, sees how much space can be freed up everywhere, shuffles and moves files into the empty spaces, and updates all the mappings in the table. That way, out of ten or fifteen servers' scattered empty spaces, you can drain one server completely by moving its data into the holes elsewhere, and that freed, empty server can be used in a stream later, when the file server at the top is running out of space. That's how garbage collection actually works.
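A toy sketch of that compaction: repack the surviving files of sealed servers so at least one server is fully drained and can be returned to the empty pool. Capacities and names are made up:

```python
def compact(servers, cap=30):
    """servers: dict name -> list of (file_id, size). Returns freed server names."""
    files = [f for flist in servers.values() for f in flist]  # survivors only
    names = sorted(servers)
    packed = {n: [] for n in names}
    it = iter(names)
    cur, used = next(it), 0
    for fid, size in files:               # greedily repack into as few servers
        if used + size > cap:             # as possible, in append-only fashion
            cur, used = next(it), 0
        packed[cur].append((fid, size))
        used += size
    servers.update(packed)                # update the stream manager's mappings
    return [n for n in names if not packed[n]]

sealed = {"fs1": [("a", 10)], "fs2": [("b", 10)], "fs3": [("c", 5)]}
print(compact(sealed))   # everything fits on fs1 -> ['fs2', 'fs3'] freed
```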
The fourth responsibility of the stream manager is replication. None of the other layers is responsible for replicating; it's the streaming layer's responsibility, since it is the layer that actually saves the data. From the cluster manager, the stream manager also knows which file servers in the other regions it can use for replication. Any time a partition server writes a file into a file server, the stream manager's responsibility is to take that file, or chunk of a file, and replicate it to file servers in the other regions. In total we replicate two more copies, one in each of two other regions, so anything written to this file server always has two copies in other data centers.

There are two strategies here, as I mentioned: synchronous and asynchronous. The way S3 does it is that they guarantee the first time you create a file, it is consistently available everywhere, and when you later update it, it is available with eventual consistency. That means the first write is replicated synchronously and later updates are replicated asynchronously.

One more strategy is to keep one replica internally, in the same cluster in the same region, plus the two copies outside. The reason is this: suppose a partition server writes a file here, so the first copy is written into this file server, and we send the acknowledgement back even before replicating asynchronously to the other two regions. What if this file server dies immediately after we write, before we replicate? If we are replicating synchronously it's fine, because we don't send back the success status code until we have replicated to the other regions; but if we are doing an update with asynchronous replication and the file server dies before the replication even starts, we lose that particular update entirely. So what we do is synchronously replicate to one more file server available in the same cluster; think of it as a replica server kept in the same cluster whose only responsibility is to hold these temporary replica copies. The first time the file is written here, whether it's a fresh write or an update, we synchronously write into that replica as well, and only then say "OK, we have finished the write"; afterwards we slowly replicate the file asynchronously to the other regions. That way the writes are fast, and even if this server goes down we still have one more copy here, so we are safe. That's the strategy we can follow.
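A sketch of that hybrid write path, with one synchronous in-cluster copy before the acknowledgement and the cross-region copies shipped in the background; all names are hypothetical:

```python
import threading

def write(key, data, primary, local_replica, remote_regions):
    primary[key] = data
    local_replica[key] = data            # synchronous: survives a primary crash
    def ship():                          # asynchronous cross-region replication
        for region in remote_regions:
            region[key] = data
    threading.Thread(target=ship, daemon=True).start()
    return "200 OK"                      # safe to ack: two local copies exist

primary, local, eu, asia = {}, {}, {}, {}
print(write("bucket/a.txt", b"hi", primary, local, [eu, asia]))
```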
The next responsibility is health checking of the file servers. As I already mentioned, the stream manager keeps checking the health status of these file servers. If it finds that one of the file servers is not really healthy, it might decide to decommission it and copy its data somewhere else; and if it finds that a server which was already sealed off (because it was full) is having problems, is corrupted or something, it can remove that file server, add a new file server in its place, and copy all the replicated data back from the other region, because it knows where the data was replicated to; it copies it back, fills the new server, and puts it in place.

The way it knows all this: the stream manager keeps two tables. The first table contains the stream ID, the primary server responsible, and the names of the servers holding the first and second replication copies. This data is used when the partition server wants to write to its stream: it knows which stream it is writing into, but it doesn't know which file server it should write to, because new servers keep getting added, so it asks freshly before each write. From the table it knows that for stream ID 2, say, the primary server is 11, and the replica servers have different names, maybe R212 in region 2 and R313 in region 3, meaning a copy of the same data is present on server R212 in region 2 and on server R313 in region 3. So whenever the partition server wants to write, it asks the stream manager; the stream manager gets this information from the table and gives it back; the partition server then knows which file server it should write to, and writes to it. In the diagram I have drawn the stream like a linked list just for representation, but in reality these file servers are racked somewhere in the data center; you just need to know the IP you should keep writing data to, and that's the information this table gives you.
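A minimal sketch of that first table and the lookup the partition server performs before each write, reusing the example names from above (11, R212, R313):

```python
# Stream manager's first table: stream -> current primary and its two replicas.
stream_table = {
    "stream-2": {"primary": "11", "replicas": ["R212", "R313"]},
}

def where_to_write(stream_id):
    """Partition server asks the stream manager before every write."""
    row = stream_table[stream_id]
    return row["primary"], row["replicas"]

primary, replicas = where_to_write("stream-2")
print(primary, replicas)   # 11 ['R212', 'R313']
```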
One more concept you need to understand is the block group, and why we need block groups. From operating system basics, all data written to disk is written as blocks, and blocks are groups of sectors: on a spinning disk, the platter is divided into sectors, and a block is a group of those sectors. Why does a block contain multiple sectors? Because we only have a limited set of addresses in any operating system; to address more and more storage we group sectors into blocks, and each block gets an individual address. A block always has a fixed size. In our case as well we can define a fixed size; say we define our block size as 1 MB, and we also create an abstraction over these blocks called a block group.

Here's why we need block groups. Suppose we don't have them: users keep writing many files, each maybe just 1 MB in size. A file server holds about 30 TB, so it ends up holding thousands and thousands of these 1 MB files, and it becomes difficult for the stream manager to keep replicating all of those files individually. A much better way is to create an abstraction layer on top of the blocks: a block group of, say, 10 MB. The stream manager's job is then always to replicate the whole block group itself; if we lose a block group, or a file server crashes, its responsibility is to bring that block group back from wherever it was replicated. It's an abstraction over a group of blocks, and a block group can contain multiple files or a chunk of a file; it's up to us. Think of typical S3 usage: a lot of the time we store images, and these images will be a hundred KB or even less, or around 1 MB; the 10 MB figure I mentioned could be even higher. So whole block groups are replicated to the other servers, and when blocks go missing you bring back whole block groups; to the streaming layer the blocks just look like one big file, which makes replication and recovery easier.

The other advantage comes from appending all data in append-only fashion: in the table we record the offset of each file within the block group. For example, a 1 MB file might start at one block and end at another, so we store the file name (maybe its UUID), the start offset (say 0), the end offset (say 2), which primary server holds it (maybe server 11), and which servers hold replication copies one and two. With all this information stored, we know exactly where in a particular block group a file's data starts and where it ends, so we know that slice is the file we stored. And since we get huge traffic, a block group fills up in a couple of seconds, so instead of replicating bits and pieces of individual files we just replicate the whole block group across the different replication servers. You can build this without block groups as well; the only difference is that in that case you replicate the files themselves and bring back the files when data is missing. To this layer it doesn't really matter whether it's a file or anything else; it treats everything as blocks. It's just an abstraction over a group of files, or, if a file is bigger, the file can be broken into chunks, the chunks replicated somewhere, and brought back together later.
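A small sketch of that bookkeeping: append files into a fixed-size block group and record each file's offsets plus where the group is replicated; sizes are illustrative MB figures and server names reuse the earlier examples:

```python
class BlockGroup:
    SIZE = 10                                    # assumed block-group size, MB

    def __init__(self, gid, primary, replicas):
        self.gid, self.primary, self.replicas = gid, primary, replicas
        self.cursor = 0
        self.index = {}                          # file_id -> (start, end) offset

    def append(self, file_id, size_mb):
        if self.cursor + size_mb > self.SIZE:
            raise ValueError("block group full; seal it and open a new one")
        self.index[file_id] = (self.cursor, self.cursor + size_mb)
        self.cursor += size_mb                   # append-only: never rewinds

bg = BlockGroup("bg-7", primary="11", replicas=["R212", "R313"])
bg.append("uuid-1", 1)                           # a 1 MB image
bg.append("uuid-2", 2)
print(bg.index)   # {'uuid-1': (0, 1), 'uuid-2': (1, 3)}
```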
The next thing we need to understand is the partition layer, specifically the partition manager; we already know the functionality of the partition servers. In the diagram there is a ZooKeeper, though you can use any lock manager or coordination manager. The reason for having it is that, as I already mentioned, one or more instances of the stream manager could be running, and we have to make sure exactly one stream manager per cluster is up and active at a time, because we don't want two stream managers managing the same streams. You can use ZooKeeper to decide who the master is. Similarly for the partition manager: we always need exactly one active partition manager, with another standing by for reliability, and that can also be handled with ZooKeeper; specifically, you can use sequential znodes to identify which instance is the primary partition manager or stream manager.

The primary responsibility of the partition manager is to assign each partition server a specific range of partition IDs. Using ZooKeeper, the partition manager also makes sure there is only one server handling any given range, because it becomes problematic if two or more partition servers are assigned the same range. The other important thing the partition manager does is keep checking the load and health of the partition servers. If some partition server goes down, it immediately assigns another partition server and updates the partition map table, so that when the API server looks it up, it finds the currently active partition server. And while tracking load and health, the partition manager may figure out that one partition server is heavily loaded because, for some reason, more writes are falling into its range of 0 to 100; the APIs slow down, the latency increases, and we can't meet our SLAs. Usually we choose a partitioner so that load is equally distributed, but hotspots can still happen. In that case the partition manager should split the partition in two: instead of one range from 0 to 100, it creates two ranges, 0 to 50 and 50 to 100, and a new partition server is assigned for one of them, with a new stream created for it. Now the load is balanced between those two partitions and we can handle the same amount of traffic in much less time, so we can still meet the SLAs. As part of that, the partition manager requests the stream manager to create the new stream, a new file server is added, and the other tables are updated accordingly.

One more interesting thing you need to understand: there are cases where users access the same file many times. Is it worth going through this whole cycle every time to read the data? There is a caching layer as well: if requests keep coming for the same file, the file gets cached in the cache layer, and when the next request comes in, the API server reads the data from the cache and serves it back without touching any of the lower layers. And there is one more concept called one-hit wonders: it means we should not cache a file on its very first access. This is true anywhere, not just here; it's usually a bad idea to cache on the first API call, because a lot of these calls happen only once in a day, and if you sense that kind of situation, caching just wastes cache storage. Instead you keep a counter, and only if requests for a specific file exceed some threshold, say five, do you cache it; that way you don't waste the cache storage, and you definitely cache only the files that have a high access rate. That's one more strategy you can consider here.
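A tiny sketch of that one-hit-wonder rule: only admit a file to the cache after it has been requested more than a threshold number of times (five, as in the example above):

```python
from collections import Counter

THRESHOLD = 5
hits = Counter()
cache = {}

def get(key, read_from_storage):
    if key in cache:
        return cache[key]                 # served from the cache layer
    data = read_from_storage(key)         # full path through the lower layers
    hits[key] += 1
    if hits[key] > THRESHOLD:             # popular enough: now worth caching
        cache[key] = data
    return data

for _ in range(7):
    get("bucket/cat.jpg", lambda k: b"...bytes...")
print("cached:", "bucket/cat.jpg" in cache)   # True after the sixth request
```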
The other thing to understand is how the partition manager's data and the stream manager's data are replicated to the other cluster, the one the cluster manager treats as the replication cluster. Everything we save here is also replicated to that cluster, so the same information is available in those clusters for read operations. If something happens to this cluster and it is no longer functional, the cluster manager immediately changes the DNS entry to the other cluster, so all the writes are handled by the other cluster. Even while everything is fine, we can always point the reads at the other cluster if we configure it that way: all the writes go into this cluster, and read requests can be served from the other place. The only thing to understand is that since we are working with an eventual consistency model, even though a write happens here, by the time it replicates, a read happening on the other cluster might return the old data. That's actually how S3 also works; if the users are fine with it, we can enable that, and otherwise we can always use this cluster itself for both reads and writes and switch over to the other cluster only when something happens to this one.

I guess I have covered all of the important cases. If you guys liked this video, please hit like, share, and subscribe. Thanks a lot, and if you want to buy me a cup of coffee or a dinner, you can always join this channel; it just costs a dollar a month. Thank you.
Info
Channel: Tech Dummies Narendra L
Views: 46,363
Rating: 4.9416909 out of 5
Keywords: Amazon interview question, interview questions, interview preparations, algo and ds interview question, software interview preparation, developer interview questions, Facebook interview question, google interview question, Technical interview question, software architecture, system design, learn System design, cloud storage system design, distributed storage system design, s3 system design
Id: UmWtcgC96X8
Length: 52min 47sec (3167 seconds)
Published: Mon Jun 08 2020