Distributed Locks | System design basics

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
multiple key for a single lock it's common but can we have multiple locks but one key yeah it's also possible all we need to do is manufacture multiple lock and configure this key to work with those locks now we can distribute this locks everywhere and then we can access the information or objects or things which are locked is in this just one key and intercession let's learn how to design distributed locks before we go into design we had to also learn the basics of locks new tricks and simha for you guys would have learnt all of these things in your academics right let's go back and remember all of this stuff locks and matrix are little bit similar all we need to remember is locks are used to lock critical section from threats of the same process usually but whereas mutase allows us to lock a critical section or an object or a file or anything from multiple processes or threads which are running in multiple process that's the basic difference and coming to semaphore there like to are multiple types of symbol for the basic thing which we which we use sever for ease to are no particular group of accesses or threats to access a particular critical section and also there are something called as counting some of course where these are used to count the number of accesses who are accessing the critical section or a practical particular resource one example is in chrome per horse name we can have maximum of six or seven connections only we can actually use semaphores to count how many connections are open for that particular host name from that given tag whereas in mutex the good example will be say there are multiple processors running in your machine or operating system in which all of these threads or processors are trying to log information on log the log lines in one file that's when we need to lock the file right or read to do anything without further delay let's go toward whiteboard and start designing the descriptive lock back in 1990s every website used to run from a single powerful machine we used to scale vertically that means we used to upgrade the same machine to make it much powerful to serve lot of traffic or to store more data in it and these days because not many people are using Internet services we can't just do the same thing now we have to do a horizontal scaling that means that we have multiple servers working for the same purpose or for the same website say for example I have a website I can't just serve information for the same website from the same a single server if more number of people are coming into my website obviously they are saving more amount of data that means that I'll have to do horizontal scanning request can be served from any of this machine we usually call these kind of systems as distributed systems let's take an example and understand how exactly distributed systems work and why exactly we need distributed locking mechanism in this example let's take there are four different instances which are available in our a distributed system which is which we have for a specific website this machine is in different part of the world and this machine is also in different place say for example this is in America and this is an Asia somewhere and this is an Africa and this is an Australia okay now what these machines are trying to do is we have some data which is stored at one place say for example these are are some long files or something ok and what what is the job of this machines is to rate these log files and figure out some some information or some useful information and write back those information into some other place into some different files say for example each individual files output will be sorry to each individual files here say for example I have three files here one dot data two dot data and three dot data and the output will be also say we were not out to dot out and three dot okay okay now and an ideal situation every instance will pick up one on file and then they processes it and then every instances five saves the output into respecting Vice name dot out say if this guy picks up one dot data and he writes in toward on how and if this type except about data and then he writes to rot Wow now everything works fine but not always now what if this machine picked up one dot data file and this guy this instance which is in Asia also picked up the same file that is now both are working on the same file itself say that is one dot data and these guys are working on wonder data and once they finish it they both write the air put into the same file we sense a large problem here this is unnecessarily two machines are working on the same file it's unnecessary right that's what the first problem is how to make the system efficient we don't want to machine to become the same file and process it and give the output it's fine it's not just fine it says yes one machine should pick up that file and then process it it it's not okay for two machine to work on the same file because it just increases the server cost and it just takes more time to finish our job and the next problem is integrating what as integrity means it's the correctness of the output or correctness of the file which is between our work since these two files these two instances are working on the same input file now the output will also be written into the same file that is wandered out now the problem here is this case also writing the same file and this guy also now there are high likely chances that this file gets corrupted once this guy at this file this out profile is cut it is totally unusable now we have already used to machine to process the same file and end of the day we also got the character output and this is real bad that's what we are trying to solve the integrity okay that means that the only solution to solve this problem is to lock this file as soon as this guy picks up this file he should do not this file that I am using it so this guy he comes in to pick up the same file but does that mean get to know that someone has locked this file so he must be processing what he does this even instead become this file or this file are they near the face which is unlock so everything works fine and also since this file is not processed by multiple machines it there are no chances that the output file will get corrupted so the only solution is lock and now you must be thinking yeah we can just use simple lock and then solve this problem okay I know what you guys are thinking so we have one machine okay where this machine when it picks up a file he also makes an entry in the database or a file or anywhere I'm going to calluses say I won as an entry into the database that he is working on one more data and when this guy comes in he also does the same thing I too is working on to pay down and when this guy comes in he checks is anyone working on three dot data over here and then if it is free he picks up and the works and also adds an entry here I treat read your data and everything works fine again but what if what if this machine goes down I can't we have the same problem the efficiency problem comes into the picture and integrity also comes into the picture that means that we have added single point of failure this machine if it goes down again all of this data is corrupted what if it takes one day to one-five say we are making like some machine learning model seyforth and we're running some adenine or some image classification something like so one day worth of computation is gone we don't want this law to be single point failure and also if you use database it will be you know it will not be that efficient because there are too many writes and reads database is not that efficient instead we can actually use in memory system something like radius or memcache basically we're trying to use cache over here and again when this instance failed we again it's the same problem for that what we need to do is we need to distribute this locks system as well basically we have more machines which are actually working for the same purpose that is for the purpose of blocking so we should have a lock manager where we have one or more machines which are running and it also replicates the same data here I think maybe I see promise ly so every guy comes to this machine and then as initely and i synchronously we are actually sitting the data back to its slaves something like slave like as soon as this is the accidentally he picks up the files only starts to work and also this information is slowly replicated back to the slaves or the other machines say I one dot this information is allocated in this information is replicated and this one information suffocated or maybe we can even think like this machine is placed somewhere near to Africa and this machine is placed somewhere near Asia what problem we we face now is because these machines are also now far away they are not in same data center it it will take lot of time to sink the information between these machine there's there's no single point of truth if you have one machine the problem is single point of failure if you have multiple machine which are a synchronously synchronizing the information which are written to them there is no single source of truth how do we tackle out of this problem say for example if this machine it goes down and if this information is still not replicated the same problem might occur more machines comes in and they they see that but since this machine is down they might connect into this machine and then they see that no injuries are there because this machine went down even before when this information was sent and again we we face the same problem we face the efficiency problem and we also face the integrity problem all of these problems are there so we have to design our system in a way it are it actually answers all of the problems I just mentioned before designing our distributed lock we also need to define the properties our log first one is mutual exclusion or moodle's we don't want two or more threads or processes or instance store quite the same log and the second one is deadlock free we don't want our systems to wait for lock which the opposites are holding we don't want force to happen that means that our system should be fun tolerant as I have already explained to eliminate the single point of failure or to make the system fault tolerant we included more number of result that is nodes right now the problem is these nodes are distributed also the sync is happening we are sick or if this machine goes down the information now it's not available what if we don't have these machines it's a single point of reference they might chances of violating the mutual exclusion and also there might be possibility of tear off we don't want anything to happen this way so what we going to do is we will design our system in such a way it should adapt to all of these properties which we have defined now instead of building the entire system at once let's take it little slowly and then see the different understand and then we evolve the system now for the initial design we have block manager we have cash we can we can have DB also the only reason which we are using cash as I mentioned East I would faster read and write and we have two instances which are trying to acquire box and release the dogs because these guys are working on something which it needs loss say for example now the distance one tries to get a lock he sends a request to the lock manager I need a lock okay and this day returns block but before doing that he has to place lock into the cache what he does this he writes the information into the cache safe lock and then he goes back and then use the lock to this instance one and it's good he is working working working and once the work is finished he would come back and then say stood lock manager my work is finished facing was alone he will remove the lock now ok and now this instance and all the instruction infractions are finished and when this instance comes I'll ask the loft manager and the same thing happens he goes in the in case and he locks it and then he turns back I don't think is working fine we still have the problem of single point of failure will solve that later now let's talk about other scenarios as we consider this guy has acquired the lock and he goes back and then rice out value or a key lock so we know that there is someone hazard work when this guy goes and has to lock manager before even this instance releases the lock he can't give it he says that no you can't get a lot his weight obviously the weights and then he comes back and then still the lock is not terrible and he goes back and then he comes back and then same thing happens what if this guy is taking so much time to finish some job that he's not even releasing block it's not good because not just one instance if you have more instances in our system everyone will be busy we can't just make make one system to acquire that clock so what we do is let's put let's do one thing instead of just mentioning log since it is cached we can always specify the detail let's change our call in the lock manager to also specify the TTL say for example we specified maximum of five second a lock can be hold headed to a specific instance but the same thing works well is the asked lock and then he gets it and he's working now even though if he doesn't release the lock within five second the lock will be automatically released by the system so this lock is vanished of the five second and anyone can later access the lock even if this guy didn't release and everything is fine now but there are other problems ok now that we learned we need TTL to ensure our locks are proper so I'm going to mention all the features here that is we need teach you that we have already told the loss manager to put the details so now let's go for the second scenario so this day acquires the lock request and then I quote is the lock so now we have detail also safe now he's working in between this guy comes in and then ask for the lock now the lock is given to the instance one so he won't get it and after 50 second he comes back and asks give me the block since five seconds has already elapsed and this guy hasn't released the Lord obviously the stock would have vanished and the slaw policies that yeah lock has not given to Haven so what you would obviously give it to lock and fire he gives dough instance to so this guy has got the lock this guy still hasn't removed lock after maybe seventh second this guy finishes his job and then he goes back to the lock manager and he says it is the lock or this guy does is blindly he releases the lock which was acquired by instance - and that is bad now after seven second the lock has released right and anyone can count acquired now because the mark is with this guy and he is still processing now there are high chances that system might perform in efficiently or they might be data corruption now we can't just have lock we need to have a productive mechanism to do that what we can actually do is instead of just mentioning block we can use time or we can use unique ID to mention the lock to create the lock say for example instead of just mentions month we can have lock underscore say 1 & 5 second TTL ok 1 means that knock one is a quite here so whenever he tries to this he will all be see the lock manager should check he is the same Rock which is long and you're feeling is then only I will release say let's say be the same singer and see whether we are safe or not he goes to lock and the lock manager lost and then he not through the name lock underscore one he says that your NOC name is mark underscore one so he goes back with log underscore one obviously is guy is a little slow he will take more than five second so in between this guy comes and asks for lock he says no which is locked I can't give you the lock obviously he goes back five second has elapsed and this guy hasn't come back to unlock it there comes this guy checks is the lock available so obviously this problem vanished after five second so he says that the unlock has been yeah lock is available and you can take it now this guy will give a different Rock than before because he can't give the same block now here we give knock underscore two and he also says long understood - with five second detail now - is available this guy and after seven second I mean since two second has elapsed and this guy after five second means to second he has taken extra so I sense and he comes back and the scissor sisters lock manager please release the lock and he strikes to release the lock long one since that block one is not even there he will not do anything he says that you're late and I'm not raise the lock and then everything is funny so this guy still has to knock - and no one can get the clock locked is still there and after 5 second it will automatically elapsed or if this guy finishes the job even before five-second TTL our time out he will go back and insist that release the lock - and the thing obviously is a lot - and it is working fine and all these problems are solved now more features need to add this add unique number along with the lock that way we have a unique lock created every time but the problem is who will give the unique ID so how do we generate random number say the random number generation will not happen on the instant side it should happen on the lock manager side consider lock manager generated random number what if on worst case we got same random number - time and after time mode he came back and to the left margin says that please visit all unfortunately there are times when that can happen so it's not safe to do that way also what we can actually use is we can use something like coordination services a zookeeper which maintains which can give different numbers random numbers or a range of numbers without conflicting but but the problem is we don't want to run this zookeeper or some other coordination server just for the lock module so you said what we can do is simply instead of taking a land or number we can take time up to say micro second precision so when you take a poke time for example with microseconds precision most likely since we have a single lock manager we won't get the two decades so the unlock is kind of safe or is it safe it's kind of safe let's take if not micro second we can take maybe nano second position so it is still safe for now let's take our time precision you know millisecond okay so we have the time millisecond and when we are knocking instead of having the unique number we just use the time milliseconds a lock have epoch time our time with millisecond recorded in here so we have a unique in a lot created for every time in the instances are asking to lock so that way when the instances comes back with a lock name we can just compare whether the instance came back with same lock and then we can remove or not okay now all these problems are solved but how will the problem of a single point of failure now out of this problems are solved now however the problem of single point of failure or how do we make our system fault tolerant for that as we earlier discussed we need to have multiple you know lock managers and this combination will say something like this is self is one manager instance so we need to have multiple of them so let's do one thing let's say we have few more say 1 2 so we have not manager to long manager 3 long manager for Nigel 5 to make our system work well what we do is let's say a block manager is all in the same data center this machine or any machine who tries to acquire rock instead of acquiring in only one machine instead he goes to multiple machine and acquired lot from all of this machine so he has to give me give the lock from this machine he also has lock manager to instance to give the lock he wants to ask this machine this machine and that machine so basically he is asking all of the five different instance to give the lock only if he gets five locks that means that he has got the lock and he can proceed with the work and when he releases also he has to tell all of these five instances to raise the law and the same mechanism happens setting time of setting the unique number and that we took it has a time of millisecond resolution so lock and release actually happened some more of the five machines but the program again here is we have hard could it be to health silos what if one machine goes down so we only get four logs at most so we can't proceed with the work so to solve that problem instead of taking lock from all of the machine let's configure our instance to acquire lock to n by 2 plus 1 that means that more than half of the machines say we have 5 machine that means that we should acquire at least and by troubies like 2.5 let's take ok at least 3 machines that is like reaching to the quorum or reaching reaching to the maximum modes so anybody who wants to acquire up should get 3 lakhs from any of this fine machine it's fine it's not this machine of this machine in any of this fine machine if this machine won't start quite he needs to get three words say for example if you have six machines then how many words we need to get 6 by 2 is 3 3 plus 1 is 4 so we need to get 4 volts to proceed with our work now how the system is fault tolerant now the good thing over here is any of this any machines out of these displacer of machines can go down we just need to always look for n by 2 plus 1 that's it of the machines which are alive say for example two machines went down anything this terminal extra it's fine we still have 4 machine anytime we'll check how many machines are available so that is n by 2 that is 4 by 2 is 2 2 plus 1 is 3 so we need to at least acquire 3 lakhs to proceed with our work so I just forgot to mention one consideration when you're actually using time millisecond along with to lock 2 things you need to remember since we are using so many machines were sort of machines just for the lock manner itself every instance is trying to acquire n by 2 plus 4 machines that means that every machine should have same date and time ok if one machine has some one-second delay or some millisecond offset that means that not all of the four locks which are required doesn't will not be of the same lock lock underscore the same time right so we we should have all of these machines tuned to to to give the same data time to the position of millisecond but one more and also another thing is when say for example if this instance is trying to acquire say for walks maybe he is making a parallel call I mean concurrent calls to all of these machines out of which he is trying to get four locks not always that the request will go at the same time so not exactly every time time underscore millisecond will be exact time there can be little difference right so we also should consider that all of this for doc have a some sub millisecond tolerance even if the logs are off by some millisecond then it's fine it's it's like we have got the lock I think this is the higher-level idea of how we can actually develop you know distributive laws and there are so many different algorithms there is chubby algorithm by Google and there is there are tons of parallelism so ever this is one way to tackle this problem and I just wanted to give you guys the idea of how we can solve this problem I think yeah this is a short video I think yeah average to end of the video and there are so many libraries from Redis itself it's called as a red block you can you guys can go and try that algorithm if you want and yeah there's the end of the video and as usual if you guys like the video please like subscribe to the channel and share it with your friends and yeah comment and give solutions if you have any thank you
Info
Channel: Tech Dummies Narendra L
Views: 85,940
Rating: undefined out of 5
Keywords: interview questions, interview preparations, software interview preparation, developer interview questions, Facebook interview question, google interview question, Technical interview question, software architecture, system design, learn System design, system design tutorials, distributed locks, design distributed locks, design locks, distributed coordination, redlock, redis lock, scalable lock, software lock
Id: v7x75aN9liM
Channel Id: undefined
Length: 28min 50sec (1730 seconds)
Published: Thu Jan 17 2019
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.