Rate Limiting system design | TOKEN BUCKET, Leaky Bucket, Sliding Logs

Captions
hello everyone, my name is Narendra, and in this session let's learn rate limiting algorithms, which also work in distributed system setups.

Scenario 1: consider you have an application deployed on a server and it is working fine. One day you see a sudden increase in traffic, say ten times the normal traffic, and all of it is from bots.

Scenario 2: consider you have a product that provides APIs for machine learning, report generation, or weather information, and you want to make money from these APIs. You have production, enterprise APIs, but you also want to let developers try the APIs first before they purchase them. For that you need to allow developers to access a developer tier of the API, say ten requests per minute or some small quota per hour, just enough for them to try it and integrate it with their application before they buy it.

How do you do this? For both scenarios the solution is a rate limiter. Rate limiting protects your APIs from overuse by limiting how often a user can access them. Once a user's quota is exhausted, further requests are dropped or rejected. Rate limiting is important for several reasons or benefits.

User experience: if some users are overusing the APIs, knowingly or unknowingly, the other users get affected. If you want to maintain quality of service, you need to rate limit those users.

Security: people might try to brute-force the login API, promo code APIs, booking APIs, and so on. You need to rate limit to protect yourself from being attacked.

Operations: say your application is set up for auto scaling, or you use a pay-as-you-go service. If users bombard your service with requests, just for fun or by mistake, your servers will automatically scale out or your pay-as-you-go bill will increase. To keep operational costs under control you also need to rate limit.

So at what levels can we rate limit, or what kinds of rate limiting can we do? The first is user-based rate limiting, in which for a given user you decide how many requests to allow per minute or per given duration. The second is concurrent rate limiting, in which for a given user you decide how many parallel sessions or parallel connections are allowed; its main advantage is that it helps mitigate DDoS attacks. The third is location- or IP-based rate limiting, which helps when you are running a campaign or an event dedicated to a particular location: you rate limit all locations except the one you are targeting, allowing only a few requests from the others, so the targeted location gets a very good quality of service. The fourth is server-based rate limiting. This is a more unusual case, but it comes in handy when a server is dedicated to a certain kind of service: you can rate limit the different services that server provides, which lets you enforce rules on a per-server basis.
So let's see what different algorithms are available to do rate limiting.

The first algorithm we are going to look at is the token bucket. Consider we are limiting our API to five requests per minute. How does the token bucket work? For every unique user we track two things: the last time at which a request was made, and the number of tokens available. In this case, since we allow five requests per minute, we start with five tokens. All of this information is stored against the user ID; here I'm going to take U1 as the user ID. Usually we use something like Redis as the token bucket, because it is in memory and fast to access. Every time a request comes in, the rate limiting software should do two things: first, fetch the token data for that user, and second, once it knows there are enough tokens left and serves the request, update the token count.

Let me explain how it works. Say a request comes in at 11:01:10. The rate limiter goes and checks the data for user U1: last access at 11:01, five tokens available. That record tells us we have tokens left for this minute, so we can serve the request, and afterwards we update the token count to four. Consider the same user U1 makes one more request at 11:01:25. I forgot to mention that when we update the record we also update the last access time, so after the first request the stored time is 11:01. When the second request comes in, the rate limiter again checks the data for U1: last access 11:01, four tokens left, so we still have tokens to serve. We serve the request and reduce the count to three. Within the same minute, 11:01, we can serve another three requests; after serving them the token count becomes zero.

That was the situation where we get five or more requests per minute. There is also the possibility that we don't: consider we only got two requests, at the 10th and 25th second, so we still have three tokens left. The next request comes in at, say, 11:03:10. In this case the rate limiter again accesses the token information for user U1 and sees that the last access time was 11:01. That means this is the first request in the current minute, so we need to refill the bucket at this point: update the time from 11:01 to 11:03 and reset the tokens to five, because the allowed rate is five requests per minute. Since we now have five tokens we can serve this request and decrease the count to four. If the next request comes in during the same minute, say at 11:03:15, we again check the record, see four tokens, serve the request, and decrease the count from four to three.

We keep doing the same thing for each and every user; that way we know how many tokens are left and how many more requests we can serve. When the token count becomes zero and more requests come in during the same minute, say at 11:03:22, we fetch the record, see that the count is zero, and we can't serve the request, so we drop it; any further requests in the same minute are dropped as well.

This algorithm is memory efficient, as we save only a small amount of data per user. But the problem is that in a distributed environment it can cause race conditions: if two requests from the same user arrive at two different application servers at the same time, both might try to update the same record, and that can cause problems.
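As a concrete illustration, here is a minimal in-memory sketch (in Python) of the token bucket behaviour described above, refilling at the start of each new window; the class and variable names are my own and not from the video, and a real deployment would keep this data in something like Redis rather than a local dict.

```python
import time

class TokenBucket:
    """Minimal per-user token bucket: `capacity` requests per `window` seconds."""

    def __init__(self, capacity=5, window=60):
        self.capacity = capacity
        self.window = window
        # user_id -> (window_start_of_last_access, tokens_left)
        self.buckets = {}

    def allow(self, user_id):
        now = time.time()
        window_start = now - (now % self.window)
        last_window, tokens = self.buckets.get(user_id, (window_start, self.capacity))

        # A new minute has started since the last access: refill the bucket.
        if window_start > last_window:
            tokens = self.capacity
            last_window = window_start

        if tokens > 0:
            self.buckets[user_id] = (last_window, tokens - 1)
            return True   # serve the request
        self.buckets[user_id] = (last_window, tokens)
        return False      # quota exhausted: drop the request

# Usage sketch: limiter = TokenBucket(); limiter.allow("U1")
```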
The next algorithm is the leaky bucket. How does that work? Consider a bucket that can hold, say, three requests at any given point in time. Requests come in at the top, and on the other side of the bucket we have a request processor that processes requests at a constant rate, say one request per second; the exact rate is up to you and your capacity calculation. The bucket behaves like a queue: first in, first out.

When requests come in and the queue is empty, the first request comes and sits in the bucket, then the second, and so on, while the request processor keeps draining them from the other end. If more requests keep coming in, the bucket fills up immediately, before we can even process them, and the extra requests, say the fourth, fifth, and later ones, simply overflow the bucket; we can't even allocate space for them. Once the first request is processed on the other end, one slot frees up: request two and three move forward and a new incoming request can take the empty slot, even though the earlier overflowing requests were already spilled out of the bucket and dropped. The processor then keeps draining the queue continuously.

Since the other end of the bucket processes requests at a constant rate, at any given point in time we serve at most that rate, no matter how bursty the incoming traffic is. Even if we get bursts of thousands of requests at once, our servers are not affected, because this design smooths out the burst traffic.
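Here is a small sketch of the leaky bucket as a bounded FIFO queue drained at a constant rate, along the lines described above; the background drain thread and all names are assumptions made purely for illustration.

```python
import queue
import threading
import time

class LeakyBucket:
    """Bounded FIFO queue drained at a constant rate (requests per second)."""

    def __init__(self, capacity=3, drain_rate=1.0):
        self.bucket = queue.Queue(maxsize=capacity)
        self.interval = 1.0 / drain_rate
        threading.Thread(target=self._drain, daemon=True).start()

    def submit(self, request):
        try:
            self.bucket.put_nowait(request)   # request waits in the bucket
            return True
        except queue.Full:
            return False                      # bucket overflow: drop the request

    def _drain(self):
        while True:
            request = self.bucket.get()       # take the oldest waiting request
            print("processing", request)      # stand-in for the real handler
            time.sleep(self.interval)         # constant processing rate

# Usage sketch: lb = LeakyBucket(); lb.submit("req-1")
```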
The next algorithm is the fixed window counter. This is the simplest algorithm: we have a counter, we keep incrementing it for every request, and when the counter exceeds the rate limit we drop all further requests. Say we want to allow ten requests per minute (not per second). How do we implement it? Say we are given Redis: we keep a key made up of the user ID and the window, where the window is a minute, for example U1 at the 11:00 minute, and the counter starts from zero. When the first request comes in we increment the value from zero to one, and we keep doing this until it reaches ten. Once it reaches ten, we drop all requests that arrive in the same minute, that is 11:00. When the next minute starts, the key becomes U1 at 11:01, the counter starts again from zero, and for each request the number keeps incrementing until ten, after which requests are dropped again. It is memory efficient, because per user and per window we only store a key and a number that increments up to the rate limit.

But there is one problem. Say you have windows at the 0th minute, the 1st minute, and the 2nd minute, and we get a burst of requests at the very end of a minute, around the 55th or 59th second, say about ten requests, and the burst continues into the beginning of the next minute, where we accept another ten requests. What we tried to achieve was ten requests per minute, but we actually accepted ten requests at the end of one window and ten more at the beginning of the next, so roughly twenty requests within a span of five to ten seconds. The server is overloaded, and the traffic is not smoothed out the way it is with the leaky bucket, which buffers requests in a queue. This happens because we accepted ten requests at the end of the minute under the 11:00 window and another ten at the beginning of the next minute under the 11:01 window. That is bad, because we always wanted to stay at about ten requests per minute.
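A minimal sketch of the fixed window counter keyed by user and window follows; it uses an in-memory dict where the video uses Redis (where this would typically be an INCR on a per-user, per-minute key with an expiry), and the names are my own.

```python
import time
from collections import defaultdict

class FixedWindowCounter:
    """Allow at most `limit` requests per user per `window`-second window."""

    def __init__(self, limit=10, window=60):
        self.limit = limit
        self.window = window
        self.counters = defaultdict(int)   # (user_id, window_number) -> count

    def allow(self, user_id):
        window_number = int(time.time() // self.window)
        key = (user_id, window_number)     # e.g. ("U1", minute index)
        if self.counters[key] >= self.limit:
            return False                   # limit reached: drop the request
        self.counters[key] += 1
        return True
        # Note: a real implementation would expire old window keys (e.g. Redis EXPIRE).

# Usage sketch: fw = FixedWindowCounter(); fw.allow("U1")
```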
Now that we know the problem with the window-based counter, how do we solve it? Instead of calculating the rate per window, let's calculate the rate in real time. The algorithm for that is sliding logs. Here we use a hash table, Redis, or anything that can store key-value pairs. The key is again based on the user ID, and just the user ID; we don't need any time information in the key. Against that key we keep a big list: for every request that comes in from this user we append a log entry to the list, and each entry contains the timestamp at which that particular request came in.

Now, for every request that comes in, we need to access this list and filter out all the entries older than one minute, because our rate limit, say ten requests per minute, is defined over a one-minute window. So we sort the list and remove all entries older than one minute. Out of the remaining entries we count how many requests have been served in the last minute. Say we count seven entries belonging to the last minute. No matter when we are looking, say at 10:30:30, we look back exactly to 10:29:30, so the delta is always exactly one minute, and we get a real-time count for that window, calculated right at the moment the request comes in. Since we have seven, we can still allow three more requests. If instead we had ten or eleven entries in the last one-minute window, then we have already served the limit and we can't serve any more. If we keep doing this for every request, we always know in real time how many requests we served in the last minute, and we compare that against the rate limit: if the number of entries reaches the limit, we don't serve the request.

Everything is good, but there is one problem here. Look at the data we store: for each user, or each API of each user, we store roughly as many entries as the rate limit, so if the limit is ten we are saving up to ten entries per user in the list. If there are millions of users accessing the system, we will be keeping millions of users times the rate limit entries in the hash table or in Redis. That is bad, because we are consuming a lot of memory.
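Here is a rough sketch of sliding logs, keeping a per-user log of request timestamps and counting only those from the last minute; the deque-based structure and names are my own choices for illustration.

```python
import time
from collections import defaultdict, deque

class SlidingLog:
    """Keep a per-user log of request timestamps; count the last `window` seconds."""

    def __init__(self, limit=10, window=60):
        self.limit = limit
        self.window = window
        self.logs = defaultdict(deque)     # user_id -> timestamps of past requests

    def allow(self, user_id):
        now = time.time()
        log = self.logs[user_id]
        # Drop entries older than one window; the rest are the real-time count.
        while log and log[0] <= now - self.window:
            log.popleft()
        if len(log) >= self.limit:
            return False                   # already served `limit` in the last minute
        log.append(now)
        return True

# Usage sketch: sl = SlidingLog(); sl.allow("U1")
```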
How do we solve that? Let's learn the sliding window counter. This is an optimization that is almost the same as the sliding logs algorithm, but we optimize on the memory side so it consumes less memory. How do we do that? In sliding logs we stored one entry per request in the list for the last minute. Here we do the same thing, but instead of saving a separate entry for every request, we save fewer entries: if two requests come in within the same second, we just increment a counter on that entry.

Again we use a hash table or Redis, with the user ID as the key, and against it we keep a list of entries. Say we get a request at 11:30:01; we add an entry with that timestamp. If we get one more request at the same time, that is the same second, instead of adding another entry we store the timestamp along with a counter, so the entry for second 01 now has a count of two. If a request comes in at second 03, we add one more entry with that timestamp and a count of one. If another request arrives at the same second 03, we increment that counter to two, and if, say, five requests arrive in that second in total, the counter becomes five. We keep adding entries like that.

For each request we also sort and filter out the entries older than the last minute, just like we did in sliding logs: we take all the entries in the list, keep only those from the last one minute, and discard the rest. Say we discarded one entry because it was beyond the last minute, and the remaining three entries have counters 5, 4 and 1. We sum the counters inside these entries: 5 + 4 + 1 = 10, which means we have served ten requests in the last minute, which is equal to our rate limit, so we can't accept any more requests right now. When one more request comes in later, we do the same calculation again; if by then the entry with count 5 has aged out of the window, the remaining counters sum to 5, which means we can accept another five requests in that window. So on every request, if the summed count reaches or exceeds the rate limit we drop the request, otherwise we serve it and record it.

If you look at the number of entries we keep, it is reduced substantially, because we have a counter per second instead of one entry per request. In sliding logs we stored up to as many entries as the rate limit; now we only have a couple of entries with counters inside them. That is how this is a memory-optimized solution compared to sliding logs.
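A possible sketch of the sliding window counter, bucketing requests per second and summing the counters over the last minute, follows; the exact data layout is an assumption, not the video's implementation.

```python
import time
from collections import defaultdict

class SlidingWindowCounter:
    """Like sliding logs, but bucket requests per second: {second: count} per user."""

    def __init__(self, limit=10, window=60):
        self.limit = limit
        self.window = window
        self.entries = defaultdict(dict)   # user_id -> {second: request count}

    def allow(self, user_id):
        now = int(time.time())
        buckets = self.entries[user_id]
        # Discard per-second buckets older than the window.
        for second in [s for s in buckets if s <= now - self.window]:
            del buckets[second]
        served = sum(buckets.values())     # requests served in the last minute
        if served >= self.limit:
            return False                   # limit reached: drop the request
        buckets[now] = buckets.get(now, 0) + 1
        return True

# Usage sketch: swc = SlidingWindowCounter(); swc.allow("U1")
```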
So far we have learned a few algorithms to solve the rate limiting problem. These algorithms work well on a single-server setup, but when we have a distributed setup with multiple nodes, multiple app servers, or multiple rate limiter services spread across different regions, they won't work as-is. The problems are, first, that we will face inconsistency in the rate limit data, and second, race conditions.

In this setup, one part of the system is in a region we'll call R1 and the other part is in region R2. For every region we need a load balancer, and a rate limiter coupled to it (I've drawn it separately just to explain), plus an app server for that region. Then there is Redis or Cassandra, which holds the rate limit data we discussed in the previous algorithms: that key-value data is saved in this store. We use a shared store because we want to enforce a global rate limit; if we didn't have it, each region would save the data locally, which is only local rate limiting, whereas we want to control the rate limit for the system as a whole.

With a high level of concurrency, requests from the same user might be forwarded to different regions in the same second. Both requests go to different load balancers at practically the same time, reach the rate limiters, each rate limiter decides based on the data it reads from these nodes, forwards the request to its app server, and we get the response back. So where is the inconsistency? Suppose we are only supposed to allow five requests per minute for this user, and in both nodes the stored count for user U1 is currently four. Now two requests are fired from the same user, both reach their load balancers, which forward them to their rate limiters. Each rate limiter queries Redis or Cassandra and gets the latest count, four, which means it can serve one more request, so each of them forwards the request to its app server. Essentially we served six requests, but the rule was five requests per minute. That is the inconsistency.

How can we solve this problem? One way is sticky-session load balancing: for a given user we always, no matter what, route the requests to the same load balancer, rate limiter, and application server, so requests from user U1 never go to a different region or a different rate limiter, and there is no confusion. The problem with sticky sessions is that it is neither a well-balanced nor a fault-tolerant design: when all requests from the same user go to one app server, the load on that server increases substantially while another server sits idle, and we keep bombarding that one server with user U1's requests. So ideally we avoid sticky-session load balancing.
One more way to solve this problem is using locks. How do we protect against the race condition I just explained, the one that causes the data inconsistency? By using locks: whenever a rate limiter accesses the data for a particular user, it locks that record, and only after it has read and updated the data does it release the lock. Say a request from user U1 reaches the load balancer and is forwarded to the rate limiter in one region. If this rate limiter wants to access user U1's data in Redis or Cassandra, it puts a lock on that key, so that if any other rate limiter, say the one in the other region, tries to access user U1's data, it can't, because the lock is held by the first one. That way we protect against the race condition and the inconsistency.

The problem with this is that it adds extra latency to the round-trip time of the request. Let's go back to the same scenario, where user U1 fires two requests in parallel and they go to different regions: both hit their load balancers and are forwarded to their rate limiters. Whichever rate limiter reaches the store first acquires the lock and fetches the data; while it holds the lock, the other rate limiter has to wait to access that same piece of data. That can add on the order of a hundred milliseconds or more of latency to the second request's round trip, which affects the response time seen by the user in that region. So that is again a problem.
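As a rough illustration of the locking idea (not the video's exact implementation), here is a sketch that uses Redis's SET with the NX and PX options as a simple per-user lock around the read-and-update of the counter, via the redis-py client; the key names, limits, and timings are assumptions, and a production setup would need a safer lock release (token check), retries with backoff, and timeouts.

```python
import time
import uuid

import redis  # assumes the redis-py client and a reachable Redis instance

r = redis.Redis()
LIMIT = 5      # allowed requests per minute, per the example above
WINDOW = 60

def allow(user_id):
    lock_key = f"lock:{user_id}"
    count_key = f"count:{user_id}:{int(time.time() // WINDOW)}"
    token = str(uuid.uuid4())

    # Acquire the per-user lock: SET key value NX PX (fails if already held).
    while not r.set(lock_key, token, nx=True, px=200):
        time.sleep(0.01)            # another region's rate limiter holds the lock
    try:
        served = int(r.get(count_key) or 0)
        if served >= LIMIT:
            return False            # quota exhausted: drop the request
        pipe = r.pipeline()
        pipe.incr(count_key)
        pipe.expire(count_key, 2 * WINDOW)
        pipe.execute()
        return True
    finally:
        # Simplified unlock; a safe release would verify `token` atomically (Lua).
        r.delete(lock_key)
```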
So how do we handle this? We can try a couple of different strategies, but none of them works perfectly. One is relaxing the rate limit. This is not a real solution, but it keeps you happy: relaxing the rate limit means that if, say, the limit is 100 requests per minute and a few extra requests slip through, say 102 or 103 in a minute, that's okay; we accept that 1% or 2% of requests escape when truly concurrent calls happen across multiple regions. If you don't want to do that, there still isn't a perfect answer, because in a distributed system there is always some latency when syncing data between the nodes, and that is exactly when the race condition occurs. For example, one request arrives in one region, gets served, and that node updates the count from 4 to 5; before that update is synced to the other node, another request from the same user arrives there and reads the count as 4 instead of 5. This kind of problem exists because of the sync latency between nodes, and we can't do much about it.

There is still a small performance optimization we can do. Instead of the rate limiters in each region talking to the Redis or Cassandra nodes every time a request arrives, each rate limiter can keep the rate limit data in local memory. Then, instead of fetching the data from the Redis node on every request, we fetch the data that is available locally. Obviously we can't always rely on the local data alone, so we also need one more service whose job is to frequently sync the local in-memory data with the shared nodes, asynchronously, in both directions: local data up to the nodes and node data back into local memory. With this, the rate limiters don't constantly hit the shared store; the updates happen through the sync service instead. We will miss some requests, we won't be serving exactly 100 requests per minute, we might serve 102, but that's okay, because we are making the performance much better.

If you liked this video, please subscribe and tell your friends to subscribe to my channel. Leave a comment if you have any suggestions for improving the video, and if you have a request for me to make a video on a specific topic, please comment, like, and share. Thank you.
Info
Channel: Tech Dummies Narendra L
Views: 164,716
Keywords: Amazon interview question, interview questions, interview preparations, algo and ds interview question, software interview preparation, developer interview questions, Facebook interview question, google interview question, Technical interview question, Token Bucket, Leaky Bucket, Sliding Logs, Sliding window counters, Scalable Rate Limiting Algorithms, rate limit algorithms, system design rate limit
Id: mhUQe4BKZXs
Length: 35min 55sec (2155 seconds)
Published: Wed Sep 26 2018