Handling timeouts in a microservice architecture

Captions
So, microservices are great, but they come with a warning label that says "handle with care," and one thing you definitely need to handle with care is timeouts. In this video, let's talk about what exactly the problem with inter-service communication is that requires us to have timeouts in place, and discuss five approaches that will help you handle timeouts better and build a robust microservices-based architecture.

Microservices are great because you get a clean separation of concerns: each service focuses on only one thing. That gives you flexibility and team agility, because you can hire a specific set of people to solve that one problem really well, which means scaling will not be a problem, and you are picking the best tech stack to solve that problem with no hacks involved. Everything's great; but one concern comes in when two services talk to each other: we need to handle timeouts.

Let's consider a scenario. Say we are building a search service that uses Elasticsearch to find the most relevant results for a given query, and the entities we are searching over are blogs. A user fires a search query; for that query, we use Elasticsearch to find the most relevant blogs, and we want to serve that result to the user. But we cannot send the response directly, because one attribute still needs to be filled in: the total number of views each blog has received to date, as of the moment the query was fired. To get that one attribute, the search service has to talk to the analytics service, which owns that information. Here, this communication
becomes synchronous: the user fires a search query, the request comes to the search service, the search service first computes the most relevant documents with Elasticsearch, and then it immediately fires a request to the analytics service saying, "here are the blog IDs; give me the basic analytics information for all of these blogs." The analytics service, which uses, say, MongoDB as its database, fetches that information and sends it back; the search service injects it into the response and returns it to the user.

The problem with this synchronous dependency shows up when the analytics service is not responding fast enough, or not responding at all. How long should the search service wait for a response? Should it wait forever? Of course not. It needs to wait for at most some maximum time, and if it does not get a response within that time, it has to take some action: either send the partial response it already has to the user, or choose to send no response at all. This is a very domain-specific decision, but you need to be deliberate about it: what if a timeout happens, and what should the behavior be in that case?

So let's understand what can actually go wrong during inter-service communication, because once we know that, we can decide on the approaches to take. The first failure mode: the request made by the search service never reached the analytics service. The search service thinks it made a request and is waiting for a response, but the request never arrived. This particular flow cannot happen when you have a TCP-
based connection, say an HTTP call to the analytics service: if you are unable to establish the TCP connection, you will definitely know about it through an exception. But with asynchronous communication, or under network congestion, it can happen that you think you fired a request and are waiting for a response, yet the request never reached the analytics service.

The second failure mode: your request reached the analytics service, and it performed the computation, but when it sent the response, the response never reached you, due to network congestion, a broken connection, and whatnot.

The third: the search service made a request, and the analytics service is taking too long to compute the response, maybe because it is having a database outage, maybe because it is doing a lot of heavy processing, maybe because its API servers are overloaded. There could be any number of reasons; so many things can go wrong when you have a synchronous dependency on another microservice.

Which brings us to the one thing we always have to take care of, no matter what: whenever you make a network call, add a timeout to it. Always use timeouts. Whenever you make a synchronous call to a service, you cannot and should not wait forever for the response to reach you. It should be: "I'll wait at most one second" (or two, or five, or ten, depending on your domain), and if you don't get a response by then, you declare a timeout and take whatever action you have decided on. I was waiting for a response, and I have waited long enough.
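As a concrete illustration, here is a minimal Python sketch of wrapping a synchronous dependency call with a hard timeout. The analytics call is simulated with a sleep, and all the names are hypothetical, not the video's actual code:

```python
import concurrent.futures
import time

def fetch_view_counts(blog_ids):
    # Hypothetical stand-in for the synchronous call to the analytics service.
    time.sleep(0.5)  # simulate a slow or unresponsive dependency
    return {blog_id: 0 for blog_id in blog_ids}

def call_with_timeout(fn, args, timeout_s):
    # Run fn(*args) in a worker thread and give up after timeout_s seconds.
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, *args)
    try:
        return future.result(timeout=timeout_s)  # raises TimeoutError on expiry
    finally:
        pool.shutdown(wait=False)

try:
    counts = call_with_timeout(fetch_view_counts, ([1, 2, 3],), timeout_s=0.2)
except concurrent.futures.TimeoutError:
    counts = None  # timed out: now decide what to do (ignore, default, retry, ...)
```

With `timeout_s=0.2`, the simulated 0.5-second call times out; a real service would pick that number from its own latency budget, as discussed below.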
It's like two friends deciding to meet, and one friend doesn't show up. How long should the other one wait? You can't wait forever; you wait at most, say, ten or fifteen minutes, and then you say, "hello, I'm leaving." That maximum time you are willing to wait is called the timeout.

The timeout value is not prescriptive; it depends on your use case. In a service like search, even though I wrote ten seconds above, you don't want your end user to wait even one second: search results need to be served very fast, so the timeout might be 200, 300, or 500 milliseconds. Ideally, the other service agrees to a contract: "no matter what, I'll serve the response in, say, 100 milliseconds," so that you can keep your own SLA at 200 milliseconds. Either way, a timeout contract should definitely be there.

Now, when service one was communicating with service two and service two didn't send a response in time, the timeout happened. How should we handle it? We'll discuss five approaches.

Approach one: ignore. Obviously not recommended, but this is what you are typically doing when you don't catch exceptions. Your search service made a call to the analytics service, waited, didn't get a response, ignored that fact, and just moved forward: whatever partial response it had, it sent to the user rather than waiting
to figure out what needed to be done. It simply ignores the no-show of the other service and moves on.

In some cases, you may even assume the operation succeeded despite not getting a response. Not in the search-and-analytics use case, but consider asynchronously storing something: you put a message into a broker, say a RabbitMQ queue, but you never got the acknowledgement from RabbitMQ saying "I have stored the message." You can assume the message is stored, but in reality it might not be there for others to consume. So you assume the operation succeeded when it actually failed, and that leads to an unpredictable user experience: you think it's done, but it's not done. Ignoring might therefore not be the best approach, though in some cases it is fine to ignore things and just move forward, in life and in microservices both.

A good practice to follow: always, always catch all the exceptions you can get and handle them in an informed way. A timeout happened; something went bad on the other service; you got a particular exception; whatever it is, handle it. Whether you are making a synchronous TCP call or an asynchronous message-broker push, handle the exceptions and take an informed decision depending on the type of exception you get, because any time you call a service, you know what kinds of exceptions to expect. Do it well, depending on the context; do not just ignore them blindly, though in some cases, as we said, you can.
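A sketch of what "catch everything and decide deliberately" can look like; the exception types and the decisions mapped to them are illustrative, not a prescription:

```python
def decide_on_failure(exc, idempotent):
    # Map each failure mode to an explicit, domain-aware decision
    # instead of silently swallowing it.
    if isinstance(exc, ConnectionError):
        # Could not connect at all: the request never reached the service.
        return "retry" if idempotent else "fail"
    if isinstance(exc, TimeoutError):
        # Ambiguous: the remote operation may or may not have succeeded.
        return "retry" if idempotent else "verify-then-retry"
    # Anything unexpected: surface it rather than guess.
    return "fail"

# A read-only analytics lookup is safe to retry on timeout...
assert decide_on_failure(TimeoutError(), idempotent=True) == "retry"
# ...but a non-idempotent write must be verified before any retry.
assert decide_on_failure(TimeoutError(), idempotent=False) == "verify-then-retry"
```

The key point is that each branch is a conscious choice for this domain, not a default.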
Approach number two: configure and use defaults. The search-and-analytics case fits this very well. Your search service waited up to 200 milliseconds, and the only data it was relying on the analytics service for was the total number of views. If the timeout happens, meaning it waited 200 milliseconds and got no response, it can fall back to a default value, say total views = 0, and still send a response to the user. So in some cases, instead of returning no response at all, you can substitute a sensible default and move on.

Approach number three, the most famous one: retry. Here, whenever there is a timeout, you assume the remote operation failed, so you retry it. Retries are very simple for read requests; in the search-and-analytics use case, it is easy. You waited 200 milliseconds, got no response, so your natural reaction is "let me retry": you fire the same request at the analytics service again, expecting that this time it will respond in, say, 50 milliseconds.

But retries become very tricky when your request is not idempotent. Idempotent means that firing the same operation twice has no extra repercussions: the other service handles the repeat cleanly, so you can fire the same operation twice
and nothing weird happens: you can fire it as many times as you like and end up in the exact same state on the other end.

A good example of non-idempotency is transferring money from account A to account B. Say ten rupees need to move from A to B. You fired the transfer request, waited for the response, got a timeout, and retried. But the original request had actually succeeded, so instead of ten rupees, twenty rupees got transferred to B. B is happy; A is not. Requests like this are non-idempotent, so you need to know whether the operation on the other service is idempotent or not; if it is not, you definitely do not want to blindly retry.

Retries are also tricky when the request is expensive. Say you are synchronously firing a very heavy real-time analytics query that computes a lot of things, or a heavy deep-learning, GPU-based request. If it is very expensive to compute, you may choose not to repeat that operation.

And the third tricky situation is when the other service is overloaded.
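One common way to make a non-idempotent operation like the transfer safe to retry is an idempotency key: the client attaches a unique id to each logical operation, and the server executes each id at most once. A minimal in-memory sketch, with all names hypothetical:

```python
import uuid

class TransferService:
    def __init__(self):
        self.balances = {"A": 100, "B": 0}
        self._processed = {}  # idempotency key -> result of the first execution

    def transfer(self, key, src, dst, amount):
        if key in self._processed:
            # Replayed request: return the original result,
            # do NOT move the money a second time.
            return self._processed[key]
        self.balances[src] -= amount
        self.balances[dst] += amount
        result = {"status": "ok", "balances": dict(self.balances)}
        self._processed[key] = result
        return result

svc = TransferService()
key = str(uuid.uuid4())          # generated once per logical operation
svc.transfer(key, "A", "B", 10)  # original request (say the response was lost)
svc.transfer(key, "A", "B", 10)  # blind retry: harmless now
assert svc.balances == {"A": 90, "B": 10}
```

In a real service the processed-key store would live in a database or cache shared by all instances, with an expiry.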
The situation is this: your search service depended on the analytics service, you fired a request, got no response, hit the timeout, and said "let me retry." So you keep retrying, but the analytics service is actually overloaded, and your retries are making the situation worse for it. If you keep firing the query again and again, the load never decreases and the service never gets time to recover. So you need to know when you should be retrying at all.

A few good practices for when you are sure you want to retry. First, retry with exponential backoff. Do not just wrap the call in a loop and hammer it: failed, retry, failed, retry. Instead, make the first retry after one second; if you still get a timeout, retry after two seconds; then four seconds, then eight, and so on. That is exponential backoff. Second, think about whether the service you depend on is idempotent. Any time you design a service, make it as idempotent as you can; if you must leave some scope of non-idempotency, keep it minimal. That gives other services the confidence to call yours as many times as they need to, without causing any weird behavior.
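The exponential-backoff loop described above, as a minimal Python sketch. The operation and delays are illustrative (the base delay is shortened here so the sketch runs quickly), and production versions usually add random jitter on top:

```python
import time

def retry_with_backoff(operation, max_attempts=4, base_delay_s=1.0):
    # Try operation(); on timeout, wait base, 2*base, 4*base, ...
    # between attempts, and give up after max_attempts tries.
    for attempt in range(max_attempts):
        try:
            return operation()
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: propagate the timeout
            time.sleep(base_delay_s * (2 ** attempt))

calls = []
def flaky_analytics_call():
    # Hypothetical dependency that times out twice, then recovers.
    calls.append(1)
    if len(calls) < 3:
        raise TimeoutError("analytics service timed out")
    return {"views": 42}

result = retry_with_backoff(flaky_analytics_call, base_delay_s=0.01)
assert result == {"views": 42}
assert len(calls) == 3  # one original attempt plus two backed-off retries
```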
Approach number four says: retry, but only when it is needed. In approach three we retried unconditionally; here, you have some way to check whether a retry is actually required, and only then do you retry. For example, in some systems you have a way to check whether a previous operation succeeded. Service A called service B, got no response, and hit a timeout, but before retrying, it can find out whether the operation it fired earlier actually succeeded. If it has such a check, it retries only when the past operation was not a success.

An example: a user accidentally posting the same tweet twice within a short time range. Say Twitter added a check: if a user posts the exact same tweet twice within one minute, it was probably an accidental double click on the tweet button, or some bug. The user's client makes a POST request to Twitter's API server, which stores the tweet in its database; the user's recent tweets are also stored in an array in some cache, say Redis or Memcached, alongside the main database where the tweet is actually
registered. Now suppose user one didn't get a response from Twitter's API server: the POST /tweet request went through, but the response was lost, so the user's JS client retried it. Because the retry comes very quickly, the API server can just check the cache: did this user just make this exact same tweet? If so, it can respond with a warning instead: "I think you just tweeted this; do you really want to post it again?" I've seen this behavior on a few websites, mostly social media, where you get a warning like "are you sure? You tweeted the same thing a few minutes ago." That is one solid use case of retrying only when it is needed.

Approach number five: re-architect. One thing to always ask is why you need the synchronous dependency in the first place; can you make it asynchronous? For the search-and-analytics case, you might want to think about this: what if the analytics data is ingested into your Elasticsearch index itself? That removes the dependency on the analytics service altogether. So treat each microservice as a silo: whenever anyone wants to talk to it, ask whether you can remove the synchronous dependency on that other service. If you can, that is the best possible thing you can do. A few good ways are going for an event-driven architecture, or ingesting the required data from the other service into your own, so that your search service works independently, without a synchronous dependency.
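The re-architecture idea can be sketched as a toy in-memory event flow: analytics publishes view-count events, and the search service ingests them into its own index ahead of time, so serving a query needs no synchronous call at all. A real system would use a broker such as Kafka or RabbitMQ; all names here are hypothetical:

```python
# Toy event bus standing in for a real broker (Kafka, RabbitMQ, ...).
subscribers = []

def publish(event):
    for handler in subscribers:
        handler(event)

# The search service keeps its own denormalized copy of the view counts.
search_index = {1: {"title": "Blog one", "views": 0}}

def on_view_event(event):
    # Ingest analytics data into the search index as events arrive.
    search_index[event["blog_id"]]["views"] += event["count"]

subscribers.append(on_view_event)

# Analytics emits events asynchronously...
publish({"blog_id": 1, "count": 7})
publish({"blog_id": 1, "count": 3})

# ...so a search query is served entirely from local data:
# no synchronous dependency, hence no timeout to handle.
assert search_index[1]["views"] == 10
```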
Try to remove synchronous communication as much as you can: go for an event-driven architecture, or duplicate the data by ingesting it into your own service. There are multiple ways to do it, but re-architecting your solution may be the best long-term option. In some cases it is not possible, but in most cases it is, so try to think that way.

From this entire discussion, a few key takeaways. First, always have timeouts: whenever you are doing synchronous communication with any service, have a timeout; you cannot wait forever. Second, picking a timeout value is tricky. If it is too short, you get a lot of false positives: say the analytics service responds in 200 milliseconds but your timeout is 150 milliseconds; that is an unrealistic expectation, so in most cases a response would have arrived, yet you still see timeouts. If it is too long, you get a performance bottleneck: waiting, say, two seconds for the analytics service puts pressure on your own service, drags its throughput down, and unnecessarily leads to a poor user experience. Third, make retries safe wherever possible: design your service with idempotency in mind, because otherwise retries become very tricky for the services consuming yours. Be as idempotent as you can in your implementation.

All right, nice. That's all I wanted to cover about handling timeouts in a microservices-based setup; I hope you found it interesting.
If you liked this video, give it a big thumbs up; if you like the channel, subscribe, and I'll see you in the next one. Thanks!
Info
Channel: Arpit Bhayani
Views: 18,104
Keywords: Arpit Bhayani, Computer Science, Software Engineering, System Design, Interview Preparation, Handling Scale, Asli Engineering, Architecture, Microservices, Microservices Timeout Pattern, Handling Timeouts in Microservices, Handling Timeouts in Distributed System, Good Distributed System Design, Distributed Systems, Robust System Design, Inter-service communication, synchronous dependency, Timeout in REST-based services, How to maintain SLA in Microservices., Database Engineering
Id: Hxja4crycBg
Length: 23min 38sec (1418 seconds)
Published: Fri Mar 18 2022