Building Fault Tolerant Microservices

Captions
Now we're going to have yet another talk about microservices, which have been a focus of this conference. This time it's going to be about fault tolerance, and I think that's particularly interesting, because microservices tend to distribute things and bring all the network issues into the application. So please welcome Christopher from Avanza.

Thank you. In 1666 a small fire started in a bakery in eastern London. At the time houses were built of wood and packed really close together, so the fire spread quickly. The Lord Mayor was called on to decide whether or not to demolish buildings to isolate the fire. "Nah," he said, "the fire seems quite small, let's not do that." After that decision I bet he had a pretty rough week: the fire spread extremely quickly, it raged for several days, and when it was finally extinguished it had destroyed the homes of some 70,000 of London's 80,000 inhabitants.

This is a planarian. It's a kind of flatworm, and it's an amazing creature: if you cut it in half it will continue to live on, and it will actually regenerate its lost half. You can even cut it into tiny pieces and it will regenerate. That is amazing resilience.

Today I will show you how to build your microservices less like London in 1666 and more like a planarian. I will do this by discussing why we need to build fault-tolerant services, describing some typical failure modes in a microservices system, and talking about three stability patterns that mitigate the effects of those failure modes. I will also give some monitoring guidelines for a microservices system, so that failures can be detected.

My name is Christopher, and I work as a developer at Avanza. Avanza is a bank that provides online services for managing your savings and investments. A quick poll: how many Avanza customers do we have here? Yeah, that's quite a lot — maybe half, maybe a bit more. Great. At Avanza we run a microservices platform that's quite scalable: it contains a bit more than 250 services, and these services run in about 1,000 instances. This platform allows us to be the largest actor on the Stockholm Stock Exchange, both in terms of the number of deals made and in terms of total turnover.

Today I will tell you a story about when I was working at Acme Books. This story is made up, it's entirely fiction. Acme Books' experiences are quite similar to the ones we have had at Avanza, but they make for more colorful examples and a domain that is easier to understand. Acme Books is an online book store that sells antique books.

At Acme we started small, with a web application connected to a database. A standard setup, and it worked great while we had few users. But the user base kept growing, we got more and more customers, and occasionally we ran into performance problems in the web application. We solved this by adding more instances of the web application to increase availability. That worked great for a while, until we got more performance issues: the web application stopped responding, and this was caused by slowness in the database. We didn't really find this strange; the database was used on every page, it was used for everything, so obviously if the database was slow the web application would be slow as well. We solved this by buying a larger database from a very expensive database vendor. Problem solved.

But business was good, money kept rolling in, and demand for antique books was really high. Eventually we couldn't scale the database any more, and we discovered that we had problems developing this application, this monolith where everything was in the web application.
We decided to make the switch to microservices, to be able to scale each individual service and to be able to scale development across several services as well. We arrived at something like this — this is a small part of the system. We had our web application, which our customers used to browse and buy books, connected to some services: a service for displaying a page with special offers, a service for listing the products that we sell, and a service for making payments. These services ran on separate machines, with over-the-wire calls between them, and each service had its own storage, so that storage could be scaled separately. This worked great.

Then one day things started to burn. The web application was not responding, customers were complaining, they couldn't buy any books. We restarted the web application; it helped for maybe a minute, but then it became non-responsive again. Okay, what's happening? The web application looks healthy enough, but we notice that the special offer service has stopped responding as well. Okay, that's no big deal — the special offers are on a separate page, it's just nice-to-have functionality. But an unresponsive special offer service shouldn't cause a web application outage. Hmm. We continue digging. The special offer service calls a third service, a purchase history service, to show special offers tuned to the customer's purchase history — and that service has stopped responding as well.

We had encountered a cascading failure, where the failure in the purchase history service had cascaded back to the web application. A failure in one part of the system had affected other, more critical parts of the system. If we compare this to the database case, where a database failure affected the web application, this is so much worse: the purchase history service is non-critical, just nice-to-have functionality on one page, but a failure in that service prevented the web application — which all users use — from working at all.

Obviously we want to prevent these kinds of failures in the future, so we started discussing what we could do. How about increasing the availability of our services? Let's add some redundancy, do more testing, do additional code reviews. We estimated that we could probably get every service instance up to an availability of five nines, which is equal to about five minutes of downtime per year per service. That sounds great, let's do that, right? Well, we had one smart developer who said: hold on a bit. We're talking about microservices here, we're bound to have a lot of instances of our services. Think about the future: what if we have 1,000 instances? If we allow failures to cascade freely with that many instances, then the combined availability will be five nines to the power of 1,000, and that equals roughly two nines of availability, or 87 hours of downtime per year. That is quite significant.

So we realized that we have to expect failure. When we have a lot of services there are a lot of moving parts, and some of them will fail despite our best efforts. We have to design for failure, or our system will burn down repeatedly, like London did in 1666.
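As a quick sanity check of those figures, here is the back-of-the-envelope arithmetic behind the "two nines" claim, using the numbers from the talk (this is just illustrative, not code from the presentation):

```java
// Back-of-the-envelope check: 1,000 instances at five nines, failures cascading freely.
public class AvailabilityMath {
    public static void main(String[] args) {
        double perInstance = 0.99999;                       // five nines per service instance
        double combined = Math.pow(perInstance, 1000);      // all 1,000 instances must be up
        double downtimeHoursPerYear = (1 - combined) * 365 * 24;

        System.out.printf("combined availability: %.4f%n", combined);          // ~0.99 (two nines)
        System.out.printf("downtime: %.0f hours/year%n", downtimeHoursPerYear); // ~87 hours
    }
}
```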
Having realized this, we started investigating the failure mode: what happened when the purchase history service stopped responding and affected the rest of the system? We started by looking at the web application, because that is the component we want to save. We ran a standard web container for the web application, and it contains a thread pool with a fixed number of threads — the blue boxes in this picture illustrate the threads. When a user wants to view a page, a request comes in and is assigned one of the threads from the thread pool. The thread in turn calls the special offer service, if we want to view that page: it makes a synchronous call and waits for the response. When all is well, the service returns and the user gets the response. Great. But when the service is not responding, the thread will wait for a reply, and while waiting it is blocked — it can't handle any other requests.

So far, no big deal: we have one blocked thread, the other threads can still be used for other things, our users can still browse products and make payments, so they don't notice anything yet. But as more and more requests to the special offers page come in, more and more threads are blocked, until eventually all threads are blocked. This can happen quite quickly. I mean, what do users do if a page doesn't load? They hit refresh. If it still doesn't load, they hit refresh a couple more times for good measure. So a single user can block a great number of threads just by hitting refresh a few times. And when all the threads are blocked, additional requests that come in are enqueued. This is a really bad situation, because even if the service were to recover and the threads were unblocked, we would still have to work through all the queued requests. So we really want to avoid enqueueing requests.

Now we understand what's happening at quite a low level in our application, but the question is: why don't the calls to the service return? Okay, the service doesn't respond, but shouldn't the calls return anyway? We looked into the code. The code that calls the services looks something like this: we create the URL to the special offer service, we create a URL connection, open that connection, get an input stream, and read the response from that stream. Okay, but why doesn't this return? Well, we realized that connect and read timeouts in Java are typically infinite by default, so this will block for an infinite amount of time if the service never responds.

This made us realize the first stability pattern: use timeouts. Timeouts prevent blocked threads — if we time out, our threads can go on and do other work. In this particular case we could set the connect timeout and the read timeout on the URL connection. How you actually set timeouts depends a lot on the kind of client library used for service communication, but in this simple case two lines were enough. We also realized that timeouts need to be aggressive. For one, the user experience is better if you wait, say, half a second before getting an error, compared to waiting five seconds. There is also less of a chance of blocking all threads if your timeouts are aggressive. And if you have a web application that calls many services, you have to consider the sum of the timeouts: if all the services were to time out, the total can become quite significant with long timeouts.

Shouldn't we have timeouts here as well, between the special offer service and the purchase history service? After all, the purchase history service was the one that started to act up in the first place. Yes, absolutely — here too. There should be timeouts between all service interactions. In fact, the special offer service is just another web container, and the same discussion and failure mode apply to it as well; the only difference is that instead of users, it's service calls coming in.
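In our simple case, the fix boiled down to roughly the following — a minimal sketch assuming a plain `HttpURLConnection` client; the endpoint URL and the timeout values are just illustrative:

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Calling the special offer service with explicit connect and read timeouts.
public class SpecialOffersCall {
    public static InputStream openOffersStream() throws IOException {
        URL url = new URL("http://special-offers/offers");   // hypothetical endpoint
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();

        // Without these two lines the JDK defaults are effectively infinite,
        // and a hung service blocks the calling thread forever.
        connection.setConnectTimeout(500); // give up on connecting after 500 ms
        connection.setReadTimeout(1000);   // give up if no data arrives within 1 s

        return connection.getInputStream();
    }
}
```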
Okay, great. We introduced timeouts at Acme Books, timeouts for all the service calls, and things worked great for a couple of months. We had some services that became slow and didn't respond, but the timeouts saved the rest of the system. Then one day we had a fire again: customers were complaining that they couldn't buy books, we had terrible response times in the web application and awful throughput. Okay, what's up now? Haven't we solved this kind of error, where the web application becomes slow? Let's investigate.

Well, it's the special offer service acting up again, but the calls to it are timing out, so that seems okay — the timeouts are working. But we notice that the service is being called a lot more often than it usually is, and there are a lot of timeouts. In fact, there are so many timeouts that our threads spend most of their time waiting for them; they don't do much other work at all, and this causes our throughput to go down the drain. We end up with a throughput that is lower than the number of incoming requests. Say we can handle 30 requests per second but 40 requests per second are coming in to the web application: we can't keep up with the load, so requests get queued, and since we can't keep up, the queue keeps growing. That is bad, because when the service recovers we have a really large queue to handle, a really large backlog of requests.

So what had happened? Why was everyone suddenly so interested in the special offers page, why were there so many calls to it? It turned out someone had made a change to display the special offers on every page, in the header: "buy these books". It seems like quite an insignificant change when you make it, but it can have quite scary effects. This made us realize that if you have a frequently called service, timeouts are not enough. We need something else, because having each thread wait out a timeout makes the throughput too bad.

This led us to the second stability pattern: circuit breakers. For non-developers, a circuit breaker is an electric switch which protects an electric circuit from overload: it detects the overload and breaks the current. Similarly, in software, a circuit breaker prevents calls to a broken service: instead of letting a call through, the call is failed fast if the service is judged to be broken. This is great, because we don't have to wait for timeouts, and it also offloads the service — we don't let calls through if we think it's broken or overloaded. This is what a circuit breaker looks like in real life. Pretty cool. I try to make my code look this awesome, but I still fail.

Let's see how it works in more detail. Assume we have the web application and our special offer service. The special offer service replies with a response and all is well. But when the service is slow, instead of a response we get a timeout. What the circuit breaker does is take that timeout and open the circuit: additional calls are not let through, they return immediately with an error instead, and we don't have to wait for the timeout. Great — this solves our throughput problem. We get an error immediately instead of calling the non-responding special offer service on every page. But what if the special offer service recovers? Then we probably want to show the special offers again, right? Yes, indeed. What we do in the circuit breaker pattern is to periodically let a single call through to the service.
If this single call fails, the circuit breaker is kept open and we continue to fail fast and return errors. But if the call succeeds, we close the circuit breaker, go back to the closed state, and resume normal operations.

There are a couple of options for when to open your circuit breaker. It might be a bit premature to open it on a single timeout, because there will be some jitter — you will get some timeouts due to garbage collections or network hiccups. What we usually do is open the circuit breaker when timeouts reach some threshold: say, if two percent or more of the requests time out, open the circuit breaker. We can also open it on unhandled errors of any kind: say 50% of calls fail with unhandled errors — okay, the service is probably a bit broken, so let's open the circuit breaker and offload it for a while. And maybe there is some known unrecoverable error, like a MachineIsBurningException, where we know that if one call gets this exception, additional calls won't succeed until the underlying problem is fixed — then we might as well open the circuit breaker immediately.

One thing we also discovered when we put the special offers on every page was that as soon as a call generated an error — for example a timeout, or when the circuit breaker was open — we got this beautiful little error page. Not very user friendly. If fetching the special offers for the header fails, we probably just want to show some sensible default. So we learned to handle our service call errors — in the simplest form just a try/catch that returns a sensible default value — to avoid whole-page crashes, or crashing more of the flow than is actually necessary.
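Putting those pieces together, a circuit breaker can be sketched roughly like this. This is a deliberately minimal illustration, not the implementation from the talk: it opens on any failure and lets a trial call through after a fixed interval, whereas a production implementation (such as Hystrix, discussed later) opens on rolling error-rate thresholds and allows only one trial call at a time.

```java
import java.util.concurrent.Callable;

// Minimal circuit breaker sketch: fail fast while open, periodically allow a trial call.
public class CircuitBreaker {
    private static final long RETRY_INTERVAL_MILLIS = 5_000;

    private volatile boolean open = false;
    private volatile long openedAt = 0;

    public <T> T call(Callable<T> serviceCall, T fallback) {
        if (open && System.currentTimeMillis() - openedAt < RETRY_INTERVAL_MILLIS) {
            return fallback;                        // open: fail fast, don't touch the broken service
        }
        try {
            T result = serviceCall.call();          // closed, or a periodic trial call while open
            open = false;                           // a successful call closes the breaker again
            return result;
        } catch (Exception e) {
            open = true;                            // any failure (including a timeout) opens it
            openedAt = System.currentTimeMillis();
            return fallback;                        // sensible default instead of an error page
        }
    }
}
```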
With timeouts and circuit breakers in place, things worked well for quite a while. We had some more problems with frequently called services, but the circuit breakers saved us — until one day things started to burn again. Don't we ever learn? This time the symptoms were terrible response times and awful throughput in the web application. Sounds familiar? Yeah, basically the same symptoms as last time, before we introduced the circuit breakers. We started to investigate. Well, the special offer service is slow again. Yes, of course we told those developers to fix their issues, but we also learned that we need to handle the systemic problem of slow services affecting their callers. So why is the web application behaving strangely when special offers is slow? In this case we had slow responses from the service, but not slow enough to trigger a timeout. The threads were still spending most of their time waiting for responses, and once again the throughput dropped below the number of incoming requests. This made us realize that if response times are slow but still below the timeouts we have set, we need more protection: timeouts and circuit breakers are not enough.

This led us to the third stability pattern: bulkheads. For non-developers, a bulkhead is a watertight compartment on a ship. The intention is that if the hull is breached, only one compartment should be flooded and the ship will continue to float. Similarly, in software systems, a bulkhead is a pattern for isolating components from each other, so that faults in one component don't affect others — to prevent cascading failures. One example of a bulkhead is limiting the allowed memory usage of a process: if it has a memory leak, it can only consume up to a certain limit, it won't eat all the memory on the machine, and the rest of the processes live on.

There is a specific kind of bulkhead that is very powerful in a microservices setting: limiting the number of concurrent calls to a service, the number of calls from one service to another. This puts an upper bound on the number of waiting threads, so let me talk about it a bit. If we have the web application and the special offer service, we can add a bulkhead of size, say, two in front of that service, between the web application and the special offer service. With this we are saying: we allow at most two concurrent requests from the web application to the special offer service. If we already have two outstanding requests to the service and we try to make an additional request, that request fails fast with an error. So we can never have more than two threads waiting for a reply from the special offer service.

Adding this kind of bulkhead would have saved us when we had slow response times but no timeouts from the special offer service. Users would experience it in a couple of different ways: the users whose threads are let through the bulkhead get a slow page load — we wait for the special offers — but their page includes the offers; the users whose threads are rejected by the bulkhead get a fast page load, but without the special offers, because we get an error and return the default value.

These bulkheads are placed one per service, on the caller side. In this case we have a bulkhead of size two towards the special offers service, one of size four towards the product service, and one of size three towards the payment service, all from the web application. This gives an upper bound of nine waiting threads in total: the web application will never have more than nine threads waiting on slow services. This offers great protection against cascading failures — our services can behave basically however they want and we are quite well protected. But only if the bulkhead sizes are significantly smaller than the request pool size: if we have a bulkhead of size 40 when calling a service and only 50 request threads in the application, we can still block up most of the threads and get bad performance. So we need to size our bulkheads small.

We can reason about bulkhead sizes by looking at the peak load when the system is healthy. Assume we have 40 requests per second to a service, and these requests are handled in 0.1 seconds — that is the peak load when calling the service. A suitable bulkhead size would be to multiply the 40 requests per second by the response time of 0.1 seconds, which leads to a bulkhead size of four. That would work if the load were constant, but there will be some variation, some jitter, so we probably need to add some breathing room as well. Okay, let's add an additional three. That's basically how we thought when we sized our bulkheads.
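As a worked example of that rule of thumb, with the numbers from the talk (the variable names are just for illustration):

```java
// Rule of thumb: bulkhead size ≈ peak request rate × normal response time, plus headroom.
public class BulkheadSizing {
    public static void main(String[] args) {
        int peakRequestsPerSecond = 40;     // peak load towards the service when the system is healthy
        double normalResponseSeconds = 0.1; // typical response time under that load
        int headroom = 3;                   // breathing room for jitter and request bursts

        int size = (int) Math.round(peakRequestsPerSecond * normalResponseSeconds) + headroom;
        System.out.println("bulkhead size: " + size); // 4 + 3 = 7
    }
}
```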
An additional benefit of this type of bulkhead is that it protects services from overload: if a runaway client tries to call a service extremely frequently, with extreme concurrency, it won't be allowed to, because the bulkhead limits the number of concurrent calls.

A straightforward way to implement a bulkhead is to use a semaphore. A semaphore is a concurrency construct that limits the number of concurrent accesses to a resource. It is initialized with a number of permits — in this case two permits, which allows the resource to be used concurrently by two different threads. When we want to protect a service call with this semaphore bulkhead, we start by trying to acquire a permit from the semaphore with a timeout of zero seconds, so if the acquire fails we don't wait for a permit to become available. If the acquire succeeds, we call the service — in this case getOffers on the special offers client — and we finally release our permit back to the bulkhead. If the tryAcquire call fails, it means there are already two threads executing that call, so we throw an exception, a RejectedByBulkheadException: we are not allowed to call this service because there are already two concurrent callers.
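A minimal sketch of that semaphore bulkhead might look as follows. The generic wrapper and the exception class are illustrative, not a real library API:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.Semaphore;

// Semaphore-based bulkhead: at most `size` concurrent calls, fail fast otherwise.
public class SemaphoreBulkhead {

    public static class RejectedByBulkheadException extends RuntimeException {
        public RejectedByBulkheadException(String message) { super(message); }
    }

    private final Semaphore permits;

    public SemaphoreBulkhead(int size) {
        this.permits = new Semaphore(size);   // e.g. 2 towards the special offer service
    }

    public <T> T call(Callable<T> serviceCall) throws Exception {
        // Non-blocking acquire: if all permits are taken we fail fast instead of queueing.
        if (!permits.tryAcquire()) {
            throw new RejectedByBulkheadException("Too many concurrent calls to this service");
        }
        try {
            return serviceCall.call();        // e.g. () -> specialOffersClient.getOffers()
        } finally {
            permits.release();                // always return the permit, even when the call fails
        }
    }
}
```

Wrapping the special offers call would then look something like `bulkhead.call(() -> specialOffersClient.getOffers())`, where `specialOffersClient` is a hypothetical client object.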
Okay, great: timeouts, circuit breakers, bulkheads — we're in great shape now, right? Nothing can stop us. Not quite. It actually worked great for quite a while, but one day, under extra high load, things started to burn again. This time we had many threads waiting for replies from services, quite few threads available in the web application, and a lot of service calls being rejected. We had sized our bulkheads small, but we call a lot of services from the web application, and what we saw was that all our bulkheads — the three in the picture, and many more that didn't fit into it — were full of blocked threads, which left just a few threads in the web application to handle the remaining load.

Okay, threads are blocked calling services — shouldn't we have something to protect against that? Right, like the timeouts we introduced early on. But where had they gone? Why were we blocked waiting for service calls? It turned out that we had upgraded one of the client libraries we used to call our services. The upgrade had changed the way the library was configured, so the timeout settings we had made stopped working. We had no timeouts any more. So with a broken client library, more protection is required, and we later discovered that this is not too uncommon: client libraries are black boxes, they can do all kinds of nasty things — add extra calls, retry on failure, or simply have a bug that deadlocks the thread.

What we turned to was thread pool handovers. When calling a service, we hand the call over to a separate thread pool, and that thread pool in turn calls the service. The request thread waits for the reply, and when the thread pool returns with the reply we get it back on the calling thread. This way our request threads, our calling threads, can always walk away from a service call. So this is a kind of generic way to set timeouts — it's not a stability pattern per se, it's more a generic way to ensure that you can always time out your service calls from the request thread's perspective.

Let's look in detail at how a thread pool handover works. We have the web application with our request thread pool, and within it another thread pool that we call the service thread pool; this pool is responsible for calling the service. When we want to make a service call, the request thread hands the call over to the service thread pool, a service pool thread makes the call and returns with the reply from the service, and the waiting request thread gets the reply. Hand over the call to a pool, wait for a reply, then return. If the service doesn't respond — for example because the client library is broken — then we get a timeout while waiting for the reply: we have already handed the request over to the service thread pool, and we return with an error. The thread in the service thread pool stays blocked, but the request thread is free to do other work. We will never block up all the request threads. We still need timeouts, though: we saved the request threads this time, but without timeouts on the service calls the service thread pool would fill up with blocked threads and all calls would be rejected. An additional bonus when using thread pool handovers: if the pool has a fixed number of threads and we don't enqueue requests, we get a bulkhead included. If we try to call the service when there are already three outstanding service calls, we fail immediately with an error. So this is great — we get generic timeouts and bulkheads with the same construct.

A straightforward way to implement thread pool handovers is to use a standard Java ThreadPoolExecutor. Here we initialize it with three threads — the first two threes are the number of threads — and we use a SynchronousQueue. This is important, because if you don't specify which queue to use, requests go onto an unbounded queue. A SynchronousQueue prevents requests from being enqueued; instead executions are rejected when all the threads are in use. When we want to protect our getOffers call with this thread pool handover, we submit the call to the executor, we get a Future in return, and we wait on that future with our desired timeout, in this case one second. If we get a RejectedExecutionException when submitting the job, it means we already have three outstanding calls — the pool is full — so we throw a RejectedByBulkheadException: our bulkhead thread pool rejected the call. If we get a timeout while waiting for the future, we throw a ServiceCallTimeoutException.

Thread pool handovers are very powerful: you get both generic timeouts and bulkheads in the same construct. But we were a bit worried about performance — should we really hand every service call over to a separate thread pool, doesn't that add a lot of overhead? We measured it a bit, and since we are already making over-the-wire calls, the extra cost of submitting to a thread pool is quite insignificant. In the few cases where we thought it was significant, we simply used semaphore bulkheads instead and made sure the client libraries involved could be trusted to always honor our timeouts. That option is always there.
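A sketch of that executor-based handover might look as follows; as with the semaphore example, the wrapper and exception classes are illustrative, not a specific library:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Thread pool handover: generic timeouts plus an implicit bulkhead.
public class ThreadPoolHandover {

    public static class RejectedByBulkheadException extends RuntimeException {
        public RejectedByBulkheadException(String message) { super(message); }
    }
    public static class ServiceCallTimeoutException extends RuntimeException {
        public ServiceCallTimeoutException(String message) { super(message); }
    }

    // Fixed pool of three threads and a SynchronousQueue: nothing is ever queued, so a
    // fourth concurrent call is rejected immediately — the pool doubles as a bulkhead.
    private final ExecutorService servicePool =
            new ThreadPoolExecutor(3, 3, 0, TimeUnit.SECONDS, new SynchronousQueue<>());

    public <T> T call(Callable<T> serviceCall, long timeoutMillis) {
        Future<T> future;
        try {
            future = servicePool.submit(serviceCall);   // hand the call over to the service pool
        } catch (RejectedExecutionException e) {
            throw new RejectedByBulkheadException("All service pool threads are busy");
        }
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS); // the request thread can always walk away
        } catch (TimeoutException e) {
            future.cancel(true);
            throw new ServiceCallTimeoutException("No reply within " + timeoutMillis + " ms");
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException("Service call failed", e);
        }
    }
}
```

A call would then look something like `handover.call(() -> specialOffersClient.getOffers(), 1000)`.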
Excellent, now we have thread pool handovers as well. We have a lot of services, the system has grown over time, it has become a complex beast, and there are a lot of knobs to turn: timeouts, bulkhead sizes, thread pool sizes, lots of configuration. We probably need to throw in some monitoring to get a better understanding of the system. But what should we monitor? Of course the obvious stuff — heap sizes, CPU utilization, that's a no-brainer — but what is specific to a microservices architecture, what can we do there?

We discovered that a great place to introduce monitoring is the service calls. We have a lot of services, and to understand them well we can monitor how they are called. These are our integration points, the places where the system integrates with itself, and it's also a good place to detect configuration problems: if we protect all our service calls with these stability patterns, we need to monitor that we have configured the patterns correctly.

A crucial thing to measure is the timeout rate, the rate at which calls to a service are timing out. It's a good problem detector, and also a good place to validate configuration — have we configured our timeouts properly? The rejected call rate, the rate at which calls are rejected by our bulkheads, is also a great problem detector, because if a service is overloaded and becomes slow, or we have a network problem, we will quickly see rejected calls. The short-circuit rate, the rate at which calls are failed fast by the circuit breaker, is mainly there to verify that we have configured the circuit breaker's open criteria properly. We also measure total failure and success rates, to understand the load on a service and how often it fails, and the response times of our service calls. We actually learned that response times and total rates are good when drilling down into a problem, but by far the best problem indicators are the timeout rates and rejected call rates, because as soon as there is a problem it will show up there. Response times are harder to interpret — what does a slow response time actually mean?

Here is an example from our monitoring system. This chart shows the number of failed service calls per minute, where the different colors show different types of failures: green are rejected calls, red are timeouts, and purple — which we don't see any of — are short-circuited calls. As you can see there are some errors in this system, about 100 to 150 failed calls per minute, but this is no big deal: this system has a load of several million service calls per minute, so 100 or 200 failed calls per minute is actually to be expected. If you have configured your system properly, you should expect to see some failures, because there will be jitter: garbage collections, network hiccups, request bursts. You should rather be worried if you don't see any errors at all — that probably means you have configured your timeouts far too long and your bulkhead sizes far too large.

When we see problems in this monitoring, it is quite tempting to run off and turn the configuration knobs: we're getting rejected calls, let's increase the bulkhead sizes; we have timeouts, let's increase the timeouts — we've probably configured the system wrong. But we learned that this is often the wrong response. We need to understand the underlying issue first, because changing the configuration without understanding what is happening in the system might make the problem worse. If we increase a bulkhead because we see rejected calls — well, maybe the calls were rejected because the service was overloaded, and increasing the bulkhead size will make it even more overloaded. Not great.
Okay, excellent, we have monitoring as well. So we have service call monitoring to detect problems, and we have our bulkheads, thread pool handovers, timeouts, and circuit breakers. The system is now actually quite stable; it has worked well for quite a long time and we're happy that we finally saved it. We still run into some problems, some hiccups, but we no longer have this cascading behaviour: the services that fail, fail on their own, and the failures are not propagated to the rest of the system. We're close to the planarian stage — we can almost regenerate the system when we lose half of it, but not quite.

Some of you might say this seems like a lot of work — do we have to build all of this? Well, yes, if you had to implement all of it yourself it might be a lot of work. Fortunately there are third-party libraries to help you. The developers over at Netflix have made some awesome stuff, for example the fault tolerance library Hystrix. Hystrix implements these patterns: circuit breakers, thread pool handovers, bulkheads. It's actually quite great, you should check it out.

I will show a short example of how to use Hystrix to protect your service calls. Here I want to protect my getOffers call to the special offers service. What we do is implement a class that extends HystrixCommand — a GetOffersCommand. We initialize it with a command group key; this is a Hystrix concept used to group similar commands together, and by default commands with the same key use the same thread pool. I override the run method to make my actual service call — here it calls getOffers on the special offers client, which does the actual work. When I want to call the service protected by Hystrix, I create a new instance of the GetOffersCommand and call execute on that instance. By doing this, the command is protected with a thread pool handover, which also acts as a bulkhead, we get a circuit breaker, and Hystrix also helps with monitoring — it records the call and we can get metrics out later if we want. Additionally, Hystrix helps us handle service call errors: we can override the getFallback method of the command and provide a default in case the service call fails. This getFallback method is called whenever the run method throws an exception.
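Put together, such a command looks roughly like this — a sketch against the Hystrix 1.x API, where `SpecialOffersClient` and `Offer` are hypothetical application types standing in for whatever client and data type the real getOffers call uses:

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import java.util.Collections;
import java.util.List;

// Hystrix command protecting the getOffers call with a thread pool handover,
// bulkhead, circuit breaker, metrics, and a fallback.
public class GetOffersCommand extends HystrixCommand<List<Offer>> {

    private final SpecialOffersClient client;   // hypothetical application client

    public GetOffersCommand(SpecialOffersClient client) {
        // Commands with the same group key share a thread pool by default.
        super(HystrixCommandGroupKey.Factory.asKey("SpecialOffers"));
        this.client = client;
    }

    @Override
    protected List<Offer> run() {
        return client.getOffers();        // the actual service call, executed on the Hystrix thread pool
    }

    @Override
    protected List<Offer> getFallback() {
        return Collections.emptyList();   // sensible default when the call fails, times out or is short-circuited
    }
}
```

Calling the service protected by Hystrix is then just `new GetOffersCommand(client).execute()`.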
Hystrix also provides real-time monitoring data. This dashboard, with lots of blobs, shows the actual load on the Hystrix commands in real time. You can zoom in on one of them to see the details — lots of numbers, but just to give you an idea of what Hystrix can help you with: you see the number of calls per second, the number of failed calls, the number of successful calls, the response times. So you get some great real-time monitoring capabilities from Hystrix as well.

So, to sum up: when we run a microservices system we have a lot of services, they can fail in a lot of different ways, and there are a lot of moving parts — we have to design for failure. We handle these failures by ensuring that our service calls always time out; by implementing circuit breakers to fail fast and offload broken services; by using bulkheads — specifically the kind that limits the number of concurrent calls to a service — to prevent cascading issues; and by monitoring our service calls to detect problems and to verify that we have a proper configuration. If you want to learn more about these concepts, I recommend the great book Release It! by Michael Nygard. He goes into depth about these patterns and also has lots of other good guidelines for building stable, production-ready systems. The GitHub pages for Hystrix also contain both information about Hystrix itself and in-depth information about these stability patterns. Attributions for all the nice images I've used — thank you, artists. And thank you for listening.

If you have any questions, we have five or six minutes.

Q: You had some examples of, for instance, how to calculate the size of a bulkhead. How are those numbers affected if you have multiple clients and multiple servers?

A: So the question is: how are the bulkhead size calculations affected if you have multiple service instances and multiple callers? That's a good question. Since the bulkheads are placed on the caller side, we put one in each instance of the caller — if you have several instances of the web application, there is a bulkhead in each instance. So you have to calculate how many calls are made from this instance to that particular service; you have to look at the point-to-point relation.

Q: But if there is a load balancer in between?

A: The question was what happens if there is a load balancer in between. The bulkhead is still on the client side: we look at the load and the response time from the caller's side, so it doesn't matter how the service is implemented behind it — load balanced or a single instance — we look at the number of calls from the client and the response times of those calls.

Q: I have a sort of related question about multiple clients. In your example, which was great, it was mostly the web app that called many different services, but you may have an environment where your back-end services have many different clients. If I understood correctly, many of the patterns are implemented on the client side — do you have any experience with solving that? I'd rather not implement the protection several times, and maybe I have a diverse environment with services written in different languages. Would I need to implement the protection several times?

A: So, do I have to implement the protection several times if my services are implemented in different languages and so on? Well, yes, you will have to implement it several times — you need to protect all your service calls. Can you elaborate a bit?

Q: No, I was just hoping you had a brilliant idea so I wouldn't have to do that.

A: No, unfortunately I don't have a brilliant idea for avoiding that. One way would be to implement something on the server side to limit the number of callers, but that becomes hard, because your services shouldn't really be aware of their clients and how they call you. So you probably end up with client-side protection for all of your services.

Q: To throw another curveball in there: what if you have auto-scaling services calling other services — can you dynamically adjust the thresholds of your bulkheads and your timeouts?

A: Could you repeat the beginning of the question?

Q: If things auto-scale dynamically — say you get an unexpected amount of load and, for instance, your front ends scale up, and they're all configured the same —
then obviously, if you get twice the number of front ends calling, you'd probably want to halve the numbers on the bulkheads. Would that be possible — do you know if that's even possible?

A: So the question is: if we have auto-scaling — for example the web application scales up to double the number of instances — can we somehow automatically reconfigure our bulkheads, because you would probably want smaller ones? I think it's most important to realize that the goal of the bulkheads is to protect the client, so the bulkheads should be configured to protect your client regardless of what's happening on the service side. Even if you scale out the web application layer, I don't think you should have to worry about the combined bulkhead sizes becoming too large — unless you're worried that you let too many requests through to the service side, which might bring it down. As for how auto-scaling of the back-end or front-end services would affect the configuration of the bulkheads and the other protections, I don't really think it should, because the main goal of the bulkheads is to protect the client, so if you scale out they should remain the same. It's possible that, since you're distributing the load over more instances, you could decrease the bulkhead sizes because of that, but... hmm, we should probably discuss that afterwards so I don't go too far down that path here.

Q: Okay, I think we have time for one more quick question. How deep do you fan out — how many steps do you call, at most, to other services?

A: So the question is how deep we fan out, how many steps of calls to other services — specifically at Avanza, what's the max? That's a good question. I'm not sure, but I think it's not common to do more than two or three steps; I think those cases are quite rare.

Okay, thank you very much. — Yeah, thank you.
Info
Channel: Jfokus
Views: 13,949
Id: pKO33eMwXRs
Length: 51min 16sec (3076 seconds)
Published: Tue Feb 23 2016