Availability and best practices for the Azure Cosmos DB .NET SDK - Episode 13

Captions
Welcome, everyone, to Azure Cosmos DB Live TV. Thanks for joining us this week for episode 13. This week I've got a very special guest with us, someone I really admire: Matias Quaranta. Matias works on our .NET SDK team and he is responsible for some of the most loved features in our SDK. I think you wired up the change feed; you don't do the backend for the change feed, but you do the front end, the SDK piece of it. Bulk mode is you, if I'm not mistaken, and transactional batch is you, if I'm not mistaken. I can't tell you how much I love transactional batch, because I really hate writing stored procedures in JavaScript. Now that I can do them in the SDK, I show that to anybody and everyone whenever I'm doing demos: check out this feature, transactional batch, you will love it. So anyway, welcome, good to see you.

Thank you, Mark, and thank you everyone for having me. For me it's a real pleasure to be here, and I got the lucky number for the episode. I was lucky enough to work on the SDK, and we are a team, so every single feature, like you said, involves the backend, the front end, the SDKs, different libraries. It's a team effort to ship those features, and it's great when we see users actually using them to be successful and build applications.

And that's basically the idea with this talk today. Today you're going to show us some of the coolest tips and tricks, internals, and advanced topics. If you're a user of our .NET SDK and you're trying to eke out every single ounce of performance out of this thing, this is probably a good episode for you to join. Would that be right?

Yeah, we will try to pop the hood, I think that's the expression, to see what's happening inside the SDK in terms of connectivity, what happens on each operation, and which are the best practices. The idea here is, whether you are writing an application from scratch or you already have an application in production, to check which things you should be doing to avoid the most common pitfalls. Then we'll cover the most common failure scenarios, because with Cosmos DB we want to provide the highest possible availability, and the SDK plays a really big part in that. We'll go through the most common scenarios, how the SDK behaves, and which expectations you should have when your SDK is configured correctly.

This is great. We have not yet talked about availability and failover on this show, so it's great that you're going to cover it today. Fantastic. So where are we starting off? Connectivity mode?

Exactly. Before talking about how the connections are maintained and how the payload travels, I wanted to start with the basics: the connectivity modes you have available in the SDK, meaning the options you have when you use the SDK to connect to your account. But before going into that, I wanted to take a couple of steps back and do an overview of the system topology.
We know that we can create containers, or collections, and these containers might span multiple physical partitions. As your data needs grow and you keep storing more information, the number of physical partitions will keep increasing to accommodate your data. Each physical partition is actually composed of what we call four replicas, and these are four different machines in four different racks in the data center, so the probability that all four machines fail at once is rather small, because they are separated in different Azure racks. And if you add multiple regions to your Cosmos DB account, then this partition is actually replicated across each region, so you have four replicas in each region.

Whenever you save a document to the Cosmos DB service, we take the logical partition key that you are sending with that document, we apply a hash algorithm, and that tells the system which physical partition the document should be stored in. When the document is stored in one of the replicas, it is also replicated to all four replicas within that same partition, so your document is actually copied or replicated four times within that region. If you add more regions, the document is also replicated across all the replicas in all the other regions. That is how we achieve high availability of the data: it is replicated all over, and the chance of all these machines across all these regions failing at the same time is rather small.

Now that we know that a physical partition is backed by four replicas, that will help us understand how the two connectivity modes work. In front of these replicas and partitions there is a component we call the gateway. If some of you have read the REST API documentation for the service, you know that you can interact with the service through the SDKs, but you could also write your own REST API wrapper and send HTTP requests to the service, and the component that receives those REST API calls is what we call the gateway. The gateway is a set of machines in each region that handles all the HTTP requests that hit the service. It provides information such as the account information (the consistency, the topology, which partitions are available), the collection information, and the routing information that we'll see in a little while; it gives you all of what we call the metadata about the account. It lets you do operations that we define as management plane, like creating a database or creating a collection, and it also lets you do data plane operations, like creating a document, when you interact through the REST API endpoint.

That gives me the kickoff for the first connectivity mode available when you use the SDKs, which is Gateway mode. As the name says, the SDK will use the HTTP protocol to connect to the REST API, that is, the gateway, to issue all operations: all management plane operations, when you create a collection or a database, and also data plane operations, when you create, read, or execute queries. Everything goes through the HTTP protocol, through the gateway.
The gateway then opens TCP connections to the backend replicas to actually execute the operations, gets the results from the backend replicas, and basically pipes or returns the result to you, or to your application, as the caller.

The other connectivity mode we have available is the one we call Direct mode, which is basically your application, instead of going through the gateway, connecting directly to the backend replicas through TCP. It opens TCP connections to the backend replicas and backend partitions and executes the operations directly there.

So what are the big differences between these two modes? The first one is obviously the protocol: Gateway uses HTTP, while Direct mode uses a TCP protocol. Gateway mode is more compatible with enterprise environments that have proxies or hard network rules, where you can only open a particular, maybe very limited, subset of ports, or where you are limited in terms of which URLs or endpoints your application can connect to. It uses fewer ports and fewer connections, because we are basically just opening an HTTPS connection to the account endpoint, which is normally youraccountname.documents.azure.com, using port 443, and sending the operations through there. If we remember the diagram we just saw, Gateway mode has a higher latency than Direct mode, and the reason is that you have a network hop in the middle: you are not directly reaching the backend replica, you are connecting to the gateway, the gateway is relaying your operation to the backend replica, getting the result, and then returning the result to you, so you pay that extra latency.

That is the actual benefit of Direct mode. The latency you get on Direct mode is backed by our SLAs; the operation latency once your request reaches the replica is, if I remember correctly, something like 10 milliseconds for read and write operations, and that is fast compared to Gateway. Not saying that Gateway is going to be super slow, but if you compare the latency between the two modes, Direct mode will always be faster, and that is one thing you need to keep in mind when you are making the choice. We always recommend Direct mode, but while Direct mode has the pro of being faster, it might also have the issue of using a higher number of connections and a wider range of ports that you need to open when you use that connectivity mode.

I often hear from customers who have issues with Direct mode, SNAT exhaustion, right? Exactly. And they can get around that if they use something like a VNet, using service endpoints, I think. Is that correct? Yes, that's correct. There are a couple of workarounds. I don't want to go over them right now because it's something I will comment on in a bit, but it's one of those scenarios where you need to keep in mind that this will use more connections than Gateway mode, and more ports, so if your application runs inside an enterprise environment you might need to open more ports in those network configurations. Those are the trade-offs you need to compare, and the connectivity mode itself is just a switch on the client options, roughly as in the sketch below.
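For reference, a minimal sketch of how that choice looks with the .NET v3 SDK; the account endpoint, key, and option values here are placeholders, not anything specific to the demo:

```csharp
using Microsoft.Azure.Cosmos;

// Direct mode (the default in the v3 SDK, and the general recommendation):
// TCP connections straight to the backend replicas.
CosmosClient directClient = new CosmosClient(
    "https://your-account.documents.azure.com:443/",   // placeholder endpoint
    "your-account-key",                                 // placeholder key
    new CosmosClientOptions
    {
        ConnectionMode = ConnectionMode.Direct
    });

// Gateway mode: everything over HTTPS (port 443) through the gateway,
// useful behind restrictive proxies or firewalls where ports are limited.
CosmosClient gatewayClient = new CosmosClient(
    "https://your-account.documents.azure.com:443/",
    "your-account-key",
    new CosmosClientOptions
    {
        ConnectionMode = ConnectionMode.Gateway
    });
```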
So the question becomes which of these two modes is right for your particular setup and environment.

Got it. So I guess the difference, then, is that in Direct mode you still have to connect to the gateway first, to download or grab all the mapping information, correct? And then it basically treats your client like a gateway, doing all the physical partition and partition key range mapping, rather than having to do it on the server, right?

Exactly, and that was a good point, because that's actually the next thing I'm going to discuss. Normally the first question is: okay, I have the two connectivity modes, I create my client, and I do my first operation, for example create item, so what happens then? What the SDK does on that first request is call the gateway, using the HTTP protocol, and obtain the information it needs in order to execute that particular operation. That information is the account information, for example which regions the account is replicated to and which consistency this particular account uses. We also fetch the container information: you are executing this operation against some container, or collection, so we need to understand the partition key definition, or partition key path, for that container. And finally, like you mentioned, we need the routing information, which tells the SDK the addresses for the different partitions that this particular container is using. Once we have that information, the SDK stores it in an in-memory cache, opens the TCP connection to the backend replicas, and executes the operation. When a second operation comes after the first, we already have all the information we require in the cache, so we just reuse the TCP connection that is already open and send that operation to the partition. Now, if this second operation targets a different physical partition, the difference is that we will open a second TCP connection and execute the operation on a replica of that other partition. This example has only two partitions, but this is where you start to see why Direct mode might use more connections than Gateway mode: as your collection grows, or if it has many partitions, TCP connections need to be established to the different partitions as your operations keep happening. So this is what happens underneath the hood when you start to perform operations.

Since we have plenty of time, I have a small code sample to showcase that. This is the v3 SDK, just calling create item with a very simple model. What I'm going to do is run this, and on the other side I have Fiddler. What I want to show you is which HTTP requests happen on the first operation I do with the SDK, and that those requests don't happen again after that first one as I keep doing operations. So, once I start the application and hit that endpoint: okay, this is my Fiddler, and I'm going to do some zooming here because Fiddler doesn't let me increase the font, bummer. Once I execute that operation, let's switch over to Fiddler.
Here we are; okay, that's the breakpoint. The first thing we see is this request, which is actually the request to get the account information. You'll see it's my accountname.documents.azure.com, and it's basically an HTTP request to the root of that address. If we open this one and look at the actual response, you'll see it's the information for my account, including which regions it's in; this account is only in West US 2, for example, plus a bunch of other information particular to the account. The SDK stores this in an internal cache. The same goes for the collection, or the container: in my case the container is called episode13, so we have a request here that returns the collection or container information, and then we have the routing information, like you said, which are the partitions and which are their addresses. Once we have that information, the SDK can resolve the operation. You won't see the actual operation here, because we are in Direct mode and Fiddler only tracks HTTP requests. What I'm doing now is starting a while loop, an infinite loop, so obviously I had to set the breakpoint, and you can see it's running because I'm seeing the logs here, creating new items, but there weren't any new HTTP requests in Fiddler. The reason is that we already have the caches and we already have the connections open, so the next requests are just hitting the replicas directly. That is basically how the SDK handles the initial connection and the initial operations.

Yeah, that's cool.

So, going back here: I explained all that because it gives us a baseline, a starting point, to understand what happens when things don't go as planned, which failure cases and failure scenarios we normally see when troubleshooting issues with the SDK, and which things you can do to mitigate or avoid those pitfalls.

Before actually listing the different scenarios, one important thing: both the Java and .NET SDKs expose some critical information in the form of what we call diagnostics, either for success scenarios, where the operation completed successfully, or for failure scenarios. In .NET, the item response, the response of the operation, has a property called Diagnostics. That diagnostics object has three important data points. One is the elapsed time: how long that particular operation took between you executing the operation and the SDK getting the response and returning it to you. Another piece of information, and we'll see why it's important in a little bit, is the contacted regions: one particular operation might, under certain conditions, span multiple regions, and it's important to know that that information is available for you in the diagnostics. They are available both on the response and in exception scenarios; it's the same type, so it has the same API surface, Diagnostics.GetContactedRegions() in .NET. The other important part is the ToString, the serialization of the diagnostics.
What we'll see is something like this: it's actually a JSON that describes exactly how much time each step inside the SDK took, the duration of each network request, things like the CPU measurements on the machine during the time this particular operation was executing, and some other critical information. For example, we measure the serialization of your payload, your model, the type you're using, and how long it took the SDK to convert that object into a stream that can be sent on the wire. We also measure, like I said, the CPU history, and we record the network requests the SDK made. In this example we see the TCP request that reached one of the particular partitions, and here is even the address of that partition, which, if we rewind a bit, is one of the exact addresses we saw in Fiddler. So this is the actual record of the SDK opening the TCP connection, executing the request, and stamping how long that request took to complete. We also have any HTTP requests that were part of the operation; here we see the same things we saw in Fiddler, the requests for the addresses and the other ones. Basically, the diagnostics are everything that happened in the SDK that could explain why this operation took the time it took, or why it failed the way it failed. The same thing is available in Java, so this is valid for both the Java and .NET SDKs, and the amount of information is the same; the API names might vary a bit because of language differences, but you still get the duration, the regions contacted, and the serialization of the diagnostics.

Now that we know this information is available, the biggest question is: as a user, should I store this information on every request? Our recommendation is that it is beneficial to store it in exceptional scenarios, cases where you get an exception and you are not sure what happened; store the diagnostics there. For success cases, we've seen customers conditionally store these diagnostics for requests they want to analyze, for example a request taking more than some amount of time. You can check the duration from the diagnostics, or you might have your own stopwatch or time-measuring method that tells you an operation took longer than you expected, and then you store the diagnostics and use them to analyze why that particular operation took longer. It's a matter of deciding your threshold. Maybe you are providing a service to someone else and have an SLA to enforce on your own service, which in turn uses Cosmos DB, and for some reason you have an operation that took longer than your own SLA; you might want to capture those diagnostics and understand what went wrong on that request. A rough sketch of that kind of conditional capture is below.
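As a rough illustration of that pattern: the threshold, the logging, and the item type here are made up, while the diagnostics members used (Diagnostics, GetContactedRegions, and the ToString serialization) are the ones discussed above:

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public static class DiagnosticsCapture
{
    // Hypothetical threshold: anything slower than this gets its diagnostics logged.
    private static readonly TimeSpan LatencyThreshold = TimeSpan.FromMilliseconds(300);

    public static async Task<ItemResponse<MyItem>> CreateWithDiagnosticsAsync(
        Container container, MyItem item)
    {
        Stopwatch watch = Stopwatch.StartNew();
        try
        {
            ItemResponse<MyItem> response =
                await container.CreateItemAsync(item, new PartitionKey(item.pk));

            watch.Stop();
            if (watch.Elapsed > LatencyThreshold)
            {
                // Success, but slower than our own SLA: keep the diagnostics for analysis.
                Console.WriteLine($"Slow request ({watch.Elapsed}). " +
                    $"Regions: {string.Join(",", response.Diagnostics.GetContactedRegions())} " +
                    $"Diagnostics: {response.Diagnostics}");
            }

            return response;
        }
        catch (CosmosException ex)
        {
            // Failure: always capture the diagnostics attached to the exception.
            Console.WriteLine($"Request failed with {ex.StatusCode}. Diagnostics: {ex.Diagnostics}");
            throw;
        }
    }
}

// Hypothetical document type used by the sketch.
public class MyItem { public string id { get; set; } public string pk { get; set; } }
```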
Those are things you can do to capture the diagnostics, and that is the information we will use in the next couple of slides to start troubleshooting potential issues.

Now, when we talk about errors or failures, it is easiest to split them into two groups: failures that can happen in the client environment, meaning the client machine or the client network, and errors that might happen on the backend, and how to diagnose and troubleshoot each.

The first and most common one is what we call the cold start. If we go back to the diagram we showed earlier with the caches, which is the one now on screen, we know that the first request always takes a bit longer, and the reason is that the SDK needs to fetch that cache information, routing information if you will, in order to open the TCP connection and execute the request. Subsequent requests won't take as long, but the first one might have a higher latency, and this is why.

How can you improve that latency and avoid the cold start? There are a couple of alternatives. Sometimes you can do a phantom request, like trying to read a document that you know doesn't exist, as part of your application initialization. That will populate the caches; the operation will obviously fail, but you can be certain that all those caches are populated. The other alternative, which currently is only possible in .NET, is this; let me just code it here. Let's assume you already have your client configuration available; we'll see what those options are in a bit. The Cosmos DB v3 SDK has an API that comes in very handy: CosmosClient.CreateAndInitializeAsync. I have a helper to get the connection string, but the point is that this call will not only create the client, as if you were creating it from the normal constructor, but will also fetch and populate all the caches required for the particular subset of containers that you want to interact with in this account.

So it's like using the old OpenAsync method on the v2 SDK, isn't it?

Exactly. The v2 SDK has this OpenAsync method, but the problem with OpenAsync is that if you had an account with, say, 50 or 100 collections, OpenAsync would prefetch all the information for all the collections, regardless of what you are going to do in this particular application. So let's say you are creating an application that just uses one of the collections: if you call OpenAsync, that's going to take a long time, open a bunch of connections, and populate a bunch of caches that you probably won't use. For v3 the approach was: don't do an OpenAsync, but have a similar API in which you create a list. I think the API takes a list of tuples, so it's a list of (string, string); let's call it containers, with the using we need up here, and then you do containers.Add, and you put LiveTV, which is my database name, and in this case the container ep13. Put together, it ends up looking roughly like the sketch below.
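Pulling that dictation together, a minimal sketch of the warm-up call being described; the connection string is a placeholder, and LiveTV and ep13 are just the demo's database and container names:

```csharp
using System.Collections.Generic;
using Microsoft.Azure.Cosmos;

// The (database, container) pairs this application will actually use.
List<(string databaseId, string containerId)> containers = new()
{
    ("LiveTV", "ep13")
};

// Creates the client and pre-populates the account, container, and routing caches
// for just those containers, so the first real operation does not pay the cold start.
CosmosClient client = await CosmosClient.CreateAndInitializeAsync(
    "connection-string-placeholder",
    containers,
    new CosmosClientOptions { ConnectionMode = ConnectionMode.Direct });
```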
So I add this container, or all the containers I want to use in my application, pass that list in here, and await it. The idea is that you can call this during the initialization phase of your application, where you're constructing the Cosmos client, and it gives you back not only the CosmosClient instance that you later use for the operations, but it also pre-populates all the caches and performs all those HTTP requests, so your application is ready once the first actual operation comes in. You don't pay the cost of the cold start.

Makes sense. So those are two ways you can work around the cold start. If you're not doing either of those, just know that the first request will have higher latency, and the reason is that we need to fetch the account and routing information, otherwise we cannot do anything in the SDK.

Is that parameter optional? What if I call it with nothing in there; will it initialize everything?

No, that parameter is not optional, and the reason is that in most of the cases where we've seen customers use OpenAsync, or complain about it, they were never interacting with all the containers.

Makes sense.

If customers give us feedback on our GitHub repo that they would like to see an API that just initializes everything ("give me all the containers, that's a business case for me, and I don't want to keep enumerating everything"), then that's something we can add, but it would be great for users to share that feedback on our GitHub repo.

Yeah, and like you said, you could just make a fake call, a point read or something, and that'll basically do the same thing. That's awesome.

Now, the other really common high-latency scenario: let's assume you have a Cosmos DB account with two regions, and since you are building an application that is globally distributed, you also have your application deployed in those two regions, where you want local users to access the data. The concept here is that you want each region to read the information as fast as possible from the closest Cosmos DB replica. What happens normally is that if you create the SDK client just passing your keys, the SDK by default connects to the primary region; if you go to the Azure portal and look at the list of your account regions, it's the first one, and that's what we call the primary region. If you don't specify anything on the SDK other than your keys or your connection string, the SDK will always connect to the primary region. So if you deploy the same code to region 1 and region 2, just passing the connection string, then both deployments will by default connect to the primary region. That's fine in terms of latency for the application running in region 1, which is the primary region, but it's not so great for the application running in region 2, because it's going to see higher latency due to the cross-region network calls. So how do you solve this, or, before that, how do you troubleshoot it?
Normally, what we have in the SDK, which is something I shared before, is that in the diagnostics all the network requests are stamped with their latency, and you can see which region is being contacted, both in the diagnostics JSON itself and in the diagnostics property that lists the contacted regions. You can use that information to understand: okay, this application running in region 2 is actually hitting region 1 when trying to execute this request; something is not right with the configuration, because I wanted this request to be served in region 2.

How do you fix it? Both the Java and .NET SDKs have APIs that let you hint or configure the client with your regional preference. The .NET SDK has two options: one is called ApplicationRegion, and the other is ApplicationPreferredRegions. The biggest difference is that ApplicationRegion is meant for you to tell the SDK which region this application is actually running in. Let's assume I have an application running in West US: I set ApplicationRegion to West US and pass that as configuration to the SDK, and the SDK takes that region, builds a map of all the regions that are geographically closest to West US, and then tries to connect to those regions in your Cosmos DB account. So if your Cosmos DB account has, for example, West US and East US, or Central US, and you pass ApplicationRegion West US, then this client will preferably connect to West US to perform the operations. If you don't have a Cosmos DB endpoint in West US, it will use the closest one available in your account, based on the geographic distance between regions. The other option is defining your own custom preference. The first option populates the preference based on distance, but maybe you have some business scenario where you want a preferred order that doesn't correspond to geographic distance. ApplicationPreferredRegions lets you pass a list of regions telling the SDK which ones it should use whenever it needs to perform operations, and the order basically says: if West US is available, use West US, but if it's not available, then I'd rather use Central US. This is also available in Java, with the difference that Java doesn't have the first option, ApplicationRegion; it only has the list, so you specify a list of one or more regions that you want to preferably connect to. Either way, this tells the SDK which region it should use to execute the operations, instead of just defaulting to the primary region; configured in code, it looks roughly like the sketch below.
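As an illustration, a minimal sketch of the two .NET options just described; the regions are only examples, and you would normally pick one approach or the other:

```csharp
using System.Collections.Generic;
using Microsoft.Azure.Cosmos;

// Option 1: tell the SDK where this instance of the application runs;
// the SDK derives the closest available account regions from it.
CosmosClientOptions byRegion = new CosmosClientOptions
{
    ApplicationRegion = Regions.WestUS2
};

// Option 2: provide an explicit, ordered preference list
// (the first entry is tried first, the next one is the fallback, and so on).
CosmosClientOptions byPreferenceList = new CosmosClientOptions
{
    ApplicationPreferredRegions = new List<string>
    {
        Regions.WestUS2,
        Regions.CentralUS
    }
};

CosmosClient client = new CosmosClient("connection-string-placeholder", byPreferenceList);
```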
Now, the other most common cause of high latency is retries. Let's assume I'm performing an operation, calling the SDK and saying create item, so the SDK performs that request against the Cosmos DB service. Sometimes there are error codes that the service might return to the SDK, for example the well-known 429, which means you are going over the provisioned RUs, the provisioned throughput. When the service returns these error codes (we'll see a small list later on), the SDK retries internally, and it keeps retrying until it succeeds; when the service says the operation succeeded, you have an OK, and we return the result to the caller.

From the application's perspective, in this scenario it might seem that the request itself took longer; if we measure the time it took the whole operation to complete, it took longer than if we hadn't had any retries. So how do we spot these internal retries that the SDK is doing in order to complete the operation with a higher latency? Again, we go to the diagnostics. If we open the diagnostics for this example, we'll see multiple network interactions, each with a potentially different URL or time, along with the status codes of the responses. In this case we'll see that the SDK did perform one request, but the response code was TooManyRequests, which is 429, and then we'll see another request that eventually returned an OK. With these diagnostics we can say: this particular request took longer, had a higher latency, because there were retries. What should I do when I have these retries? It depends on their nature, and we'll see some examples later. In the case of a 429, it might just mean: increase the throughput you have provisioned, because you are exceeding it. You also have the option of using autoscale: if your workload is spiky and not constant, you might want autoscale to define boundaries, so when you do have a spike in traffic you are not hitting these 429s, because autoscale increases the RUs for you to absorb the spike and scales back down as your usage decreases. The other option is to reduce the workload, performing fewer operations, to keep within what your provisioned throughput allows. But the whole point is that the diagnostics give you this insight and explain why that particular request took longer.

The other most common error scenario we see is timeouts. As we've seen in the previous slides, we can have HTTP requests and TCP requests, but this is just a machine in a box, and there are wires that go somewhere, and those wires go through other infrastructure and eventually reach the Cosmos DB endpoint. There are always things that can go wrong in terms of infrastructure. So, if you had to take one thing away from this talk: when we build applications that should be resilient or fault tolerant, we always need to keep in mind that timeouts can happen. Having one timeout in an hour is something that might happen, depending on the whole network infrastructure; it could be a failure on the machine executing the operation, a failure on the wire, a failure in a router somewhere in the middle. There's always the possibility of a failure, so when I build these applications I should take connectivity timeouts into consideration in my business logic. It is expected that one or two, or very few, timeouts happen within an hour or some period of time. What we don't expect is a really high number of timeouts: when we are seeing 10 percent, 5 percent, 50 percent, or in the worst case 100 percent of requests failing due to timeouts, that's when you want to dig in and see what's up, because that isn't expected.
Timeouts that exceed that common or expected threshold tend to have two main sources. The first one is CPU, and you might wonder why high CPU might be a source of timeouts. Well, CPU is a resource, you're executing code, and that code actually wants to put some bytes, some payload, onto the wire. If your CPU is overloaded, the operation of putting that payload onto the wire might take longer than expected, and when that happens, you get a timeout, because we couldn't process that request in the expected amount of time. How do you diagnose a high-CPU timeout? There are a couple of tools. One is, again, the diagnostics: what we do in the SDK is snapshot the CPU usage every 10 seconds and stamp that information into the diagnostics of the operations. You will see something called the CPU history, with a few lines like this; every 10 seconds we stamp the CPU. If you see all these stamps showing 100 percent CPU, 70 percent CPU, or something along those lines, then the timeout is potentially due to CPU being high. Why could CPU be high? It might be a bug in the application, some code that is raising CPU usage above the expected level, or it could be the application processing a higher workload than you normally have: more operations, more resource consumption, and the CPU gets hammered. That is one potential scenario.

The other scenario, which happens mostly on .NET, is something we call thread starvation, and we normally diagnose it also by looking at the diagnostics: when we see the CPU history stamps not happening every 10 seconds, that might be thread starvation. The reason is that the component measuring CPU is totally independent from the SDK itself, so there shouldn't be anything in the SDK blocking that component from taking measurements; but if your application is experiencing deadlocks, that can cause thread starvation, and that can cause this component to take measurements at times when it's not expected. So maybe your CPU is not high, but if your measurements are not every 10 seconds, something is up with your thread pool. What we normally advise in these cases is to look for code in the application that executes async operations by blocking threads; commonly, in the .NET world, this is sync-over-async, which is taking an async operation, like the create item call I had in my code, and doing something like .Result, or .Wait(), or .GetAwaiter().GetResult(). Those are common bad practices that can cause thread starvation, thread exhaustion, or deadlocks, so you want to avoid them when working with any async operation, whether it's the SDK or any other async code in your solution. Those two signals in the diagnostics are what we use to troubleshoot these issues; the anti-pattern and its fix look roughly like the sketch below.
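To make the sync-over-async point concrete, a small illustrative sketch; the container and the item record are assumptions standing in for the demo code:

```csharp
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public static class SyncOverAsyncExample
{
    // Hypothetical document type with "pk" as the partition key value.
    public record Item(string id, string pk);

    // Anti-pattern: blocking on the async call. Under load this can exhaust the
    // thread pool (thread starvation) or deadlock, surfacing as timeouts and as
    // CPU-history stamps that stop arriving every 10 seconds.
    public static void CreateBlocking(Container container, Item item)
    {
        var blocked = container.CreateItemAsync(item, new PartitionKey(item.pk)).Result;
        container.CreateItemAsync(item, new PartitionKey(item.pk)).Wait();
        var alsoBlocked = container.CreateItemAsync(item, new PartitionKey(item.pk))
            .GetAwaiter().GetResult();
    }

    // Preferred: stay async end to end and await the operation.
    public static async Task CreateAsync(Container container, Item item)
    {
        var response = await container.CreateItemAsync(item, new PartitionKey(item.pk));
    }
}
```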
The other obvious check is that, if you have metrics in your environment, maybe because you're running in an Azure VM or Azure App Service, you have CPU measurements there, so you can use those metrics to understand whether you are in a high-CPU scenario. The only comment there is that you always want to look at the max CPU, not the average, because the average tends to hide spikes. So if you have access to those metrics and you are experiencing more timeouts than normally expected, check the max CPU usage on the instances and see whether you are hitting these spikes; that might be the cause of the timeouts.

Now, if CPU is fine and all the measurements are normal, the other possibility on the client side is connectivity timeouts, and this is something you also mentioned earlier, Mark: the famous SNAT port exhaustion. In the SDK diagnostics you will see this as transport exceptions. A transport exception is our way of saying there was an error on the TCP connectivity: these were the addresses and ports we tried to reach, we couldn't connect, and the whole thing is a timeout because we failed to connect. That is the diagnosis we have available. So what can you do from the client side to understand the reason for these timeouts? When we diagnose the client side, or the client's environment, the most common cause is that the user is not using the client as a singleton. Singleton means you have one instance and that instance is reused across the entire application lifetime; you are not creating multiple clients over the lifetime of the application. I've seen users create a new client every time they want to perform an operation. The problem with that approach is that every new client opens its own connections, so if you keep creating clients, each client keeps opening new connections. Imagine the client opens 10 connections every time you create it, and you keep creating clients and clients and clients: at some point you reach the limit on the number of connections you can open, and you're done; any new connection that wants to be opened will time out. Depending on the environment you run your code in, whether Azure VMs, Azure App Service, Azure Functions, Kubernetes, or whatever it is, there might be a limit on the number of connections, and the limit may vary with the machine size: some machines have more connections available than others. So this error might not hit right when you start the application; it might hit a day after the application has been running, or a couple of hours later. The first thing you need to double-check is: is your application maintaining the client as a singleton? If it's not, make sure that it does. The other point, like I mentioned, is that different instances have different connection limits, and some Azure services have limits that others do not. In some environments, like Azure Functions in Consumption mode (this slide has a typo; it should say Consumption mode, not plane), the number of connections might be rather small, so in those cases it may be better suited to use Gateway mode. But that's only after you rule out whether you have a singleton client and whether you are within, or exceeding, a connection limit that is expected for your environment.
In some Azure services, like Azure VMs, if the IP of the machine is private, this limit is enforced; that's the SNAT limit. If you make your IP public, I think that limit is lifted, and if you have a private IP but you are using a VNet with service endpoints, that limit is also lifted. It really depends on the scenario. The first things to check are: do you have a singleton client, and are you in an environment that has limits, and what are those limits? Each environment might have metrics; maybe you can access those metrics and see how many connections you are opening.

The SDK also has a couple of configuration knobs that you can change under certain conditions. For example, if you expect your workload to be not constant but spiky (you perform some operations during the day, you have a busy period, and then the operations die down), setting PortReuseMode to PrivatePortPool on the CosmosClientOptions, together with the idle connection timeout, will make the SDK start to close unused connections after the timeout you specify. That is beneficial if you want a soft control on the number of connections, closing the ones you are not using.

What's the default for the connection timeout if you don't set it?

Forever. The SDK is meant to optimize the scenario where you want the lowest latency, so unless you specify the timeout, it keeps those TCP connections open forever. With that you don't pay the cost of opening a new TCP connection, but you maintain the connections open all the time. So depending on the scenario you might want to set these to different values; the amount of time shown is just an example you can adjust, and I think the allowed range is somewhere between 20 minutes and days. The reason you don't want to set it too low is that you don't want to eagerly close connections you might otherwise have reused shortly afterwards.

And again, I repeat this one because it's the most common issue: I've seen users apply patterns like a repository where they are super sure they are creating singletons for everything, but the client is not one, and the creation is leaking somewhere. So be double sure, triple sure, that you are maintaining a singleton client. The scenario where you might want multiple clients is when your application contacts multiple Cosmos DB accounts: you need two clients because each client needs a connection string, and the connection string is different for each account. That is expected. But if you are working with just one account, there isn't a reason to be creating multiple clients. A sketch of a singleton client with those connection knobs follows.
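As an illustration only, a singleton client with the knobs just mentioned might look like the sketch below; the timeout value is arbitrary and the connection string is a placeholder:

```csharp
using System;
using Microsoft.Azure.Cosmos;

public static class CosmosClientFactory
{
    // One client instance, created once and reused for the whole application lifetime.
    private static readonly Lazy<CosmosClient> lazyClient = new Lazy<CosmosClient>(() =>
        new CosmosClient(
            "connection-string-placeholder",
            new CosmosClientOptions
            {
                ConnectionMode = ConnectionMode.Direct,
                // For spiky workloads: keep ports in a private pool and
                // close TCP connections that sit idle for longer than this.
                PortReuseMode = PortReuseMode.PrivatePortPool,
                IdleTcpConnectionTimeout = TimeSpan.FromHours(1)
            }));

    public static CosmosClient Client => lazyClient.Value;
}
```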
Now that we've covered the most common error cases on the client and how to troubleshoot them, there are also cases when the actual problem is on the service side, and it would be nice to know which cases those are, what is expected, and what actions you can take in those situations. We saw that there can be high-latency scenarios, and that there can be potential problems on the client; there can be retries, and there can be latency on the wire: maybe we put the request on the wire and it simply took longer to complete, and that is the reason for the high latency. So how do we tell the actual Cosmos DB backend apart from the latency on the wire? That's something we expose in the diagnostics: they have a property called BELatencyInMs, backend latency in milliseconds, and it shows what the latency was on the actual Cosmos DB backend. I think I still have an example here: this is one diagnostics example where the backend latency was fine, less than five milliseconds, but the overall latency of the request might have been higher. If I see the request taking longer but the backend latency being very small, then I know the problem is somewhere in the middle: the backend wasn't the issue, and there is a wire in the middle, so maybe it's a problem on the network. In Azure there are support teams dedicated to network issues, so that could be one point of contact. The other scenario is that the backend actually is having a latency problem. How do I diagnose that? Let's say the operation took 305 milliseconds and in the diagnostics I see 300 milliseconds of backend latency: now I know this document read, for some reason, took 300 milliseconds on the backend, and the problem is the backend. I have evidence, and I can open a support ticket to understand what is going on with my account. These same metrics are also exposed in Azure Monitor, in the metrics for the account: I can track the server-side latency for different operation types over time, so I can build monitoring on top of my Cosmos DB account outside my business application. I can monitor these things both in the application and in a monitoring layer, and act accordingly when I see the latency going above what is expected. This latency can also generate timeouts: I can have high latency that doesn't cause a timeout, but when the backend latency goes beyond some threshold it surfaces as a timeout. That is how backend latency can show up in two different ways: either you have requests succeeding with high latency, or you have timeouts, on the same account. So, again, timeouts can happen on the client side because of client-side issues like we have already seen, but timeouts can also happen due to something on the backend, and that can also be service-side latency.

Now, on the service side, the common error responses we've seen map to this table, and what I tried to do in the table is list the error codes, their meaning, and whether they are retried by the SDK or expected to be retried at the application layer.

People should take a screenshot of this; I'll leave it up while you're talking through it, so they can look at these meanings.
Yeah, and like you said, all these slides will be available on the repo for the Live TV content, so you can download them later, click on those URLs, and check it out yourself. The idea here is to go through the common ones, get a quick understanding, and decide whether we, as application builders, should add any logic to handle these situations.

The first one is 400, Bad Request. There isn't really any case where you should retry a 400; it means something is wrong with the payload. It could be that the payload you are sending is not JSON, or has an invalid partition key, or is missing the id or one of the required properties. All of that translates into a Bad Request, and retrying won't make any difference.

401 means invalid authentication: maybe the keys you're using are invalid, or maybe you are using keys you thought were correct but someone rotated the keys on the account, so now you're getting a 401. Retrying a 401 normally won't change anything; your keys are invalid and there isn't much to do about it.

403 means the operation has been blocked by access control: you have some configuration on your account, a firewall or VNet rule, that is filtering requests coming from a particular source. For example, you can configure your Cosmos DB account to only allow requests coming from a subset of IPs, or certain VNets, or private endpoints; if a request comes from outside those known sources, it gets a 403. Why did I say you could maybe retry these 403s? Because when you change any of these policies, it can take something between 5 and 15 minutes to apply. So if your business scenario periodically changes these configurations on the account, such as the allowed IPs or the VNet configuration, you might want to retry on a 403, because it can take a couple of minutes for the newly allowed IPs to be reflected. Normally you wouldn't, but in that kind of scenario you might decide to retry on a 403.

404 means the resource you're trying to read doesn't exist. There isn't any point in retrying here; the resource really doesn't exist.

408 is a timeout. Like I said before, timeouts can happen even in a scenario where everything is perfect. The SDK already retries some timeouts internally for read operations: when you issue a read, whether a read document or a query, and it faces a timeout on the connection, it retries internally a couple of times. If the timeout surfaces to you, it means the SDK's retries were exhausted. You can, and it is advised to, have your own retry mechanism and retry on timeouts, because timeouts can happen, and if you are building a resilient application you should keep retrying on them. The only caveat, the only scenario where you might decide not to retry, is for write operations.
Imagine you're doing a write: you put the request on the wire, it travels toward the account, and on the SDK we get a timeout because we didn't get the response in the expected time. The SDK won't retry a timeout on a write, and the reason is that we don't know whether the payload reached the backend or not: maybe the request timed out in the middle of the wire, or maybe it reached the backend but the problem was in receiving the response. In your application you might decide to retry the write yourself, but be advised that there are two possible outcomes: one is that the write actually succeeds, because the timeout happened in the middle and the request never reached the backend; the other is that you get a 409, a conflict, because the document you just created did actually hit the backend and the timeout was on receiving the response, so when you retry the write, the backend sees that the document already exists.

Excellent, so you're going to want to handle those status codes, basically.

Exactly. You can retry on reads, no problem; you can retry on writes, but if you're retrying writes, expect that there could be some 409s.

And it should go without saying: if you're writing apps in the cloud across a WAN, you need to have some sort of retry. Everybody should be using Polly in their applications when making requests across the WAN; this is the minimum stuff you have to do, because WANs are flaky. Anything can happen between the time a request leaves your SDK on your compute instance and when it hits either our gateway or our backend.

Exactly. This is a distributed system, and you're writing a distributed application, so you need to keep that in mind and write code that is defensive against any potential failure in any of the components.

The other code to know is 412. A 412 happens with what we call optimistic concurrency. Not everyone uses optimistic concurrency, but it lets you do a replace or an update of a document while enforcing that you don't overwrite someone else's change. Let me rephrase: let's say you have an application that might be updating the same document from multiple places at the same time, concurrently, and you want to enforce that any change you make does not overwrite the change that another part of your application, or another application, just made to the same document. The optimistic concurrency pattern is: you read the document, you capture the ETag property of the response, you apply your changes to that document, and you issue the update passing the ETag. If there were no other changes to the document between the time you read it, applied your update logic, and sent the update, that operation will succeed; but if some other concurrent operation overwrote the same document in the meantime, you will get a failure with this particular 412 status code. What you should do then is retry the operation: do another read of the document to get the latest state, capture the ETag again, apply your update logic, and retry the update. That's why you should still retry on a 412; the loop looks roughly like the sketch below.
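A minimal sketch of that read, modify, replace loop with the ETag check; the document type, the counter property, and the retry budget are assumptions for illustration:

```csharp
using System.Net;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public static class OptimisticConcurrencyExample
{
    // Hypothetical document type; "pk" is the partition key value, "count" what we update.
    public class CounterDoc { public string id { get; set; } public string pk { get; set; } public int count { get; set; } }

    public static async Task IncrementAsync(Container container, string id, string partitionKey)
    {
        for (int attempt = 0; attempt < 5; attempt++)   // arbitrary retry budget
        {
            // 1. Read the current document and capture its ETag.
            ItemResponse<CounterDoc> read =
                await container.ReadItemAsync<CounterDoc>(id, new PartitionKey(partitionKey));
            CounterDoc doc = read.Resource;

            // 2. Apply the update logic.
            doc.count++;

            try
            {
                // 3. Replace only if the document hasn't changed since the read.
                await container.ReplaceItemAsync(
                    doc, id, new PartitionKey(partitionKey),
                    new ItemRequestOptions { IfMatchEtag = read.ETag });
                return;
            }
            catch (CosmosException ex) when (ex.StatusCode == HttpStatusCode.PreconditionFailed)
            {
                // 412: someone else updated the document concurrently; read again and retry.
            }
        }
    }
}
```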
Yep. 413 is basically exceeding the maximum document size. There's no retrying there; if I remember correctly we have a two-megabyte document limit, so if you are trying to store something really big, this will fail and no retry will save you.

Yeah, the only thing you can do is shred the document and then delete the original.

429s are throttling. The SDK already retries these internally, up to a particular number of times that you can change using the CosmosClientOptions. Normally we retry, and if a 429 error code or exception surfaces to you, it means the SDK already exhausted all its internal retries and is throwing the error back to you. You can either increase the configuration to retry more times, or have your own retries, like you mentioned, Mark, with Polly, and decide to keep retrying on 429s because you want the operation to succeed regardless. Ideally, what I would do is measure these 429s and see how many I am getting, because ideally I wouldn't like to have any. In the cases where I'm getting too many, it's time to ponder: is my provisioned throughput enough for my workload? Should I reduce my workload, use autoscale, or increase the throughput? Those are things to evaluate if I'm hitting too many 429s.

They're okay once in a while, because at least you know you're saturating the throughput you've provisioned, so you don't want to avoid them at all costs. But you're right, there is a tipping point where something needs to be addressed, whether that's increasing throughput or looking at how I'm accessing my data and how I'm doing these operations.

Exactly. The other code is 449, retry with. It is not common, but it happens when you have multiple write operations, mostly using stored procedures, on the same resource concurrently. The SDK again retries internally on these, but you might also want to retry if you are building your own retry layer; it's safe to retry on those.

We have two left. One is 500. 500s are kind of the weird ones: normally this means something is really wrong, and you don't really expect it. Some users have asked me whether they should retry on a 500, and the answer is: maybe. The problem with a 500 is that, from the SDK's point of view, we have no clue what's wrong; we only know something is not right on the backend. Retrying might help. I would say maybe one retry on a 500, because maybe that 500 is something temporary, and we are trying to build resilient applications, so one retry, or a very limited number of retries, might help in the case where the 500 is something transient, like a bird just hit the wire.

That's why you have things like Polly, because god knows what that was, right? I've seen this: what the heck was that? And then we retried and it's like it was never there.

Right, exactly. So one, or a very limited number of retries, is actually just fine.
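Circling back to the 429 case: the retry knobs being referred to live on CosmosClientOptions. A sketch, with arbitrary values, of letting the SDK absorb longer throttling bursts before a 429 surfaces to the application:

```csharp
using System;
using Microsoft.Azure.Cosmos;

CosmosClientOptions options = new CosmosClientOptions
{
    // How many times the SDK retries a throttled (429) request internally
    // before surfacing the error (the default is 9 attempts).
    MaxRetryAttemptsOnRateLimitedRequests = 20,                 // arbitrary example
    // Maximum cumulative wait across those retries (the default is 30 seconds).
    MaxRetryWaitTimeOnRateLimitedRequests = TimeSpan.FromSeconds(60)
};

CosmosClient client = new CosmosClient("connection-string-placeholder", options);
```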
Again, we retry internally, but you can also add your own retry on top. Now, going back to what I said about timeouts: having one, two, maybe a couple of these in a fixed amount of time might be expected, we are building a distributed application, but if you are seeing 50%, 20%, even 10% of your requests failing with timeouts or service unavailable, then something is wrong and we need to start diagnosing it. And talking about availability, the last topic I wanted to discuss is what happens on regional failures. We know what might go wrong inside one region regarding connectivity, but when we start to span multiple regions, how does the SDK behave in terms of failovers? If a region goes totally offline, what happens? Will the SDK try to save me, or save my application, by doing something? Okay, let's see. Let's assume I have an account with regions A, B and C, and I have an SDK client set with a regional preference, like we've seen in the earlier slides, of regions A, C and B; note it doesn't map exactly to the list of regions in the account in the same order. I start my application, my application connects to region A, and it's performing reads and writes without issue. But something happens: either I go to the Azure portal and remove the region, or, let's hope not, the region goes completely offline. Now, this region going offline can surface on the client in two different ways, depending on whether the SDK is doing operations while the region is going offline, or the SDK is trying to perform an operation once the region is already down and unreachable. If I'm performing an operation while the region is offlining — I went to the Azure portal, removed the region and clicked the save button, so the region is starting its offlining process — the SDK gets one particular response from the back end, a 403 status code with a sub-status code of 1008, that tells the SDK: this region you are trying to contact is offline, is no longer available, or is becoming unavailable. The other possibility is the SDK trying to perform an operation on a region that went offline and is completely down or removed; after some time we won't be able to resolve the DNS, so the regional DNS endpoint for that account is not available. In any of those cases, what the SDK does is contact the gateway: hey, gateway, give me the account information, tell me the topology, which regions are available. If I just removed region A on the portal, that will return regions B and C. If region A is still part of my account but there was a regional failure, that region will be marked as unavailable on the SDK, so the SDK won't see it anyway; either way, for the SDK it's going to be regions B and C. Since my regional preference says region C is next when region A is not available, the SDK will connect to region C and retry that particular operation, and all subsequent operations from that point in time forward. So there isn't anything I should do on the SDK itself; the SDK will take care of detecting the failure and reacting accordingly, when provided with the correct configuration and information.
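That configuration is the regional preference on the client. A rough sketch of both options follows; the region names and connection string are placeholders, and in practice you would pick one of the two properties rather than both:

```csharp
using System.Collections.Generic;
using Microsoft.Azure.Cosmos;

// Create one client for the lifetime of the application (singleton) and reuse it.
CosmosClient client = new CosmosClient(
    "<connection-string>",
    new CosmosClientOptions
    {
        // Option 1: say where the app runs and let the SDK order the account's
        // regions by proximity for failover.
        ApplicationRegion = Regions.WestUS2,

        // Option 2 (instead of ApplicationRegion): spell out the failover order yourself.
        // ApplicationPreferredRegions = new List<string> { Regions.WestUS2, Regions.EastUS2 },
    });
```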
Now, there could be, hopefully, a time when region A goes back online. Because we are going to region C in the meantime, maybe we are paying more latency; these are still two different regions, so operating on region C might have a bit more latency than operating on region A. While that gives us availability, because our operations are now succeeding, it's not great from the latency perspective. So what the SDK does is, every five minutes, keep asking the gateway for the state of the account, the topology of the account, to learn that maybe region A is back up, is available again. Once the SDK detects, after one of those gateway interactions, that region A is back online — sorry, I forgot to grab some water — here, I'd give you some of mine — once it detects that region A is back online, because region A is higher on the preference list than region C, it will switch all the traffic from that point onward back to region A. So in a scenario where there was a regional failure, region A went offline and we went to region C to keep providing availability, there is a point where, without any user interaction, when region A comes back up, or if I went to the Azure portal and re-added region A, the SDK will reroute all traffic back to region A based on the regional preference I set on client creation. That's why it is super important that, either through ApplicationRegion or ApplicationPreferredRegions, you are describing that regional preference for your account to the SDK; otherwise the SDK won't really know whether you would rather stay on region C or prefer going back to region A. Now, there is another scenario which is particular to single-master accounts. Single-master accounts, or single-write-region accounts, only have one region that acts as the write region; the others are read replicas. So let's assume again I have A, B and C, and A is my single write region: all my writes go to region A, and because I'm setting the preference to region A, all my reads also go to region A. Now I go to the Azure portal and, for whatever reason, I do a failover of the write region: I say I don't want region A to be my write region anymore, I want it to be region B. What happens is, once that process completes, when the SDK tries to perform a write on region A, the back end will return an HTTP 403 error that the SDK understands as: this region you are contacting, while it is a valid region, cannot serve write requests. So what the SDK will do is again contact the gateway, discover the new write endpoint, and redirect that write request to the new write endpoint. Note that region B in this case is not the second one on my preference list, but because this single-write-region account doesn't have any other write region, there isn't any alternative, so my writes will, from that point onward, go to region B. But my reads can still honor the original preference; they can still be served from region A, because region A is still a read region. Now, another scenario is session consistency.
We know that Cosmos DB has five different consistency levels, each one with its pros and cons, and users should pick the consistency that maps to their business requirements. One of them is called session consistency, and the tenet of session consistency is that you should always be able to read your own writes. So what happens in this scenario? I have my writes going to region B and my reads going to region A. What happens on the back end is, when I do a write on region B, that write is replicated to region A. Okay, one more slide. Let's say I do an operation on region B and that operation has a transactional identifier of 2. I'm not saying 2 is literally what we use to identify transactions, but it simplifies the scenario. So assume I write a document, that document has transaction number 2, and the back end in region B returns to the client what we call the session token, which identifies the latest transaction you just did, with a numeric transaction of 2. The SDK says, okay, my session is now 2, and when it tries to read the document from region A it will tell region A: hey, I want to read this document, but I want to read it in the scope of session 2. The problem is that region A might not have received the replication of that operation from region B yet — we cannot beat the speed of light yet — so maybe my read is faster than the speed at which the replication is happening. In that case region A tells the SDK: this document doesn't exist in this region yet. That is a 404 with a particular sub-status code, 1002. When the SDK receives this particular error code, it goes and retries that read operation on region B, because it knows for certain that, even though that might not be your preferred region, the document exists in region B. So it retries the request in region B and serves the result. Any subsequent read requests will still go to region A; only this particular request that got this particular error code gets retried on region B. And if we go to the diagnostics, we can see the requests going to the different regions with the different error codes that were involved. So the point here is, if we had a high-latency request and we want to investigate why it has high latency, this is one of the potential scenarios: we could be hitting a replication delay between the regions, and the SDK is just retrying on the other region, getting the result for you so your application keeps its availability.
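One way to catch those cross-region retries and replication delays in your own telemetry is to log the diagnostics whenever an operation exceeds a latency threshold you choose. This is only a sketch: `GetClientElapsedTime` is available on recent 3.x releases, and the container, id, partition key and `TodoItem` type are the same hypothetical placeholders as before:

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public static class LatencyDiagnosticsExample
{
    public static async Task<TodoItem> ReadWithDiagnosticsAsync(Container container, string id, string pk)
    {
        ItemResponse<TodoItem> response = await container.ReadItemAsync<TodoItem>(id, new PartitionKey(pk));

        // Arbitrary threshold for the sketch; pick whatever "too slow" means for your app.
        if (response.Diagnostics.GetClientElapsedTime() > TimeSpan.FromMilliseconds(500))
        {
            // The diagnostics string lists every request the SDK issued, including
            // any retry that went to a different region.
            Console.WriteLine(response.Diagnostics.ToString());
        }

        return response.Resource;
    }
}
```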
I've got a question for you. The SDK is what handles consistency for bounded staleness and strong, right, by doing the two-replica reads? Is that correct? It's checking the LSNs to make sure they match; that's how we guarantee strong consistency, or at least in-region strong for bounded staleness. But let's leave that aside — actually, no, let's talk about bounded staleness, because strong consistency is global quorum; there's a different mechanism that forces that replication, so that's back-end driven, not client-side driven. Correct. But this is a good scenario, so let's walk through it again: I write into region B and I'm using bounded staleness, so I'm going to do a two-replica read off region A, and I'm likely to get a mismatch on both replicas, or at least on one of them — or let's just start with both: none of the data is replicated yet. Am I going to get that same error code? No. This error code is particular to session consistency, and the reason is that when you set bounded staleness on the account, you are acknowledging that your reads can lag by some particular lag window or delta. That makes sense, because bounded staleness only provides in-region strong reads, and we're not doing in-region reads here, we're doing out-of-region reads. Exactly. So in that case, if you are using bounded staleness, hopefully you are building an application that, when performing reads, knows it can get a plain 404 from the other region, and that might be expected; in that case maybe you want to retry those 404s yourself, because that's within the bounded staleness configuration you chose. Now, when you're on session consistency, that's where the SDK kind of takes over, because session consistency has the tenet that, if I'm using the same SDK client, I should be able to read my own writes regardless of where I'm reading from. That's when this different error code appears and when the retry mechanism on the SDK kicks in, in order to honor session consistency. Makes sense. That's the fundamental difference between session consistency and all the others: the other consistency models are all data driven, and session is unique in that it's a client-driven consistency model, hence the session token you're passing around between requests. So yeah, that makes sense. And in cases where you have different components in your application, where each component has a different client instance, what you want to do is pass the session token around. The session token is part of the response of your operations, so if you want, for example, one component to write the document and another component to read it, but that other component has a different client instance, you want to pass the session token from one to the other, so the second one performs the read passing the session token from the client instance that did the write. That's how you extend the session scope outside your client instance.
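A sketch of that cross-component handoff, assuming two components that each own their own client; the container variables, the `TodoItem` type and the transport used to move the token between components are all placeholders:

```csharp
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public static class SessionTokenHandoffExample
{
    // Component 1: writes the item and returns the session token from the response.
    public static async Task<string> WriteAsync(Container writerContainer, TodoItem item)
    {
        ItemResponse<TodoItem> created = await writerContainer.CreateItemAsync(item, new PartitionKey(item.pk));
        return created.Headers.Session; // hand this to the other component (queue message, header, etc.)
    }

    // Component 2: a different CosmosClient instance reads within that session scope,
    // so it is able to read the write made by component 1.
    public static async Task<TodoItem> ReadAsync(Container readerContainer, string id, string pk, string sessionToken)
    {
        ItemResponse<TodoItem> read = await readerContainer.ReadItemAsync<TodoItem>(
            id, new PartitionKey(pk),
            new ItemRequestOptions { SessionToken = sessionToken });
        return read.Resource;
    }
}
```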
Are you saying, basically, making the session token durable outside the instance? Exactly. Okay, I was just going to say, I've talked to customers and others about doing that; I think the only thing that could cause issues is that in high-concurrency situations you could end up with contention on that single resource, which can add latency, because the clients are going to be retrying as they try to do sets or reads on the session token. So you could basically be working against yourself, and in a high-concurrency scenario you may be introducing latency if you try to do that. Yeah, the thing is, either you do that, or you accept that a different client instance has a different session scope. Yeah. And outside of that, on a different session scope, you can expect a normal 404, because that other client has no recollection that there was a write originally and no information about the transaction identifier of that write. This is kind of a blind read, let's call it that, and a blind read is fine: either the document is there or it isn't. Yeah, and if I'm doing that blind read on region A without the session token, in this case I will just get a plain 404, and there is no retry on the SDK side. And finally, the last one is the 503s, which are temporary connectivity problems, like you mentioned before. 503s mean TCP connectivity issues, too many timeouts, cannot connect, so we surface the 503. Now, what happens on that particular operation: let's say I did the write as operation 1, then I try to do the read and the read fails with a 503. What the SDK will do, just for this particular read operation, is retry it in the second preferred region, in order to provide more availability. So if my preference is regions A and C, it will go to C and retry the read, and hopefully succeed, because hopefully this was just a hiccup in the wire to region A. If the retry succeeds, the next reads will just keep going to region A. So the SDK retrying on 503s is what we call a request-scoped retry, meaning only this request gets retried, not the other requests, unless those other requests also hit 503s. This is another reason for you to actually populate your regional preference, because otherwise this particular retry cannot be done on the second region. If you are not populating the regional preference, we cannot retry on a 503: you get the 503 and you decide what to do, whether you want to retry yourself or not. But if we know your regional preference, we can do one retry to see whether we can save the operation and provide availability from the SDK itself. You don't have to set a full preference list, do you? I mean, you can set ApplicationRegion, in which case we'll derive the ordered list of next-closest regions and fail over appropriately, or you can say: no, I want you to fail over in exactly this order. Right? At least choose one of them, right? Exactly. Okay, going back — I think there was a slide here — I think what you're saying is: don't not do that; do one of these. Yeah, whichever is best for your scenario, but please do one of those. For your own sake as an application developer, because if we have that information on the SDK, we will try to provide you with the best possible availability; if we are kind of blind, it makes the whole thing much more difficult, much harder for us to help you. Right. So, in conclusion, the things I want you to take away from this presentation: please, always use the client as a singleton; make sure you are not creating multiple clients if you want to avoid connectivity issues, SNAT port or connection exhaustion. Define your regional preference when instantiating the client, which is what we were just saying a minute ago: please use ApplicationRegion or ApplicationPreferredRegions to hint to the SDK which regions you want to connect to. And again, there can always be timeouts, so you need to define some sort of retry layer in your application for some of the scenarios we already talked about, particularly on timeouts, because this is a distributed compute system and things can go wrong in any component in the middle.
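For the retry-layer takeaway, here is one possible shape using Polly. The status codes, delays and retry count are illustrative choices rather than official guidance, and, as discussed earlier, blindly retrying writes can surface 409s, so this sketch only wraps a read:

```csharp
using System;
using System.Net;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;
using Polly;
using Polly.Retry;

public static class RetryLayerExample
{
    public static AsyncRetryPolicy CreateTransientRetryPolicy() => Policy
        .Handle<CosmosException>(ex =>
            ex.StatusCode == HttpStatusCode.RequestTimeout ||      // 408
            ex.StatusCode == HttpStatusCode.ServiceUnavailable ||  // 503
            ex.StatusCode == HttpStatusCode.InternalServerError)   // 500: keep this very limited
        .WaitAndRetryAsync(
            3,
            attempt => TimeSpan.FromMilliseconds(200 * attempt),
            (exception, delay, attempt, context) =>
            {
                // Capture the SDK diagnostics so the failure can be investigated later.
                Console.WriteLine($"Retry {attempt} after {delay}: {((CosmosException)exception).Diagnostics}");
            });

    public static Task<ItemResponse<TodoItem>> ReadWithRetriesAsync(Container container, string id, string pk) =>
        CreateTransientRetryPolicy().ExecuteAsync(
            () => container.ReadItemAsync<TodoItem>(id, new PartitionKey(pk)));
}
```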
Capture diagnostics on errors and on high-latency scenarios. If you are measuring the latency of your operations with your own stopwatch, or by using the elapsed time on the diagnostics, capture those diagnostics for the operations that go over some limit; you can use that information to troubleshoot, or, when you're in a support incident, we can use that information to troubleshoot for you. And on errors, I don't see a reason why you wouldn't capture the diagnostics. Regarding regional failover, which is a very common question — what happens to my application when a region goes down? — review the documentation we have available; it basically goes over all the scenarios I mentioned in the slides. If you have any questions in that area, ask us and we'll make sure to improve the documentation with any feedback we receive. I wrote a bunch of that documentation, and rewrote it, and rewrote it; it's hard trying to explain that stuff in a way that's clear for people. You want to be as concise and as clear as possible about what the behavior is when a distributed database is going through a failover: are your applications going to behave correctly? That's why people use Cosmos, because we can survive that kind of thing. Exactly, and it's all live documentation: as the service improves and the libraries evolve, we can improve that information. There might be scenarios we aren't clear about, or that are missing from the documentation, and you as a customer bring the question, we answer it, and we say, I'm sure this can help someone else, so we add it to the documentation. Absolutely; we're always trying to improve our docs and make things clearer. With that, just a couple of comments and questions here. Someone is saying thank you for such a detailed session — I agree, this has been quite a session. Someone else asks: are you guys planning on updating the examples on GitHub? I guess that's all the SDK samples we've got there; those have been sitting around for a little while, so maybe we could, I don't know. Well, we were actually discussing in our SDK scrum about adding some samples for these retry logics that applications could use, maybe using Polly, so there's something to start from: if I'm starting from scratch, do you have anything I can use as boilerplate? If I use this sample code you're providing, it covers the most common retry scenarios and I can add on top of that. That's one of the things we are discussing, along with adding more documentation that explains all the different scenarios — when you should retry and when it doesn't make sense. I think there are a lot of good docs on Polly and such that we could pull into our material and create something like: here's what you need to do before you go into production. One would be implement retry; maybe implement some sort of logging with the diagnostics, I don't know. It's kind of a slippery slope: okay, I'm going to capture diagnostics, and then I need to somehow persist those somewhere, and now you're kind of outside the scope of the SDK itself. Oh, another question here: .NET SDK 4.0? For what, sorry? Ah, the .NET SDK 4.0. So, about v4: right now it's experimental. The only reason you might want to attempt to use v4 is because you want to use System.Text.Json, and one thing users often don't know is that you can use System.Text.Json with the v3 SDK.
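The hook for that is the `CosmosSerializer` abstraction: you implement it on top of System.Text.Json and plug it into `CosmosClientOptions.Serializer`. A minimal sketch; the sample Matías refers to is more complete, and things like attribute mapping and error handling are left out here:

```csharp
using System.IO;
using System.Text.Json;
using Microsoft.Azure.Cosmos;

public class SystemTextJsonCosmosSerializer : CosmosSerializer
{
    private readonly JsonSerializerOptions options;

    public SystemTextJsonCosmosSerializer(JsonSerializerOptions options) => this.options = options;

    public override T FromStream<T>(Stream stream)
    {
        // Stream-based APIs expect the raw stream back untouched.
        if (typeof(Stream).IsAssignableFrom(typeof(T)))
        {
            return (T)(object)stream;
        }

        using (stream)
        using (MemoryStream memory = new MemoryStream())
        {
            stream.CopyTo(memory);
            return JsonSerializer.Deserialize<T>(memory.ToArray(), this.options);
        }
    }

    public override Stream ToStream<T>(T input)
    {
        MemoryStream payload = new MemoryStream();
        using (Utf8JsonWriter writer = new Utf8JsonWriter(payload))
        {
            JsonSerializer.Serialize(writer, input, this.options);
        }

        payload.Position = 0;
        return payload;
    }
}

// Wiring it up:
// CosmosClient client = new CosmosClient("<connection-string>", new CosmosClientOptions
// {
//     Serializer = new SystemTextJsonCosmosSerializer(new JsonSerializerOptions())
// });
```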
In fact, I have a recording about that — I don't know if we can share links right now; let me see if I can find it, hold on. If not, I can just add it to the folder in the repo, alongside the other links and the slides for this presentation, and the one on bulk and transactional batch operations. That's not it — no — but anyway: there is no ETA on v4 right now, because we are focusing on hardening and improving v3 to be the most solid base for when we do go to v4. So if your question is do we have a date: no, we don't. I would ask you what your reasons are for looking at v4; if you're looking at v4 for System.Text.Json, we can leverage System.Text.Json on v3. I will make sure to post some links on Twitter — my handle is @ealsur — so just ping me there and I'll make sure to send them. Yeah, and I'll retweet it as well; if you hashtag it with Azure Cosmos DB TV I will see it and retweet it. So you can watch that and figure out how you can use System.Text.Json. The other case for v4 is maybe you want to use IAsyncEnumerable in .NET; that is something we are also looking at adding on v3, so the transition from v3 to v4 will be much simpler — so that would be something you would also have on v3. I don't see any other scenarios where you might want to use v4 right now; it's still there, but it's experimental and not currently recommended for production usage. Yep, definitely not ready yet. Great. Well, let's see — yes, we are working on this, that's right. If people are interested in seeing Matías's deck from today's presentation, which has all of those pretty awesome aka.ms URIs for all the HTTP responses, you can go to this URL here; just look for the folder with episode 13, Advanced .NET SDK, and you can get those there. I think I've got one more question — maybe not, just someone saying "sweet, cheers," so good. Well, fantastic; this was quite an amazing episode this week, Matías, thank you so much for joining us. Thank you very much, thank you for the opportunity, for me it's a pleasure. That's great. Well, everyone, thanks for joining us this week. Next week we're going to talk about our new continuous backup and point-in-time restore feature; I think we announced this at Ignite — winter Ignite, so that would have been back around February. So come and check it out; you're going to join a teammate of ours, Gomen, and he's going to talk all about this new feature. So, from myself and Matías here, thank you very much for joining, and we'll see you next time. Bye bye. [Music]
Info
Channel: Azure Cosmos DB
Views: 1,133
Id: McZIQhZpvew
Length: 99min 43sec (5983 seconds)
Published: Fri Jun 04 2021