Troubleshooting Common Scenarios with Always On - MS SQL Tiger Team

Video Statistics and Information

Captions
John, is my slide deck visible? Yep, you are good to go. Okay, great, thank you. So today I'm going to talk about some common scenarios with Always On availability groups. The most common scenario that everyone faces is trying to pin down what latency is plaguing your availability group. If everything is working fine you don't need to be bothered about it, but during a reactive situation that is one of the most common reasons Microsoft engineers are asked to assist our customers. The second most common reason why customers call us regarding availability group issues is associated with failovers, or the causes of failures, so that's the other thing I'm going to cover today.

Before I go ahead, let me quickly introduce myself. I work as a senior program manager with the Microsoft SQL Server product group, the Tiger team, and I've been with Microsoft for nearly a decade. In the past, at this particular virtual chapter, I did another session on some of the supportability enhancements that my team shipped for troubleshooting the common issues I'm going to talk about today; I'm going to briefly touch on them just to set the context. Feel free to ask any questions over the chat and I'll be more than happy to answer them.

So let's talk about latency in an availability group. If everything's running fine, you're good, but if it's not, it can be daunting to troubleshoot latency during an outage or even during a slowdown. All of you have heard the adage "slow and steady wins the race"; when it comes to availability groups that's not really true — it's pretty much "fast and steady wins the race". As long as you have a steady flow of traffic which is predictable, which performs at a steady pace, and which is within your SLA, you know it is going to be business as usual. The moment the pace drops, that's when you need to be concerned.

There are two ways to look at an availability group. The first is the dashboard in Management Studio, which gives you a fairly decent view of the availability group itself. It shows you what the availability group can do in terms of throughput, and it gives you the status of the availability group, the log send rate, and the redo rate. There are some nuances about the dashboard which I'll talk about when I'm going through my demo, but one important thing to keep in mind is that if you are looking at the log send rate and are confused about why it's showing values which don't match your throughput, then either you're missing the particular KB that you see on your screen right now, or you need more information about your log generation rate. We will talk about how to compare the log send rate and the actual log generation rate on your SQL Server instances when I'm walking through the demo.

Now, the important thing to keep in mind is that when you have more than one replica, the amount of sending that the primary replica has to do is more than one stream. And when we are measuring the log send rate in Management Studio, it's a time-based average, whereas the counters — which a number of the monitoring tools you have either purchased off the market or built on your own using perfmon counters and DMVs track — measure the throughput at a per-second interval. So when you're comparing the dashboard, which uses the DMV, and the perfmon counters, there might be a difference, which can be confusing at times.
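[Editor's note] Where the talk refers to the DMV behind the dashboard, the same numbers can be pulled with a query along these lines — a minimal sketch against sys.dm_hadr_database_replica_states; treat the rates it exposes as averages over the active send period rather than per-second values:

    -- A rough equivalent of what the SSMS dashboard reads: per-database replica state
    -- from the Always On DMVs; rates and queue sizes are reported in KB / KB per second.
    SELECT  ag.name                        AS ag_name,
            ar.replica_server_name,
            DB_NAME(drs.database_id)       AS database_name,
            drs.synchronization_state_desc,
            drs.log_send_queue_size,       -- KB waiting to be sent
            drs.log_send_rate,             -- KB/sec, averaged while actively sending
            drs.redo_queue_size,           -- KB waiting to be redone on the secondary
            drs.redo_rate,                 -- KB/sec redo, averaged
            drs.last_sent_time,
            drs.last_commit_time
    FROM    sys.dm_hadr_database_replica_states AS drs
    JOIN    sys.availability_replicas           AS ar ON ar.replica_id = drs.replica_id
    JOIN    sys.availability_groups              AS ag ON ag.group_id  = drs.group_id;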
Before I go further, let me talk about SQL Server 2016, and how your throughput, if it's not performing optimally, can result in latency in your environment. In SQL Server 2016 there has been a big improvement in the availability group transport layer. We now use about eight worker-thread context switches to transport one log block, which means that the moment you do a transaction and log is generated on the primary replica, we can transport it to the secondary replica with roughly 50 percent fewer context switches — a huge gain in throughput. With the changes we have made in the log transport layer and in applying the log on the secondary for redo, we've been able to achieve about 95 percent of the standalone transaction log throughput with a single synchronous secondary.

Now, if you look at the graph on the right-hand side of the slide, you'll see that the blue line corresponds to a standalone replica which is delivering transactions at a certain rate. The orange line, a synchronous replica without encryption, is able to keep up for the most part at the same pace as the standalone instance functioning without an availability group configuration. What this says is that in 2016 your OLTP workloads would perform as optimally as possible, up to 95 percent of the transaction log throughput compared to when availability groups were not configured. If you compare that with SQL Server 2014 with a synchronous replica without encryption, the throughput dropped quite a bit as the number of concurrent workers increased, and that can be challenging. If you compare the extreme ends of the spectrum — say 400-odd concurrent threads running — the difference (let me use my laser pointer here) between the peak of the standalone replica and the peak of the synchronous replica on SQL Server 2014 tends to be quite large, and that could definitely lead to perceived latency issues: if you were functioning at nearly 300 megabits per second and you moved to about 88, you would obviously start raising some alarm bells. Compare that with synchronous replicas with encryption — the gray line — and we're still able to get pretty good throughput even though we're using encryption on a synchronous replica. So in summary, SQL Server 2016 performs much better from a log throughput and performance standpoint, and if you're evaluating SQL Server and looking to deploy availability groups, I would recommend SQL Server 2016.

Moving on — if you attended my previous session at this virtual chapter, I presented a view like the one you see on your screens right now, which shows you the primary replica's commit time along with the statistics from the secondary replica, and that allows you to track down latency with a fair degree of accuracy. What that is going to do is allow you to track how much time it takes, within a time period, for the primary replica to actually commit, and if it is running in synchronous-commit mode, you can also track how much time it takes to wait for the secondary replica to send its response. All of this is a culmination of the troubleshooting enhancements that we shipped in SQL Server 2012 Service Pack 3 and SQL Server 2014 Service Pack 2, and in a future update for SQL Server 2016 we will have these extended events there as well.
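[Editor's note] A quick related check, not shown in the talk itself: on a synchronous-commit primary, the time spent waiting for the secondary's acknowledgement accumulates under the HADR_SYNC_COMMIT wait type, which you can sample from the wait statistics DMV:

    -- Average time the primary waits for remote harden acknowledgements (sync commit).
    SELECT  wait_type, waiting_tasks_count, wait_time_ms,
            wait_time_ms / NULLIF(waiting_tasks_count, 0) AS avg_wait_ms
    FROM    sys.dm_os_wait_stats
    WHERE   wait_type = N'HADR_SYNC_COMMIT';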
Any questions so far? Okay, go ahead. If I drill down into the bottom half of the table, which is showing the hard numbers: the commit time tells you how much time it took for the primary replica to commit a transaction; the remote harden time tells us the average time it's taking for the secondary replica to commit — or, put differently, how long the primary replica has to wait to receive a hardened response from the secondary replica. On the secondary replica's side, it tells us how much time it's taking to apply a transaction and how much time it takes to receive it. If I look at these numbers, it's very evident that the secondary replica is applying things pretty quickly and receiving them pretty quickly as well, so there is not really a lot of latency on the secondary replica. The primary replica is sending data across pretty fast, but it seems to be waiting for the remote harden messages to come through. So if the primary is doing its work the way it should be, and the secondary is doing the same, then the third component in the picture is the network — and when I captured this particular data snapshot I was introducing latency on the network using a tool that applied a 50 millisecond delay to every TCP packet being transferred, which is why the network latency is showing up. So just by talking through the slide, I was able to drill down into which part of the Always On availability group topology was actually causing the slowdown.

Let's move on to the next slide. These are the perfmon counters that we added, and as you can see, you get a lot more information from them: your compression rate, your compression cache rate, the decompression rate, and whether flow control is being applied. Whenever you're troubleshooting slow performance for your availability groups, one of the things to look at is flow control, which means so many messages are being generated on the primary that the Always On replica has decided to halt or throttle the transactions. Whenever flow control kicks in, that is a sign that your instance is not able to keep up. You also get additional insight into encryption and decryption rates, and information about the send and receive flow control at the transport layer.

Other than that, there are also a bunch of extended events, which we were looking at in the previous slide — the graph was built using all of these extended events. The compression and decompression times come from hadr_log_block_compression and hadr_log_block_decompression, the send time is measured using hadr_log_block_send_complete, the time it took to apply the log block comes from hadr_apply_log_block, and hadr_capture_log_block gives us the capture time. All of these extended events put together, along with the duration each event took to complete, give us good information about how long it takes for the primary replica to send data across to the secondary replica and for the secondary replica to apply those transactions.
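[Editor's note] If you want to put those events into a session yourself, a minimal sketch could look like the following. The event names are the ones called out above (they were added in SQL Server 2012 SP3 / 2014 SP2), but verify them against sys.dm_xe_objects on your build; the session name and file target are just placeholders:

    -- Latency capture session built from the events named in the talk.
    CREATE EVENT SESSION [AG_Latency] ON SERVER
        ADD EVENT sqlserver.hadr_log_block_compression,
        ADD EVENT sqlserver.hadr_log_block_decompression,
        ADD EVENT sqlserver.hadr_log_block_send_complete,
        ADD EVENT sqlserver.hadr_apply_log_block,
        ADD EVENT sqlserver.hadr_capture_log_block
        ADD TARGET package0.event_file (SET filename = N'AG_Latency.xel', max_file_size = 256)
    WITH (MAX_MEMORY = 4096 KB, STARTUP_STATE = OFF);

    -- Start it only for reactive troubleshooting windows, then stop and drop it.
    ALTER EVENT SESSION [AG_Latency] ON SERVER STATE = START;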
Now let me switch over to one of my SQL Server instances which has an availability group configured, and show you what kind of configuration I have. I currently have a two-node cluster; both nodes are standalone instances, and I have an AG called test AG configured, running SQL Server 2012. If I look at my availability group configuration, I can see that the TIGERAG1 SQL Server instance is currently the primary and TIGERAG2 is the secondary replica. Let me show you the properties: you will notice that AG1 is the primary replica and AG2 is the secondary, the availability mode is set to asynchronous commit, and readable secondary is configured for my replicas.

Now I'm going to try to generate a few transactions, so let me open up my SQL script, which is pretty simple — it's going to insert a thousand rows into a table. Before I do that, I'm going to start up perfmon with two counters for my environment. Let me add the counters real quick; I'm going to add counters from my secondary replica, and the counters I'm interested in are the redone bytes and the log bytes received. For some reason that's not showing up, so give me a moment before I pull up the counters. The reason for using perfmon, as opposed to pulling this through the DMV, sys.dm_os_performance_counters, is that the DMV doesn't really give you a perspective of the change or the pattern over time — it's always good to look at it in a graphical format. So let me add the counters; that should be really quick. I'm looking for AG2, and I'm looking for the Database Replica counter object. I'm going to select the test AG, use the Redone Bytes/sec and Log Bytes Received/sec counters, highlight them, and get rid of my processor time. Hmm, for some reason I got that wrong again — so going back to Database Replica, Redone Bytes/sec, Log Bytes Received/sec. Okay, done. Now we have those two counters added; sorry, it's hard to see because of the resolution here.

Now let me run the transactions and you'll see the counters spike up. I'm going to insert a single row and go back — I don't see a difference. Now let me run the loop and see if I see a small jump — yes, I can see both counters showing up, the log bytes received and the redone bytes, and you can see there is a pattern; it's finished the thousand rows. I'm going to try that again, and we should see the same kind of small plateau pattern show up.

Now let's see what we see in the dashboard I was talking about. The SQL Server Always On dashboard gives you the redo rate, the send rate, and the last commit time, and you can right-click on — Hey, sorry to interrupt — Hi, yes? — I just wanted to throw a couple of questions at you based on the demo here. One of the attendees was asking: is perfmon the only way to get to these metrics, or could you also get to them through the normal DMVs? Good question — I'm going to hold the answer, because I'm about to cover exactly what the participant asked. The DMV is what builds up the dashboard, and what you're seeing here, the log send rate, is exactly what I'm talking about: the log send rate is an average of the throughput that your availability group's primary replica is sending at during an active period. Notice that right now, even though my default refresh rate is 30 seconds, my send rate is sitting at a static value — I'm not doing any transactions on my system, and I'm not doing any write activity.
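[Editor's note] For the attendee's question: the same per-second counters the perfmon trace plots are also exposed through sys.dm_os_performance_counters — a small sketch below. The object name assumes a default instance ('SQLServer:'); a named instance uses the 'MSSQL$<instance>:' prefix, and the '/sec' counters are cumulative, so you sample twice and diff to get a rate.

    SELECT  object_name, counter_name, instance_name, cntr_value
    FROM    sys.dm_os_performance_counters
    WHERE   object_name LIKE N'%Database Replica%'
      AND   counter_name IN (N'Log Bytes Received/sec', N'Redone Bytes/sec');
    -- cntr_value for '/sec' counters is a running total: capture two samples a known
    -- interval apart and divide the delta by the interval to get the per-second rate.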
But the important point to keep in mind is the last sent time. The current time on my system is 5:26 PM, whereas the last sent time on the replica is 5:25, and that tells you that the DMV only gives you information while the availability group is active, whereas the perfmon counter is a per-second counter — you see everything at zero right now. So if you are measuring per-second latency, or if you are measuring or benchmarking throughput within your environment, my recommendation is to use the perfmon counters. What you see in the dashboard is an approximation of your send rate while the replica is active; for that period it gives you accurate numbers.

Now let's do this for the next two batches where I'm inserting a thousand rows, and look at perfmon again — I'm seeing those plateau patterns. If I go back and check again, my throughput rate has increased, and my last commit time, which will update itself on the next refresh, will change as well. So if you want an instantaneous, point-in-time reference, and if you want to base alerts on them — operational kinds of alerts — then perfmon is the way to go. If you want to decide whether the replica is behind while you're troubleshooting your environment, the dashboard gives a good indication of whether anything is being sent or not, and it's an average over the active time period. Notice that now that the transactions are complete, the last sent time has been updated to the current time, because the transactions just finished, and if I don't do any transactions on my replica these times will again stay static. John, I'm hoping that answered the question; if not, I can take follow-up questions right now. Yep, he says it does. Great.

So while we are looking at the dashboard and the send rate, what I've also done is compile all of this information into an availability group dashboard in Power BI Desktop, and again, all of this information is pretty much being pulled from the DMV. Before I click refresh, let me add the data collection layer on top: I'm going to create a table in my tempdb and then start doing the same inserts, and we'll see what my DMV shows; using the Power BI Desktop dashboard we should be able to get a good picture. What I wanted to show here, while the data is being generated, is the first half of the graph, on the left, which is the log send rate and the redo rate. The log send rate tends to keep rising — this is what I captured during an active period — because this is the frequency at which data is being sent, so if you have bandwidth available on the replica it keeps rising until it plateaus out. The redo rate also stays high until there is nothing left to redo, especially on an asynchronous replica. The log send rate and the redo rate will stay at a steady number on a system which is performing optimally; on a system which isn't, you would see a fluctuating graph during active periods, which tells you there are some concerns with the performance of that particular environment. So let me stop the data collection, pull that information into the dashboard, and see what we can find. This is connecting to my tempdb, and it's going to load all the data into these charts.
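[Editor's note] The collection layer itself isn't shown on screen; as an assumption of what it could look like, a scheduled snapshot of the replica-state DMV into a table is enough for the Power BI file to read. The table name and columns below are purely illustrative.

    -- Hypothetical collection table; run the INSERT on a schedule (Agent job or loop).
    CREATE TABLE dbo.AGThroughputSamples
    (
        sample_time     DATETIME2(3) NOT NULL DEFAULT SYSDATETIME(),
        database_id     INT          NOT NULL,
        log_send_rate   BIGINT       NULL,   -- KB/sec while actively sending
        redo_rate       BIGINT       NULL,   -- KB/sec
        log_send_queue  BIGINT       NULL,   -- KB
        redo_queue      BIGINT       NULL    -- KB
    );

    INSERT INTO dbo.AGThroughputSamples
            (database_id, log_send_rate, redo_rate, log_send_queue, redo_queue)
    SELECT  database_id, log_send_rate, redo_rate, log_send_queue_size, redo_queue_size
    FROM    sys.dm_hadr_database_replica_states;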
As you can see, the send rate continued to rise, and once there was nothing left to send it stayed static. Similarly, the redo happened in bursts: data came in, got redone, data came in, got redone, and that pattern repeats. So, in conclusion, when you're looking at the dashboard, keep in mind that the log send rate is an average — an indication of your throughput measured while the instance is actively sending data to its secondary replicas — whereas the perfmon counter gives you a point-in-time, per-second value. If you're setting up operational alerts, use perfmon; if you're reactively troubleshooting the environment, the Always On dashboard gives you a good indication; and if you want to get really scientific about the data, you can create an extended event session using the events I talked about earlier in my slide deck, which gives you much more granular information — down to microseconds — on how much time it takes to send, how long it takes to get a remote ack from the secondary replica, and how long the various tasks in between take. The session configuration is pretty straightforward: all you need to do is take those events I pointed out earlier and add them to an extended event session, and what you need to pay attention to are the time-based fields, like the processing time and the time to commit. I already have these scripts uploaded on GitHub; once I complete the session I'll point you to the GitHub link.

Let me switch back to my slide deck and move to the next section, which is troubleshooting failovers. In SQL Server 2016, 2014 Service Pack 2, and 2012 Service Pack 3 we introduced a new lease renewal event, and the availability group lease expired event was improved with additional logging. We also log additional messages in the cluster log whenever a lease timeout happens. The lease is basically a health check for your replica — to put it very simply, it tells you whether the replica is alive or not — and it happens every five seconds. So if there is a time period where the replica is not responding, the lease will time out, and that could eventually lead to a failover. When that happens, the cluster log will now have additional information. One thing we have found in a number of customer cases is that lease timeouts tend to happen when there is additional system pressure on the secondary or the primary replica — high CPU usage, very high disk usage, or a combination of both — which is why, in the additional messages, we even write out the amount of processor time being consumed, the average disk reads and writes, and the available physical memory on the box, so you get an indication of what the system resource usage looked like when the lease timeout actually happened.

So let's look at what failover troubleshooting involves. The first thing, obviously, is the lease renewal: it happens every five seconds, and if it doesn't — if the SQL Server instance does not respond — then when the lease expires we raise an extended event, availability_group_lease_expired, and that tells you when the lease actually expired. After that, you can go look at the additional logs available in the log folder to find out what the state of the system was.
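[Editor's note] A simple first pass over those logs can be done from T-SQL; for example, a rough sketch that searches the SQL Server error log for lease-related messages around the failure window (the exact message text varies by build, so 'lease' is just a loose filter):

    EXEC master.dbo.xp_readerrorlog 0, 1, N'lease';   -- current SQL Server error log
    EXEC master.dbo.xp_readerrorlog 1, 1, N'lease';   -- previous error log, if the instance restarted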
Any questions? Yes, we have a couple that have come in since last time. Is there any expected performance overhead for the latency data collection you were showing in your previous demo? Okay — for the T-SQL script I was running to capture DMV output, no, there isn't. The extended events, which capture latency at a much more granular level, are meant for reactive troubleshooting; we don't recommend having them turned on 24/7 in a production environment. You use them only when you know there is a latency problem — and the way you would find that out is if you're looking at your perfmon counters, or monitoring them automatically, or you are capturing a trend based on the send and receive rates from the HADR DMVs which is deviating from your baseline. Awesome. And the new columns that you showed in the report built into Management Studio, and some of the perfmon counters — are those available in previous versions of SQL Server with the KB article you mentioned at the beginning of the presentation, or are they just for SQL Server 2016? Okay, so the question is about the additional columns I showed in the Always On dashboard: those have been there, they are not new to SQL Server 2016. All you need to do is right-click and pick those columns; when I start my demo again for failover troubleshooting I will show you how to get to them, and they are available in older versions of Management Studio as well. Awesome, thanks, Amit. Thank you.

Okay, going back to my presentation. Let's talk about which versions have this — as I mentioned, all of this information is available in all versions that support availability groups: 2012, 2014, and 2016. Now, when you actually have a failover, what do you want to collect? The first thing is the SQL Server error log from the time of the failure. The Windows system and application event logs from the time of the failure should also be collected — these can be quite valuable, especially if you're having driver, firmware, or disk controller related issues. We also have the failover cluster instance diagnostic logs — we maintain a maximum of ten rollover files in the SQL Server log folder, and these are also extended event log files. Then there's the Always On extended events health session — we maintain four of those files — we maintain the system_health session log files in the log folder, and there is obviously the Windows cluster log, which takes advantage of the enhanced logging we do in 2012, 2014, and 2016.

So let me switch over to my virtual machine again. Before I move on to the failover troubleshooting demo, back to the question of how to get those additional columns: when you launch the availability group (Always On) dashboard from Management Studio, if you right-click on the columns you have the option to pick additional columns like the last redone time, or, if you want synchronization performance, you get to pick all of these. The dashboard also lets you group by, or you can click on the add/remove columns option, which gives you the same kind of view; you can do the same for the replicas — it allows you to add additional columns there as well.
Now, the SQL Server error log gives you a whole lot of verbose information whenever there's a failover, but let's concentrate our efforts on the different extended events we have for troubleshooting failures. The first thing I want to look at is the Always On health events. These are rollover files, and we maintain only four of them; you can look at the configuration right here if I right-click and choose properties. By default, on any SQL Server instance, when you start or create an availability group we create an AlwaysOn_health session which captures the following events: alwayson_ddl_executed — so if you do an ALTER on your availability group it gets captured in this extended event session; when the lease expired; when there's an automatic failover validation; when there's a state change — when there's a failover there's a state change; any errors reported by the availability group; and information about the redo being blocked.

Now why is that important? Why are we specifically looking at redo being blocked? Whenever there's a failover of the availability group, the database actually restarts and recovery is initiated, so if the redo thread gets blocked at that time, your database will not come online after the failover, which means the failover process will not complete and your applications will not be able to access the environment — and that leads to downtime, business downtime. That is why we added the lock_redo_blocked event, which lets you quickly determine whether the redo thread is being blocked by another process or by a system thread, and lets you take corrective action; the corrective action obviously is not automated.

Now, if I look at the AlwaysOn_health session, I can group these events together. The first thing I can do is click on grouping and group by name, and you can see that the state change has the maximum number of events reported here. Let's check what I'm able to see: it tells me the different states it transitioned between — pending, online, offline — and since I've been shutting down my instances between demos, there are obviously a lot of online and offline transitions. There are the different errors reported in the event — these are some of the messages that are reported in the error log — and based on the destination you see, you will know where these messages are being written, whether they're going to the error log or to the event log, so you can track them down in those logs as well. Additionally, we have the lease expired event, which I talked about — this is important, it tells you the lease was not valid, which is why it expired; again, if you go back to the recording I did for the previous virtual chapter session, I talked about this in detail, and there are Power BI dashboards associated just with the lease expired event. The alwayson_ddl_executed event is important because it tells you what command was executed during that time — I can see that I modified the replica to asynchronous commit, which is what I did earlier to change it from synchronous commit for the set of demos I'm running. It also gives you the state changes — primary pending, primary normal — it shows you the transitions between the different states. So this is the first thing I would look at, but because only four files are maintained, you might not be able to glean a lot of information if you have not captured this information upfront.
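[Editor's note] If the instance is up, you can read those rollover files directly from T-SQL instead of the XE viewer; a minimal sketch is below, assuming the files are still in the default log folder with the default AlwaysOn_health* naming (otherwise supply the full path).

    ;WITH x AS
    (
        SELECT CAST(event_data AS XML) AS xe
        FROM   sys.fn_xe_file_target_read_file(N'AlwaysOn_health*.xel', NULL, NULL, NULL)
    )
    SELECT  xe.value('(event/@name)[1]', 'nvarchar(128)')       AS event_name,
            xe.value('(event/@timestamp)[1]', 'datetimeoffset') AS event_time,
            xe                                                   AS event_data
    FROM    x
    WHERE   xe.value('(event/@name)[1]', 'nvarchar(128)')
            IN (N'availability_replica_state_change', N'error_reported',
                N'alwayson_ddl_executed', N'lock_redo_blocked')
    ORDER BY event_time;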
The other option is to look at the failover cluster diagnostic logs. This is another set of extended event files available in the log folder, and the information is quite useful because we maintain ten rollover files of 100 MB each, so you can have about a gigabyte of data in your log folder. Especially if you did not know that a failover happened, these can be a goldmine to dig through to find out why it actually happened. So let's look through one of these logs — John, any questions? Okay.

Let me show you what this contains. It has similar information, along with some additional information about your system health. The first thing is the state change — it tells you whether the system was healthy or not, which is always good to know, because you probably might not have information about what the system health looked like, and as Murphy's law has it, whenever you have a failure, even though you had monitoring, it was not able to capture the data you needed to troubleshoot; so having this in your back pocket is very useful. I also get the availability group "is alive" failure — it basically tells me that my replica died. I also get component information. To give a brief summary of what the component health result means: in SQL Server 2012 and above we started capturing the sp_server_diagnostics stored procedure output in the SQLDIAG XEL files. We also capture that information in the system_health extended event session, but having it in this historical data allows you to correlate what the system performance looked like when there was an actual failure.

Let me add some additional columns to make this a bit easier — for example, I'm going to add component type and component to this data set, and then group by component. Now that I've grouped by component, I get some additional information about the system; for example, if I double-click on this data I can see whether there were any access violations, any memory dumps, any non-yielding situations. All of this data is daunting to sift through if you're doing it for the first time — again, if you noticed, in my folder I had a Power BI dashboard file which allows me to sift through this data much more quickly. What you can do is merge all of this data: when you open these extended event files you have the option of merging extended event files, and when you do that you get the option of adding all the event files and clicking OK, and Management Studio will merge them for you. Then you can go to Extended Events, Export, and export to a table — it will ask you which database and which table you want to load all of this data into. The second option is to run T-SQL commands to import the data, and the third option is to use API calls — the Extended Events LINQ reader — to read directly from the files without using T-SQL.
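[Editor's note] For the T-SQL route, a sketch along these lines would pull the failover cluster diagnostic (SQLDIAG) files into a table that the Power BI file can then query. The wildcard assumes the default file naming and that the files are still in the default log folder, and the target table name is just an example.

    ;WITH diag AS
    (
        SELECT CAST(event_data AS XML) AS xe
        FROM   sys.fn_xe_file_target_read_file(N'*_SQLDIAG_*.xel', NULL, NULL, NULL)
    )
    SELECT  xe.value('(event/@name)[1]', 'nvarchar(128)')       AS event_name,
            xe.value('(event/@timestamp)[1]', 'datetimeoffset') AS event_time,
            xe                                                   AS event_data
    INTO    dbo.FciDiagEvents       -- illustrative target table for the dashboard
    FROM    diag;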
So what I've done is import all of this data into a Power BI dashboard. I have taken the different components — the system component, the I/O subsystem, the query processing component — and put all of them into a dashboard. Now let's see how far back I can go: I captured this data yesterday, but because I have historical data I'm able to go back all the way to July. Keep in mind that I have kept my instances shut down for quite a few days, but the reason I'm showing you this is that even if you did not have data collection tools in place, you still might be able to do a sufficient root cause analysis using the extended event files available in your log folder and a dashboard similar to this one.

So let's look at my AG resource, because I want to identify the times when I had a failover. These are my lease validity times, and I can see that on various days I had failures. Let's look at 8/11 — I can filter this data, so let me filter by that particular date: I can say "is after", let's say August 11th, and "is before" August 13th. Now there is a very good indication of where it actually failed: if I mouse over, it tells me there was a failure at 3:34, and I can see there was a state change associated with that time, at 3:32 and 3:34, and that correlates with the white space — every five seconds there was a lease check which was being validated, and after a particular point it wasn't.

So let me see if I can find out what actually failed during that time. Let me look at the event time for 8/11 and sort in ascending order so it's easier to read. At 3:33 I can see my events were being logged; it said "stopped health worker", and at that particular time I had a timeout, so my instance was not responding. I did not have automatic failover configured, so it did not fail over, which is why you don't see the state changes; it shows that at that time it disconnected from SQL Server, and that was about when I had shut down the instance. So even though I don't have data collected from that period, just by looking at the log folder I was able to extract all of that information. And if I go to my query processing view and look at those dates, for 8/11 I should be able to get that information as well — it's quite a bit down, right there, 8/11. I can see what my system looked like: I did not have any pending tasks, so I can confirm there wasn't a problem on my instance related to additional user workload coming in. My I/O stats looked fine — I can go check on that if I want, and if I drill down it gives me sufficient information; I don't see any reds in there, everything is actually green, so I'm good there as well. For memory usage I have a graph, and at that time it just dropped, so obviously I did something then — and I know I had shut down my instance at that time. No fatal issues, no access violations, no latch warnings, no non-yielding conditions, no corrupt pages. So using this information you can pull out a lot of detail related to your instance's health and how the availability group resource was behaving. The information you're seeing on this page is per availability group — in this case my test AG; if you had more than one, you would see more than one color in the graph.

Any questions, John? Let me pull them up here — we have a couple. One is: have you seen any latency or AG problems depending on the number of databases that exist in an AG, or across multiple availability groups — the total of all databases that are highly available with AGs? I think the person is basically looking for some guidance on how many databases could be in an AG, or even be involved on a replica, before you start to notice worker thread issues or other issues. Yeah, good question.
I'm going to try to answer this briefly, because it could be a session by itself. The way to look at availability groups is basically the amount of physical memory, the number of CPUs, and the number of worker threads you have — those pretty much govern the scalability of your environment with respect to the number of databases you put into the configuration. If you look back at the earlier slides, I showed how many context switches we have to do to move a log block from one replica to the other; on SQL Server 2016 the scalability limit is higher, and on SQL Server 2014 and 2012, because of the number of threads involved, the number of databases you could place inside an AG will differ. The way to do the calculation is: look at your physical memory, look at how many databases will be acting as a primary replica on that particular instance and how many will be acting as a secondary on that instance, and then we have a KB article that describes how many threads are required for 2012 and 2014 — you can do the math and arrive at a number. I would maintain at least a 5 to 10 percent buffer, and I would also make sure I benchmarked the system for the number of workers required to satisfy my application workload, because the total pool of worker threads for availability group transport and maintenance is shared between the AG system threads and your SQL Server instance for certain operations. In SQL Server 2016 we optimized that, but the calculation is still valid. So I don't have a hard number — it depends on the physical memory, the total number of CPUs, and the number of databases acting as primary or secondary. Any other questions, John?

Yeah — no, great. I would add that I think your last demo highlights this as well, where you can look at the worker threads that are currently being used or idle, so it looks like your report can help monitor that too. Yes, absolutely. What John's talking about is the query processing section — this pulls information from your system, so if you are just not sure how many worker threads are being consumed on your system, what you can do is pull a SQL Server system_health session log from your instance, import it into the Power BI dashboard, and look at how many workers are created and how many pending tasks you have; that gives you a good indication of what your worker thread usage looks like. And if you just want to do benchmarking off the system health sessions, I also have a Power BI dashboard on GitHub for that — let me pull it up on screen. If you go to Tiger Toolbox, which is our GitHub repository, under the system health session folder you will find the Power BI Desktop file which gives you a similar kind of view, but just for the system_health session. Awesome.
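[Editor's note] To put a quick number on the worker-thread discussion, a small sketch of checking current headroom against the configured maximum (the "max worker threads" side of the sizing math) could look like this:

    SELECT  (SELECT max_workers_count FROM sys.dm_os_sys_info) AS max_workers_count,
            SUM(current_workers_count)                         AS current_workers,
            SUM(active_workers_count)                          AS active_workers,
            SUM(work_queue_count)                              AS pending_tasks   -- tasks waiting for a worker
    FROM    sys.dm_os_schedulers
    WHERE   status = N'VISIBLE ONLINE';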
Okay, so to summarize: as I said, all of this provides a very easy way of consuming data that's already available in your SQL Server instance. We get a lot of questions from SQL Server customers about how to interpret latency in their environment, and also about why it actually failed over. Sometimes the conclusion is that there isn't sufficient data available to do root cause analysis, but before you arrive at that conclusion I would always urge you to look at your SQL Server error logs, the Always On health sessions, the system_health session, and the failover cluster diagnostic logs, along with the cluster logs, to see what information is there. The failover clustering diagnostic logs — the XEL files — stay around longer; it's the Windows cluster log that is prone to being overwritten very quickly, unless you have changed the default configuration to retain it for a longer period. What I'm showing you right now has helped us in a lot of situations where customers have called us after the failover has happened — and not immediately; some of them call us after two or three days to do root cause analysis — and those logs definitely help.

So that's all I had for today's session. If you're attending SQL PASS, come join us at our session — we're delivering another session on Always On availability group enhancements; the session code is DBA-313-M. We're going to talk about all the different enhancements we put together in SQL Server 2016, we're going to show you cool visualizations like these — we're working on pushing something into Management Studio as well, and we hope to show you that too, along with some other cool demos. Those are our contact details: troubleshootingsql.com is my personal blog, the team blog is the SQL Server team blog, we are available on Twitter at @mssqltiger, and @banerjeeamit is my personal Twitter handle. All the scripts for the demos I showed today, along with the Power BI Desktop files, are available in the Tiger Toolbox repository on the Microsoft GitHub. Feel free to download them; if you want to make changes, go ahead, make changes, contribute back to our repository, submit a pull request, and we will review and approve it as appropriate. Thank you, everyone, for joining today — it was great having the opportunity to present to you and talk about some of the supportability work we're doing; it's always good to connect with the community — and thank you, John, for setting this up. No problem, thank you for this great content, we greatly appreciate it, and hopefully everyone will be able to take advantage of your session at the PASS Summit, which is — gosh, I think it's two weeks now, it's coming up pretty quickly. I know I'll be there throughout, so you can catch me there as well. Alrighty — with that we'll wrap up today's session. Everyone, thank you for attending, and we look forward to seeing you next month. Thanks a lot, everybody. Thanks, Amit — have a good day.
Info
Channel: High Availability Disaster Recovery Virtual Group
Views: 11,698
Rating: 4.7802196 out of 5
Keywords: hadr, sql server, availability groups, latency
Id: 7Q0WwJ6RNNc
Length: 58min 33sec (3513 seconds)
Published: Tue Oct 11 2016