How to quickly diagnose the root cause of SQL Server problems with SQL Monitor - Grant Fritchey

Video Statistics and Information

Video

Captions Word Cloud

Reddit Comments

Captions

Wow gosh I made it I made it by a split second that was great so hey welcome back sequin the city we are going we are going in we're going this is going to be awesome let's talk about hey let's talk about sequel in the cloud City right now what we're talking about now rogue one right nobody nobody okay anyway rapid diagnosis a sequel monitor no rapid diagnosis equal problems with sequel monitor let's talk about my name is grant phrygia work for Red Gate software this is my contact information if you have questions after today about anything devops we've been covering that quite a lot we're going to be talking sequel monitor now query tuning some fun stuff around that if you've got questions on this stuff please let me know I want to help you now have my contact information this is my blog my email my twitter handle you can get in touch with me I will answer your questions to the best of my ability alright let's keep going so we've got some goals today that we're going to go through good this is on I want to use sequel monitor to identify pain points I want to talk about how basically how we deal with the fact that you know yeah you could get a phone call right you can get that phone call on sequel server things are offline there's a problem oh my gosh what do we do you know that's a great way to respond but but that's very responsive don't you want to be working in such a way that you're you're ahead of the game that's why that's what we're going to do monitoring that's why we're gonna put these kinds of things in place I want to show you how to walk through follow some information track through the maze of data that's there and then nail down you know where the issue may lie and then use those same tools to at least propose a solution I'm not going to fix the problem but I'm going to identify potential fix for the problem and we'll discuss not only what the potential fixes but why it's only a potential fix all right because you're also going to want to go back and retest everything you're not simply going to get some bit of a piece of information and apply it blindly that's crazy we need to focus we need to measure and let's talk about it so what do we want what do we want out of sequel server well we want it to fly we want it to work we want it to be online available and working well we want our businesses to continue to function we want everything to work we don't want the rocket to take off and do what it's designed to do this is what we're aiming at this is our goal so sometimes what do we get instead every once in a while things go pop when they're not supposed to it happens it's not that big a deal but it's something that you know it just occurs so the real question is is when things go pop when stuff breaks when stuff goes south when you get that phone call oh my gosh there's something wrong with the server what do you do well option number one cry in the corner you know just go and curl up in a ball and try to hide you know heck we're dba's we don't talk to anybody anyway we can just go away someplace and hide all right that's not going to work for most of us so what do we do instead well let's talk about measurement what we want to do is we want to do measurement you want to do design you want to do measurement you want to do thoughtful process this is all about coming at things from an attitude of knowledge an attitude of understanding and working through what knowledge and measures we have if we don't have knowledge and measures when you get them in place so if we're going to put them in place we're talking about monitoring now we've talked about DevOps in the DevOps roll all day long we've been going over and over this stuff the whole idea of you know developing code building the code using the various tools validating and testing the code ensure that you know what we're getting is in place packaging all that stuff up and then releasing it out into our production environment we have not and we also talked about provisioning we started out the day talking about provisioning with sequel clone which is really cool tool you got to check out what we have not talked about is the concept of monitoring we're going to monitor this the whole DevOps process is a circle it's something that's constantly going on your role if you are a DBA if you are a data pro responsible for production environment your role is still a part of the dev role as Alex just said if you were watching het session it is both dev and ops that's the whole idea of DevOps it's merging the two processes it is communicating across these two processes acknowledging that you were involved with both so monitoring is a fundamental part of DevOps and you need to make sure that you know that we've got that in there now sequel monitor is a monitoring tool from Red Gate software and it gives us a lot of these measurements it gives us a lot of these goals and ideas of what's going on so where does it start what starts off with your enterprise it gives you a whole view of your entire enterprise everything available and its status is it online is it offline have there been alerts if those alerts are red that's bad if everything is good it's green if things are it's in the middle excuse me okay maybe not wow that was weird thought I had to sneeze anyway if you know this is sort of management speak you know red bad green good but it's still the idea here is that you've got an overview what's going on and you get to break these down you know what am I production environment look like what does my staging environment look like because it actually matters whether your staging is online or not what does QA look like because to the QA team that is production what does dev look like because to the dev team that is production we need to know whether or not those guys are online and sequel monitor can track all that for you what else does it tell you well it gives you information about the alerts when things go bad backups have failed red line right that's a bad thing a sequel agent job has run long well you know that's interesting but or you know maybe just informational hey somebody logged on we're not sure where they came from hey that might be important we can get alerts on all these different things now what else going to tell us it can go down into the metrics right into the specific measurements what is going on with cpu what is going on with disk weights what is going on in various processes throughout the thing and most important not simply what is going on but the ability to compare today to yesterday this week to last week because when someone says hey things are slow on the server it's not simple enough to go okay things are slow slow compared to what now what you can see there on the screen is what we call a baseline and that is a normalized structure looking back over seven weeks so you can see that that little red spike which occurred last night is not that big a deal because that little red spike occurs every night around the same time what is that that is a backup process affecting how disk i/o goes over the day hey you know I kind of don't care that that's a big deal right but we've got the ability to compare that spike from last night to a preceding time to a series of preceding times and that gives us information so that we're no longer just responding to something that we saw but we're measuring and making decisions based on those measurements and that's our goal a new thing that we're introducing now is the ability to create some canned reports that you can take all the stuff together create a series of measures put them together into a single report and then email it to your boss email it to yourself whatever keep a track of it you know and put it into canned package so you can move it off and put it someplace and have it available for later it's great way to figure out what's going on over time it's a great way to report uptime and all kinds of fun stuff like that and so those those are all put together inside of a canned report but now let's go play because playing is a lot more fun now I've got a system oops I got a system that I clipped in the wrong spot haha I've got a system showing some time and various behaviors that I've put together this is my local server it is only you know running my local instance of sequel server so the monitoring on it's not 24 7 because I will shut down my laptop at night and so you won't see everything that's going on let me just zoom in a little bit now what we've got here is over a period of time I put a load on the system and that green line represents some io the blue line represents some cpu and you can see where you know one point my load thing was got turned off in the green line dropped CPU no cpu shot through the roof but what i want you to focus is in right there in the center where there's a severe spike in cpu let's pretend that around you know seven a.m. when that occurred i got a phone call or maybe even after 7am probably you know eight or nine a.m. i may have gotten a phone call saying hey you know what at seven o'clock we were running some queries and we hit an issue with this you can immediately go in drill in and take a look and say okay well you know i can see that there was a load at seven o'clock you know there was like you know the line before and after that little blue spike is pretty steady state there's a lot of activity going on but right around the time you say that there was an issue and we can even narrow it down and say what time was that oh that was at 654 huh seven o'clock for once that phone call actually was accurate on the time right how weird is that you know that you've got a spike on the cpu around to around eighty percent now that's that's a significant spike that's using up quite a lot of the CPU there and by the way I know that that's a higher spike it's a demo machine bear with me alright let's work together on this when we look down we can see over here that there was a steady state except for that one big spike in the middle and it's based on this highlighted area we see the information down here memory steady state no big deal we're using up every bit of memory we possibly can that's what sequel server does I oh it's doing stuff but nothing bad nothing abnormal and basically there's a pretty consistent behavior same thing goes with the network oops not with a weird a little jump that's my mouse so next down we can go down when you start looking at the top ten queries and we can see by duration each of the queries that are going on no let me zoom in a little bit so you guys can see now there's some sort of system query here looking at the extended events session targets yeah maybe that's important maybe it's not it got executed 24 times durations a little long a little bit of CPU from the duration and that's an average duration over that period maybe that's an issue maybe it's not we've got some other queries those take a look in there we're pulling out the plan handle we're looking at XP address and we're looking at there's an actual query query that was called 11,100 times huh okay that's quite a lot and there's another query going against a common table expression that was run a million times maybe that's an issue let's say that we know this system I know that the query that was called a million times has already been tuned fairly well it's got a good set of indexes on it and if you look at the average duration at four milliseconds I think we're okay right not a big deal the other one you know maybe or maybe not it's an issue let's zoom back out it's hard to know what's going on based on just queries so let's army base just on duration so it's also sort based on cpu time now if we look well we see the extended event session query is still up there use about a hundred and twelve CPU that's interesting we see that there's another query and it used a little bit of CPU and then a hey look at that number three is that other query the one that we don't know about execute 11,000 times and it's using a quite a bit of CPU time on average two well let's sort again let's sort of logical reads now if we zoom back in so you guys can see will notice hey look at that there's that system query going for the query plans and there's that address query again now it's bumped up even higher it's got 349 logical reads it's been called 11,000 times I'm getting more and more interested in this query it looks like a lot of stuff is going on with it heck if you'll notice the one query that we were worried about that one that was called over a million times it's bumped way down it's hitting almost no logical reads at all why it's a fairly well tuned query running in four milliseconds we well I'm seasick I don't know about you guys we know about it the number of logical reads it hit is 91 we've got knowledge about this query already and we know that it's running the way we expect it to even though it's called a lot and yeah if we could tweak tweak tune it better that might reduce our overall load but we know about that query it's not something we don't know about that other query that address line query that's popping out at me I've never seen it before I don't know why it was called 11,000 times something's going on with that it's pop is showing up over and over again in our sort so we've got something there we need to take a look at so let's take a look at it will click right on the query will zoom in we see that the query has got SOS scheduler yield oh wait what SOS scheduler yield you mean it's waiting on the processor huh didn't all this launch off of a cpu spike yeah it did cpu spikes Wow ok so number of CPU spikes caused by this thing the query itself you know it's going for the person address table state province table is doing a filter on the at City column seems ok I mean it's you know nothing exciting but we've got some information about it now now we can drill down there's other information available to us I'll give out Kate off my screen we can look at sequel metrics we can see what's going on across all kinds of stuff we can see page splits compilations all kinds of other information and you'll notice every one of these has got there's something happened at 653 there's something happened at 653 round 653 654 am something happened might be that one query that we've never seen before like I called 11,000 times anybody guessing up that me too mainly because I know I ran that query 11,000 times at 654 am I know I'm cheating but that's my job the other metrics aren't going to tell us as much so you know basically we've kind of narrowed down we know roughly when the problem what occurred and we've got a good idea what the problem might be it looks like that query is giving some issues so let's take a look at the query again and let's notice I kind of glazed by this just to show you some other information the fact is we can actually look at the query plan if we look at the query plan it's not a bad plan it's not complicated it's not hard to understand we've got to select operations pulling from merge-join which goes from a clustered index can a sort and another clustered index can now if we look at this this is super team is showing us this information it's built right into sequel monitor if we take a look at it you can see that the most expensive operation was eighty eight percent as a clustered index scan and then we have some other day too heavy operations moving a bunch of Records around hunter 434 records if we go back over here and we look we can see the whip sorry zoomed in things always act a little funny there we go got fly like a pilot minor adjustments minor adjustments eighty-eight percent is that one clustered index scan now that's interesting we have another cluster index scan but it's only doing one point three three percent now these are estimated values we don't know what the actual measures are but you can look at them and kind of get an idea this one is moving 100 records this one's moving 434 if we take a look at the properties scroll down a little bit the table only has 181 rose in it we can tell that by looking at the cardinality so it moved 141 of 181 records 141 records there's probably not a lot moving out of there let's take a look at the properties on this one scoot over scroll down and now we see a cardinality of 19,000 I suspect a scan against 19,000 rose is probably around the neighborhood of eighty eight percent of the cost compared to a scan of 100 some odd rows anybody agree on that guess raise your hands no no raise them I can't see okay I agree it's moving a bunch of data 19,000 rose is coming back and forth that's a lot of stuff moving around that concerns me now I'm not done with my concern let me zoom back out a little bit hide the properties again remember that Ouellette zoom back in so we can see it there's a sort why is there a sort well let's go back up to first one again take a look at the properties always back into the properties always back into the properties properties are what we're looking at Hey look at that it's ordered is equal to true hmm interesting well it's going to emerge join merge joins are very efficient oops wrong key merge joins are very efficient so merge-join takes ordered input from one ordered input from another and puts them together when we bet ordered input from one and we've got a sort operation I know what's going on but let's prove it we'll go over their properties will scroll down ordered is equal to false uh oh what happened well it's scanning an index what where is it scanning I don't know I mean I you know it's kind of hard to tell what's going on what if we you know I don't know again looked at the properties predicate oh it's at City so what's this index on well this index is on address ID we're filtering on City huh I wonder I wonder if I wonder if we can put an index on this and would help just guess it just guessing or maybe let's zoom back over here look at that there's a warning missing indexes missing index could improve the cost let's click on that and see what we've got missing index it suggests use adventureworks 2014 go create non-clustered index and the index name is by the way there are consultants around the world who sing hallelujah every time they see an index on one of your tables named name of missing index calm assists index please please please stop them singing reading the index before you create it okay we're all asking bad game please get that done all right so we're going to create an index and it's going to be in the address city table awesome and in theory Microsoft suggesting through the through the missing index information that it's going to move this query better by eighty percent whoops didn't mean to do that click the wrong X what happens then you get bad stuff okay let's close all that down so what we've now got is a suggestion for missing index I suspect we've got our solution however we don't we have is a starting point for testing our solution we're not actually going to implement this what we are going to do is we're going to go back to our dev system we're going to take a look and see like okay so if this index were implemented what would happen how does it affect other inserts how is it affecting the in updates deletes does it impact other parts of the system are there other queries that need the city column to be filtered let's check all that stuff out before we make a decision on this one missing index query however 11,000 calls if this is going to be a regular thing it's spiking the CPU it's causing waits on the cpu our CPU is already under load if we start seeing this a lot this thing could become a major issue so we've got indications that are going to walk us right up to creating this index but is this index going to be helpful well we won't know without testing I cannot tell without testing but I can make an educated guess and I'm going to make that based off the query so if we put an index on the city column for address what's going to happen well the optimizer says I can use that index to retrieve data and it will if we were to put this together put it inside of a test system put it together and try it out what we would see is it would take that index it would do an index seek it will go down there and find the rows that it needed and that's great and then it would add another couple of operators the next operator would be a probably nested loops join and the other upper be a key look up what's it doing well if we notice there we're pulling out the address ID the line line one line to city and postal code all from the at the address table if we just put an index on City we still have to go back to the clustered index to retrieve the data ah we might need an include a statement on the index maybe maybe this is all about testing this is all about validation we need to test and validate whether or not that's true probably it is but that adds to the impact can we trade disk space for performance it's a decision you'll have to make and measure using the same set of tools that we've got in front of us okay so now we've got a proposed solution all from sequel monitor never having left sequel mantra we've at least proposed the point of a solution so let's talk about it what did we do well we opened up sequel monitor and we spotted a spike we spotted a problem especially because we get that phone call we think something happened at 7am sure enough something did happen at 7am there was clear spike on cpu we had a problem we drill down on the details we started looking through and sure enough there was compiles that occurred at 7am that you know compiled stored procedures that obviously hadn't been compiled before we saw weights at 7am around CPU we saw all kinds of stuff occurring at 7am we filtered our query various ways so we could see well okay so CPU which ones used more ok disk which ones use more a duration which ones you emotional recalled most frequently which ones lined up frequently what came up what data presented itself what measures came up to us we then investigated that data we drilled down took a look at the query we took a look at the stored procedure we dug through the information and data at hand so that we could you know start understanding what was occurring and then using the information provided to us we identified a possible resolution possible resolution I probably should have put possible in the slide deck it says a but we identified a possible resolution so taking the problem and identifying where that problem occurred drilling down on the details of the problem looking at the data in various ways slicing it through various ways using sequel monitor investigating the information provided to us and sure enough the one query that we look like that you know that we hadn't seen before that was showing up had weights on the cpu and then drilling into it we could see that this query had issues clearly needed an index and if we put that index in place at least it would work better probably we'd want to modify that index to make it work best but regardless we were able to go through all this stuff from sequel monitor and put it together so what were our goals of the session today use sequel monitor to identify primary pain points of sequel server hopefully we showed how that can be done how that occurs where you can drill in and find that information follow that information to localize where things were happening that's pretty much what we did we spotted you know the area that we were concerned with we were able to drill in on it and make some decisions there that led us down to the path we needed to be with that we then tipped the tools that we had at hand and tried to put it together so that we could propose a solution that would bring us to a resolution that was the goals of the session hopefully the the goals and what we did matches up and so our future looks bright right the shoot has deployed everything's going to land the rocket flight is successful yeah no rocket flights not exactly a great paradigm for sequel cervix we kind of like want it to fly continuously not fly then come back down but you know hey pretty pictures and its presentation so I hope you enjoyed talking about single monitor the future looks very bright and let's answer a couple of questions first up ah well this isn't related to monitor but a sequel tool belt work with either sequel taboo can't even say it does sequel tool will work with Azure sequel database I'm sure it works with seat infrastructure-as-a-service but what about sequel Azure hey I am very very very excited and pleased to announce that it absolutely works with Azure sequel database it even sort of kinda works with Azure sequel datawarehouse but not completely there's some issues around but as your sequel database absolutely we supported all the way sequel prompt works with Azure sequel database yeah but compare data compare all of them matter of fact compared data compare is a great way to do some migrations and stuff to Azure as your sequel database now if you want to try out sequel monitor you can there's a we have a live web site sequel monitor metrics redgate com that is sequel monitors running all the time on our sequel server central database so our servers are online our monitoring tool is our monitoring those things live so you can go in there and play if you go to sequel monitor metrics reggae calm now I did put up my new blog post and the blog post them at Redgate calm oh that's way too long to say out loud get it wrong go to the reggae blog at Redgate calm you'll find my monitor articles the very top one and it basically walks you back through the scenario i gave with a lot more detail than what we went over here and so all that's put together and that's ready to go do we have any other questions that come up around this stuff okay excellent that's good to know I want to thank you guys I've got one more session a little later it's going to be pretty cool stuff other net we're up with another different session coming up right now and thank you very much the sequel in the city we're having fun

Info

Channel: Redgate Videos

Views: 1,421

Rating: 5 out of 5

Keywords: redgate, grant fritchey, sql monitor, sql server problems, monitoring sql server

Id: 8MU0Pd1WYpI

Channel Id: undefined

Length: 28min 10sec (1690 seconds)

Published: Wed Dec 21 2016