An Introduction to Microsoft SQL Server's Statistics

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
so let's say that you're a sql server and you want to go run a query and for the example of our table i'm going to use playing cards i'm going to say that every card inside here is a row well if i'm going to do gambling so to speak if i'm going to do a select out of here and i want to guess how many rows are going to come back let's say that i want to go select star from table or select star from playing cards where the face is a 2 or where the denomination is a two sql server needs to know roughly how many cards are going to match so to do that whenever you go to run a query the first time if sql server is like i have no idea what's inside of here sql server will actually open it up and create a statistic a statistic is one 8k page that sql server uses to describe the data that's inside these so i'm going to go create a statistic now just like sql server would i'm going to say denomination and then what sql server has to do is it has to go through and read these and put them into piles so let's see here i have four twos i have four threes four fours four fives four sixes four sevens four eights four nines and four tens so let's go right out our statistics now building statistics is a lot of reads you have to read the entire table or i'll talk about a shortcut here in a while or i have to sample it but there's a whole lot of reads going on but it's only one 8k page that we have to write so i'm going to say twos we have four of those threes we have four of those fours we have four fives sixes sevens whoops corruption already nine and tens we have four so now i know roughly how many of each denomination i'm going to find also let's say that i also want to know about the hearts and clubs and spades and all those that's a different attribute of our data i might also make a statistic on suit i might say that for hearts we have nine of those uh now this is when my art starts to suck and this is where i accidentally draw all kinds of nasty terrible things my wife said that my clubs look like chickens and then diamonds we have nine of those as well so okay so now i've got these statistics that i can use whenever i want to go run a query against the table if i want to go around a query and i want to say select star from dbo playing cards where numbers equals 2 then i know even before i reach into here if i go over and look at my statistics this is what sql server uses in order to build query plans when you go to build a query plan you only look at these you're not allowed to look in here so right now if i go to write that query select star from playing cards where number equals two i can guess that the estimated number of rows inside there is going to be 4 and sure enough when i go to run it it's 4. now that works all well and good if your table is completely read only but let's say that it's not let's say that someone doesn't insert if someone doesn't insert into our table and then we go to guess how many rows are going to come back where faint number equals 2 we're going to still estimate that there are four rows that come back even though if i go to look now there are four twos inside here and same thing if i go to query how many hearts are in my cards if i go back and look to see how many hearts i think are going to be in my cards i'd say well i think there's going to be about nine of them but i'm going to be incorrect and the more that people insert data into my table if we insert a few more rows then the data may become even more incorrect select star from cards where do not where number equals two okay there's there's going to be about four inside here well oh man now i'm not even close sql server needs to keep these statistics up to date it won't do it in real time it doesn't continuously change these things every time somebody doesn't insert update or delete that would be too much contention and blocking around this page so depending on your version of sql server and the size of your tables sql server will start to update your statistics more often generally speaking you'll hear stuff out there that whenever about 20 of the data in the table changes then sql server is going to take a breather and go update the statistics that 20 number grows smaller with newer versions of sql server and with larger tables the larger your table becomes the more sql server understands that it's important to keep those statistics up to date now you don't have to wait for a sql server to do it automatically either you don't have to wait for a sql server to watch for changes to this what a lot of us will do is we'll set up jobs that pop open our statistics on a regular basis and update them but there's an overhead to that if i want to go and update these let's say that we've decided that these are out of date well now when i want to go update my statistics i've got to go through and read all this data again so now let's go do it all right so we got we've got eight twos so for denomination we now have eight twos and then i gotta sort them out by threes i gotta sort them out by fours i gotta sort them out by fives you don't probably want to see me do this let alone you don't really want to do it yourself every stat that you have requires reading the table because i have another stat on suit that i also need to update sql server doesn't just read this whole entire table once because you can kind of see why i can't i have to break these things up into buckets so sql server will scan this entire table once for each statistic that would get pretty tiring you wouldn't want to sit around and wait for that to happen over and over again so what sql server does is a lot like what political pollsters do you know how whenever there's an election coming up your political pollsters they just sample the population they don't call the entire population they don't call every single person that's out there in order to figure out how they're going to vote we sample the population we just take a rough guess we know that we've got about 40 cards give or take inside here let's pull out 10 of them let's pull out just a random card from here random card from here pick a card any card i've always wanted to say that we'll pick a card from here and we'll pick a card from here so we picked out 10 cards that we believe roughly represent our population if we go see what we've got on here we might use these to make our statistics instead of full scanning the entire table to get the numbers exactly right what we might do is we might say denomination and we're going to say that we have two twos we have one six and we have one eight except that we know we only scanned about ten percent of the table so let's multiply that by 10. so if i was going to use these statistics to guess what my entire table looks like i would be insanely incorrect now fortunately for us well let's go ahead and do the suit one too while we're at it if we use this to represent our our population we would say that for hearts we have one for clubs i know i suck at that we have two for diamonds we have one and of course i only sampled 10 of the population so let's say that we'd have like 10 20 and 10 there aren't even any spades in here at all just by random luck of the draw so if you just use these sampled statistics in order to guess how the entire population would look you'd get some query plans that may not be very accurate and plus if you use that technique with every time you update statistics guess what happens the next time that i pull out four cards if i pull out another four random cards they may nowhere near represent the same thing my stats may change dramatically every single time they get updated if i do sampling now depending on how your big how big your population is and how diverse it is and how many outliers you have sampling might actually be okay for other applications you may want to actually read every single row in the entire table every time that you update statistics but you just have to do the balance of how exactly correct do these need to be and how long are you willing to tolerate waiting for your statistics to be updated is it bad if your query plans are way out of date if you've been changing this data a classic example of this is data warehouses where every night we load a bunch of brand new data and then what do you do in a data warehouse you immediately query the data you just loaded for example you load last night's sales into the table and then you go query for last night's sales but if you look over at the statistics the statistics wouldn't reflect last night's sales until you updated the statistics to account for your brand new loaded data that's why it's often common in data warehouse type scenarios that whenever you get done doing a load you also update statistics so that as your users come in the next morning they have absolutely perfect query plans that match the brand new data that you just loaded sometimes you'll see people doing very frequent index maintenance and index maintenance is essentially making sure that all these cards are in nice order for example let's say that i was ordering these by the denomination of the card i'd be shuffling things around in order to put them into the right order this is what your defragmentation jobs do they're trying to lump data into the right order so that when you go and seek for them they're much faster when you rebuild indexes like this it's if you specifically if you do rebuilds when you rebuild your indexes sql server has to touch every page so at the time you do index rebuilds sql server also updates the statistics with full scan because after all he had to touch every single row in the table it's the perfect time to update our statistics with full scan that's a benefit that you get of rebuilding your indexes except there's a really great blogger michael j swart who likes to say that rebuilding your indexes is the world's most expensive way to get statistics updates if you think that rebuilding indexes is improving your performance odds are it's really just the updating of the statistics that's really doing that for you try updating your statistics before rebuilding your indexes all the time because updating these is really inexpensive yes it does involve reading the whole entire table but it doesn't involve writing the whole entire table so now let's see how it works in sql server what i'm going to do is i'm going to take a table that's designed this exact same way and i'm going to look at statistics and see how sql server does with it all right so now let's go into sql server management studio everything that i'm going to show you in here the demo scripts all of it is already up at brentowsar.com go slash learn stats so let's go down here and let's go create a table so let's say drop table create table and then insert into there the perfect set of playing cards that we were just dealing with a second ago just the numbers 2 through 10 and the hearts club spades diamonds etc now let's go in and run a query let's say go show me all of the playing cards ordered by denomination and suit i really wish i could get it to show reds inside here because wouldn't that be cool but i'm not quite to that skill level just yet so here we got my ta my stuff all my twos my threes and so forth you can see that i only have two hearts or two or four twos that i only have four twos what if i go and try to run a query saying just show me all of the twos how does sql server know how many twos there are when you create an index on say denomination sql server also creates statistics on exactly the same column i created an index on denomination just now when i flew through this now sql server goes all right here's what i know there are 36 i always use a guy's voice for sql server because he's dumb and stubborn and he refuses to ask for directions he's all trust me i got this when he doesn't usually got this sql server says well there were 36 rows in the table at the time that i sampled it and i sampled all of them because it's there's not a lot of stuff stats inside or not a lot of rows inside here so here it looks like we've got a histogram with only five buckets in it it looks like maybe sql server only tracked the twos the fours the sixes the eights and the tens but he didn't he actually tracked all nine all nine numbers on the cards and it's really neat how he did it this first bucket represents all of the numbers up to two and including two sql services well for the numbers range up to two there are exactly four rows equal to two there's nobody else just the twos and that's it now between two and four like greater than two and up to including four there are for the fours there are four rows equal to four then there are i probably don't need to keep that voice up the entire time there are four rows inside that range above and beyond the fours and the other four rows inside that range would obviously have to be threes because that's the only other thing in the range higher than two and up to four so if you pick any number from two to ten sql server can be fairly confident that there are four cards that match same thing with suits if i look at the statistics by suit because i also created an index on suit sql server knows that there are 36 rows inside here it knows that there are nine clubs nine spades nine hearts and nine diamonds doesn't really know about the combinations yet so if i go in and start running other queries we'll start looking at say select star from the number of rows where denomination equals two there are four rows that have denomination equals two if i look at the execution plan you see how it says four of four what that means is we brought back four rows of unexpected four another way to see this is to hold your mouse over the table scan and look at estimated number of rows versus actual rows so if i look over here it says estimated number of rows each time that i do this is four actual number of rows is four so far so good estimates versus actual are really useful for when you're trying to guess how much memory is going to be required whether you should do seeks or scans which table should be processed first in an execution plan when i'm teaching my classes like mastering query tuning i talk about how you should check to see where estimates versus actuals start to drift off the further that not like you drifting off now in this webcast but the further apart that they get the more likely it is that you're having a problem when sql server estimates incorrectly either too high or too low that's when you start running into problems like he picks the wrong operations for the query doesn't uh re uh doesn't cre allocate enough cpu cores for it doesn't allocate enough memory that kind of thing ends up spilling to disk for example so right now he was able to accurately estimate that four rows were going to come back but what happens if we change the query what happens if i say go show me all the playing cards where denomination equals two and the suit is a heart there's only one two of hearts right two of hearts two hearts that beat as if i look over at the execution plan oh that's that's a little weird he says i actually brought back one row of unexpected two that's kind of weird if you hover your mouse over that table scan operator sql server says estimated number of rows per execution is two why well because he really doesn't know yet about the combinations between denominations of suit and denominator combinations of denomination and suit if i look at the statistics on dominate denomination not domination that's a totally separate webcast and you don't get to see that one for free insert only fans joke here so here here's my uh statistics on denomination sql server knows that there are four rows that are twos but he has no idea what percentage of them are hearts what percentage of them are clubs and so forth the only thing that this statistic is on is denomination and that's it but i can fix that i can create an index on denomination and suit or suit and denomination so right now just to remind you before i go and create a two column index here's what my stats look like my stats are only on denomination and nothing else if i now go in and create a set of two column indexes i'm creating an index on denomination and then suit and then suit and denomination because i like to swing both ways you can tell where this webcast is going and then if i go look at the stats again let's now go look at the stats on denomination and suit here my stats look different instead of just knowing data about the denomination sql server also now knows about denomination and suit kind of but not really because if i continue to look down in the histogram sql server still knows that if i'm looking for twos there are four of them but he doesn't really know how unique this is he is assuming a relatively evenly distributed evenly distributed distribution an equal distribution across each of the suits let's see how this affects our query results now when our data is distributed evenly our execution plan actually looks good now instead of estimating two instead of estimating uh two rows that are going to come back from here it only estimates one row is going to come back from here if i hover my mouse over that to see in a little bit more details it's even exact estimated number of rows per execution is exactly one that won't be the case here in a second as i start to change some of the data but here thanks to our introduction of two column indexes and or two column stats i get much better estimation when i'm filtering for two separate things this is part of what we have to do when we're doing query tuning index tuning statistics tuning is we got to figure out how we can teach sql server as much as we can about our data because the stat is all he has to go by when he's building these execution plans let's try another change to the data let's try loading another two of hearts in i've added just one more two of hearts and so now how does that affect my execution plans i didn't update my statistics all i did was insert a row now remember sql server statistics don't keep pace with every single change of the data sql server doesn't automatically update stats when just one single row goes in now i have two two of hearts and if i go look at the execution plan oh man that's not quite as good and if i hover my mouse over there to go look at the exact estimates the estimates are a little different sql server knows that there's another row in the table but he hasn't updated the stats so he's just like 1.02778 now overall he's correct there's going to be slightly more results on average across the board but he doesn't know where specifically more rows are coming in he doesn't know that i specifically loaded in a two of hearts so his stats are just the tiniest bit off to see why i can go ask what the stats say today's the day for one hit wonders i suppose uh so maybe they maybe that guy had two hits i'm not even sure but if i go look at what the stats say it's not like he has a high voice i don't know why i just did that i can't really scream like he does or else i'll blow out the microphone and y'all will get really upset so up here it still says that there were only 36 rows inside the table and it still says that there are only four twos so let's go update statistics to give sql server the best chance that he can of guessing how many rows are going to match you'll see that i did update statistics with full scan it's up to you to choose whether or not you want full scan or whether you want to do sampling or if you let sql server choose he'll choose sampling as your table size to start to grow and we talk about that in our statistics classes and how you make the decision on those now that the stats are updated let's go see what the stats say again now the statistics say we have 37 rows or 37 cards inside here and now he knows that there are five twos but does he know that there are two two of hearts no there's nothing in here that goes into specifics about specific rows and combinations you can't there's not enough space inside here your statistic is only one 8k page of metadata and that's it if you wanted row by row details like what the table actually has you would have to go through and have something that's the size of the entire table obviously that's not going to work with the size of your tables saying you have big tables size doesn't really matter it's how you use them and you use them poorly so if i go in and i run it now if i say okay show me the number of playing cards where denomination equals two and suit equals art i have two of them but what did sql server guess oh he only guessed one kind of if i hover my mouse over there he says well i believe that there's 1.35135 rows he doesn't know which suit has more rows in it he knows that two has five rows but he doesn't know if the extra one is a heart or a diamond or a spade or a club he just knows there are more twos now he doesn't have specifics about row by row outliers and you couldn't possibly track that on a single 8k page i hear from some friends of mine who work in other database platforms that in other database platforms you can determine how large the stats are like you can say for this table i want the megabyte worth of statistics or you can say for your big data distribution tables where performance is really important you could change the size of data that you use for statistics in sql server we can't do that we have filtered statistics so i could create statistics say just about the twos or just about the hearts or just about the two of hearts starts to get into some really complex stuff and i don't know a lot of people who do that successfully i know about a lot of people who do it in a crappy way especially because they don't understand that sql server is going to scan all this in order to build your filtered stats so got to be kind of careful about that so what if we then go in and try it for a different number now right now we were looking at two of hearts and the two of hearts said he estimated 1.35 if i compare that to the three of hearts oops again sounding another one like one of those webcasts if i compare it to the three of hearts there are only one of those and if i go in and look at the execution plan sql server does a good job of saying hey i know that there's about shouldn't it just be one we know that there are four threes and we even have statistics on both index and suit but that doesn't say exactly one well this also plays into something i talk about in mastering query tuning i'm really only worried about estimates errors when i'm more than 10x off like if it estimated one row came back but 200 rows came back we're probably going to run into a problem as we start to join this object to other objects but you'd be surprised how well sql server can recover from cardinality estimation errors that are less than 10x off sql server still does a decent enough job about being flexible on its plans and when i say flexible on its plans plans don't change once the plan is compiled i mean the shape of the plan that is what it is if sql server brings back way more rows than he expected he's not going to go re-juggle the shape of the plan which again other databases may do sql server doesn't do at least not yet which is kind of odd because if i look sql server even knew if i go back to the stats sql server knew when it was going into here sql server looked at this histogram bucket and said well for everything in between two and four there are four rows equal to four and there are four other rows inside that range four in the range means in between two and four which obviously that's only threes why wouldn't sql server guess exactly one well he doesn't know that your data is evenly distributed across four suits now when you think about playing cards you probably think about saying well i know that for every one number there's only one suit like there's only one two of hearts or there's only one three of diamonds you can tell sql server that too that's called a constraint you can implement constraints on your data again not restraints that's a totally different webcast over on my only fans but you can implement constraints on your data and that will tell sql server what to expect that's really useful when you know something is going to be unique or when combinations of things are going to be unique but then also you can't violate that you wouldn't be able to put two two of hearts in there sql server would stop you at runtime saying i'm sorry dave i'm afraid i can't do that speaking of which let's go through wild and crazy and let's make all of the twos be hearts so i've executed this now and all five of our rows were affected if i go in and ask hey sql server how many two of hearts do i have remember i didn't update enough rows to hit our threshold of a whole lot of rows being updated so sql server just flat out doesn't know that all the uproars that i updated have all gone hearts now and our updates get exc or our estimates get exceedingly weird this is just like that data warehouse loading process that i talked about if you're changing a bunch of data every night and then you immediately go in and query that exact same data your estimates are going to suck it's up to you to be proactive and say well the data that i just really care about changed a lot so i should probably go update my statistics and then when i go to run queries about say last night's data then i'm going to get beautiful what i don't get beautiful estimates hold on why not because let's update our statistics again with full scan just to make sure and then let's go look at the statistics if i look at the statistics by denomination and suit what you're looking at here is all sql server knows sql server knows that you have two rows equal to five and he knows what the average distribution is across any denomination and suit combo but he doesn't know that there's an outlier here for number two it's another interesting joke for my only fans channel here sql server just knows that there are two five rows equal to two it doesn't know that they're all hearts there's not enough space in here in the histogram to describe your outliers everyone else's data is kind of evenly skewed kind of but if i go back and run that select again to see what sql server estimates sql server estimates that come and zoom in here a little if i want to find all the twos that are hearts sql server just estimates that there's going to be 1.756 rows similarly if i go in and say three of hearts we know that there's only one of these there's only ever been one three of hearts and if i look over at the execution plan it looks right if i hover my mouse over it though wait what the fact that two's distribution is all skewed towards hearts also affects the estimates for everyone else when you're doing dual column estimation like either looking for all the twos of hearts or fives of diamonds whatever one group's outliers also affect everyone else's estimates now obviously here i'm working with an extremely small data set like 40 rows and i'm trying to describe how sql server does things at a much larger scale if your data is very evenly distributed none of what i'm talking about matters to you that much at all but no one's data is really that evenly distributed and the more outliers that you have the more you're going to struggle with challenges with this so let's come down to the takeaways on this zoom in down here and give you these so stats aren't as big as indexes in any one 8k page you've only got a minimum amount of data in there in order to describe your outliers as your data changes as you do inserts updates and deletes your statistics don't stay up to date in real time they stay up to date whenever sql server updates them automatically that's probably not going to be enough for you you're probably going to want to set up statistics updates jobs in order to keep these things in sync then even if you update your statistics all the time the outliers are going to cause you issues to learn way more about this doug lane has put together a whole series of videos i want to say like two three hours long describing how sql server statistics work you can go get that over on our youtube channel you can get there by going to brentozar.com go slash learn stats where i've also got the entire demo script that i walk through in this session in here this whole thing is mit licensed so you can go give this exact same demo to your team if you want or use it to explain why you're having some of the cardinality outlier issues that you're having hope you had fun and i'll see you in the other classes
Info
Channel: Brent Ozar Unlimited
Views: 25,592
Rating: undefined out of 5
Keywords:
Id: uqkf1VVZulg
Channel Id: undefined
Length: 35min 30sec (2130 seconds)
Published: Sat Aug 15 2020
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.