Working with Very Large Tables Like a Pro in SQL Server 2016 Part 1

Video Statistics and Information

Captions
Attendees are in listen-only mode. Good afternoon and welcome to the PASS Database Administration Virtual Chapter meeting. Today our topic is working with very large tables like a pro in SQL Server. This is part one of a two-part lecture with Guy Glantser. My name is Julie Bloomquist and I'm a volunteer for the virtual chapter. The second part of the presentation will be on August 9th at noon Mountain time; please join us for the second part of this presentation.

As I said, we are part of the PASS organization. There's a Business Analytics Day coming up on August 4th over in San Diego, and other items as well. You can register on the PASS site and attend the conference; for a limited time there's a discounted rate of $475. PASS Summit, the largest SQL Server conference, is going to be October 31st to November 3rd in Seattle, Washington. It has multiple tracks and is for developers as well as database administrators and managers. If you use the code VC15FPS6 you will receive a $150 discount on your registration.

Julie, yes, we still see your first slide. I don't know why it's not advancing. Do you see my second slide now? Yes. Okay, so here's the information about the Business Analytics Day, and here's the information on the PASS Summit. We don't see it; we see the slide about PASS looking for visionaries. It's not in presentation mode, it's showing Microsoft PowerPoint. Just start the slideshow from the beginning or from the current slide. So is that the PASS Summit slide? Yes, but it's not filling your desktop. You see this? How about now? Yeah, much better. Great. Do you see the GoToWebinar control panel? I see the slide about the PASS virtual groups. Okay.

We're part of PASS's virtual chapters, and there's quite a large number of them. You can take your PASS membership and associate it with any of the virtual chapters, and you'll get emails about the upcoming meetings. The 2017 PASS Directors-at-Large elections are coming up; if you're interested in running for a seat, please contact governance@pass.org. There are three seats up for election for the 2018-2019 term. PASS is also looking for visionaries to serve and help build the future of PASS, collaborating with industry experts and fostering a community of professionals.

SQLSaturdays are one-day mini conferences held throughout the United States, Canada, and the rest of the world. They give you a flavor of what the PASS Summit might be like, because they're a mini one-day conference, and the fees are usually very reasonable, just covering the lunch costs. PASS also has in-person local chapters as well as the virtual chapters, so if you go through your PASS account you can do a search and see if there are any local chapters in your area. If not, you might be interested in working with PASS to start one up, or sign up and attend the virtual chapters. A lot of the physical chapters also record or broadcast their meetings, so you might also sign up with a local chapter close by and be able to ask questions.

There are quite a few opportunities for volunteering with PASS. You can send an email to volunteer@sqlpass.org, and volunteerrecognition@sqlpass.org is for the Outstanding Volunteer of the Month; you can make recommendations and submit people for that. If you discovered this virtual chapter on YouTube, please visit sqlpass.org and become a member; it's free.
You can join lots of the virtual chapters and be able to download the supplemental materials and the scripts. There are many ways to stay connected with PASS.

So, we have Guy Glantser today. He's a Data Platform MVP, he is a leader of the Israeli PASS local chapter, and also the CEO and founder of Madeira Data Solutions. His career has been focused on the Microsoft data platform for the past 20 years, performing various database roles as either an on-site DBA, an external consultant, or a speaker. Guy is involved in many activities in the Microsoft data platform community. He occasionally speaks at community events such as PASS Summit, SQLBits, SQLSaturday, and user groups around the world. He also co-hosts the SQL Server Radio podcast. So Guy, I'm going to pass the presentation over to you.

Excellent, and I see your opening slide. Viewers, this is being recorded and it will be posted on the archive of the DBA virtual chapter website. Thank you. Guy will have stopping points throughout his presentation, so if you have a question, please type it in the question window.

Great, thank you, Julie, and hello everyone. So I'm Guy, and we're going to talk today about how to work with very large tables like a pro in SQL Server 2016, but not only 2016: I'm going to talk about new features in 2016, also some new features in 2014, and features in SQL Server in general that we can leverage in order to work with very large tables. Learning from the past, we decided to do it as a two-part series, two sessions. Usually when I speak I have lots of topics to share and lots of demos to show, and I always find myself running very fast to keep it on time. This time we're going to have enough time; I'm going to have enough time to show you everything, and we'll have time for questions. So today is part one, and we'll have part two on August 9th, as Julie mentioned already.

I just want to say something about the SQL Server Radio podcast. If you haven't tuned into SQL Server Radio, I invite you to go to sqlserverradio.com and join the podcast. My colleague Matan Yungman and I release a new show every two weeks, and we talk about everything around SQL Server: our own experiences, guests that we interview once in a while, community events, new features, new releases, everything around SQL Server. So if you're into SQL Server, and I guess you are if you're here, this might be of interest to you.

So again, the agenda for today: working with very large tables like a pro in SQL Server 2016. First I'm going to talk about what a very large table is, what it means, what counts as a very large table, what it means to work with a very large table, and what the aspects of working with very large tables are. And not just working with them, but working with them like a pro: this is where we're going to talk about all the features of SQL Server that we can leverage to do it like professionals. Like I said before, I'm going to talk about a few of the new features in the recent versions of SQL Server.

So let's start with: what is a very large table? For example, is this a very large table, let's say a five-gigabyte table? Do you consider it large? What about this one, a five-terabyte table? Of course the default answer is "it depends"; this is always the right answer to any question, I think. My definition of a very large table is a table that requires special care.
Meaning, if I'm a DBA and I have my database with all kinds of tables, and I do things in a certain way, I have some processes and operations to back up tables or databases, to rebuild indexes, to query the tables, et cetera. If at some point one of my tables becomes so large that my current processes and operations are not good enough, I need to do something else: I need to acquire new skills, I need to use new tools in order to cope with the size. This is the point where it becomes very large. For one person in one environment, a five-terabyte table might be considered very large, but for another environment and another person this might be considered a small or medium-sized table, where we can use the same processes, the same tools, the same skills, and do everything like we do with a one-gigabyte table.

So for the sake of this demo, what I mean when I talk about very large tables is a table that has a date/time column, and it's an ever-increasing column, so we have lots of inserts into the table with the values in that column increasing all the time. It's a kind of activity table, or log table. I think every database has such a table: if it's a data warehouse it would be the fact table; if it's an OLTP database it will be something like page views, messages, measurements, events, transactions, something like that. I think you know what I'm talking about. Those are usually the main operational tables in most databases, and they become very large very fast. So for the sake of this demo I'm going to use a PageViews table.

Let me switch to Management Studio. I have here my very large tables demo database with a few tables, and the main table is PageViews. It looks like this: a very simple table with an ID, which is a sequence, a URL (you can see very meaningful URLs), some reference code, some session ID, and of course date and time, which is ever-increasing. Currently this table holds five million rows. Again, five million rows might not be considered very large, but for the sake of this demo we're going to consider this table a very large table. And that's it, that's my table; the entire demo is going to be with this table.

So this much for the definition, but what do I mean when I say "work with a very large table"? I'm going to cover four areas of how to work with very large tables. First is how to manage the table. By managing the table I'm talking about the data: where we store the data and how we move data around, mainly archiving old data or deleting old data, stuff like that. Of course we're going to talk about partitioning. Then we're going to talk about how to query the table: when a table becomes very large there are some challenges regarding query performance, and we're going to talk about some features and things you can do in order to improve the performance of our queries against the very large table. I'm also going to talk about updating the table, not specifically the UPDATE statement, but changing data: inserting, updating, deleting, mainly inserts. I think the most challenging part about very large tables is the rate of inserts, and we're going to talk about the throughput: how we can increase the throughput and insert rows into the table faster. And the last thing is how to maintain the table in an efficient way. When I talk about maintenance of tables I talk about rebuilding indexes, updating statistics, backups, stuff like that. Again, when the table becomes very large it becomes more challenging to do maintenance, so we're going to talk about some features and processes to do that.
So those are the areas we're going to talk about, and let's start with managing the table. Back to the demo. What I'm going to do now is start with a metadata query against sys.partitions; I want to show you the current state of the table. Currently we have a single partition: the table is not partitioned, the clustered index on this table has a single partition, as you can see, and we have around five million rows at this point.

Of course, one of the first things we should do with a very large table is to partition it. The better way is to plan in advance and design the table in such a way that it's partitioned before we start using it, but even if we already have a very large table we can still partition it, and I'll show you right now how we can do this. Partitioning is a feature that has been around in SQL Server since 2005; there are some advancements and improvements in the recent versions, and I'm going to talk about some of them later on.

The reason we want to partition very large tables is not because we want to improve query performance. This is a common mistake that people make: partitioning is not meant to improve query performance against the table. In fact, in some cases it might even degrade performance, and we need to be aware of that and understand how to query such a partitioned table. The main reason for partitioning very large tables is to improve the management of the table, mainly the archive process, the sliding window. Let's say we have some regulation and we need to keep the last seven years of data in the table. If we have more than seven years of data, we need to archive or maybe delete the eighth and ninth year and all the data before the past seven years. If we don't have a partitioned table, just one single partition, then archiving, moving the data away, is going to be a heavy operation. It's going to be complicated to accomplish: we're going to have to implement some kind of looping mechanism to do it in small chunks, so that we don't lock the table for a long period or blow up the transaction log files. When the table is partitioned, it's going to be much easier, and I want to demonstrate how to do this here.

So I'm going to take this single-partition table and partition it based on days, so each day will have its own partition. The first thing we need to do is create a partition function, and I have a script here that does this dynamically. I didn't say it before, but I already have six months of data, around 180 days of data, currently in the table, so I want to split it into around 180 partitions. I can do it manually, I can do it inside a loop, or something like that; here I'm going to do it using a dynamic script that adds 181 boundary values, starting from tomorrow and going backwards 181 days. Let me execute the script. So basically I have the last 180 days, plus one partition for the prehistory (everything before 180 days ago goes into another partition), and one empty partition for the next day, for the future, actually for everything starting tomorrow and going forward.

So I have a partition function. The partition function is independent; it still has nothing to do with the table itself, and I still need to connect between them. Next, I'm going to create a partition scheme. The partition scheme maps each partition to a filegroup in the database.
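As a rough illustration, a daily partition function and scheme along these lines could look like this. This is a minimal sketch with assumed names (PF_PageViews_Daily, PS_PageViews_Daily) and an assumed datetime2 column; the session does not spell out RANGE LEFT versus RANGE RIGHT, but RANGE RIGHT matches the "starting value per partition" description, and the real demo builds the 181 boundary values dynamically:

CREATE PARTITION FUNCTION PF_PageViews_Daily (DATETIME2)
AS RANGE RIGHT FOR VALUES
(
    '2017-01-29', '2017-01-30', '2017-01-31'  -- ... one boundary per day, up to tomorrow
);
GO

-- Map every partition to the PRIMARY filegroup (separate filegroups per partition are also possible).
CREATE PARTITION SCHEME PS_PageViews_Daily
AS PARTITION PF_PageViews_Daily
ALL TO ([PRIMARY]);
GO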
Here I can basically say that each partition goes to a separate filegroup, or, like I did here, all partitions go to the PRIMARY filegroup. I'm not using filegroups in this demo, but it's possible to use different filegroups and maybe put them in different locations, on different disks, different devices. Maybe it's a good time to talk about storage tiering. One of the things we can do when we use partitioning is to put the recent partitions, the ones we use heavily, on fast storage, SSDs for example, and the cold partitions, the ones from the past that we access rarely, on different, slower storage. We can do this manually from SQL Server by using different filegroups, but this is quite complex to achieve because we need to move data around ourselves. Another way is to use storage that has this capability built in; there are all kinds of storage devices with storage tiering built in, and the storage device itself will move data around based on the frequency of access. But again, for the sake of this demo I'm using my laptop, and I'm going to put all partitions on the PRIMARY filegroup.

Now I need to convert my single-partition table to a partitioned table, and there are several ways to do this. One way is to create a separate table with the same schema, which is empty in the beginning and is going to be partitioned. We set up this separate table as the partitioned table, using this partition scheme and partition function, and then we can write a batch that copies the data from the original table to the new table. When we complete copying all the data, we can just rename the tables, and then we have a new table which is partitioned. This is going to be a very long process if we have a very large table, and it's going to take a lot of resources from the server. Also, if we want this operation to be online, it becomes more complicated, because we need to somehow track changes while we copy the data, and maybe perform a union between the two tables; it becomes much more complicated.

Another way to do it is to create another table, again partitioned, but with a single partition for everything up until today, and then we can simply switch the single partition from the original table into that single partition in the partitioned table. So we have one very large partition containing everything up until today, and from today going forward we start having partitions per day. This is a partitioned table, but for the history, for the past, we have a single partition for everything. This is a very simple operation, just a metadata change; it will not require a long process of copying data and it will not consume resources. But of course, if we want to access history and leverage the ability to archive old data day by day, we will not be able to do so, because all the history is now in a single partition.

Another way we can do it: we have a clustered index on the table right now; we are going to drop the clustered index, and then we are going to recreate it using the partition scheme. When we recreate the clustered index, it is now going to be created as a partitioned clustered index, and the table becomes a partitioned table. Of course this is a very massive operation, because it's a very large table; it will take some time, and basically we have downtime of the table, so this is not a viable solution if we want the table to be online 24/7. We need some maintenance window to do this operation.
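A minimal sketch of that third approach, using assumed constraint, scheme, and column names (the real demo table and key definitions may differ):

-- Drop the existing primary key (which is the current, non-partitioned clustered index).
ALTER TABLE dbo.PageViews DROP CONSTRAINT PK_PageViews;
GO

-- Recreate the primary key as a clustered index on the partition scheme.
-- The partitioning column (DateAndTime) must be part of the clustered key.
ALTER TABLE dbo.PageViews ADD CONSTRAINT PK_PageViews
    PRIMARY KEY CLUSTERED (Id, DateAndTime)
    ON PS_PageViews_Daily (DateAndTime);
GO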
But I'm going to demonstrate this approach. I have a primary key which is the clustered index right now, so I can't just drop the clustered index; I need to drop the primary key first, and then I'm going to add the primary key again, this time specifying that the primary key is a clustered index, and I'm going to put the clustered index on this partition scheme with the date and time column as the partitioning column. Something you need to bear in mind: if you want a partitioned table, you have to put the partitioning column inside the clustered index, the primary key in this case, so I have to add the date and time column into this clustered index. Let's run this one. It's going to take some time; again, it's five million rows, so it's not really very large, but on a much larger table it might take a while.

Once it's done, I'm going to run this very long query. Sorry, I didn't say it before, but maybe it's a good time to say that I will make everything available for download right after the presentation on my blog; I'll remind you at the end of the session, so you'll have access to everything that I show you here. This very long query accesses all kinds of system views in order to show us the partitioning of the table after this operation. Now we can see that we have 182 partitions: we created 181 boundaries, which means we have n plus 1, 182 partitions. The first partition is everything before we started to have data, before six months ago, and then we have partitions for each day. We can see the starting value of each partition, everything is in the PRIMARY filegroup, and we can see the number of rows in each partition. If we go all the way to the end, this is today; you can see that today the number of rows is smaller than the rest, because we're in the middle of the day, and we already have a partition ready for tomorrow. This is the current state of the table, and from now on every new row goes into the appropriate partition based on its date and time value.

Now let's talk about the archive operation. In this case, for simplicity, I'm going to just delete. I want to have a job that runs every day, deletes the oldest day (the 181st day and everything before it), and also prepares a new empty partition for the next day, and this is something that's going to happen every day. So, in order to delete the oldest partition and do it efficiently, I first need to create a table which has the exact same schema as my table, and not just the exact same schema in terms of the columns and the data types, but also in terms of indexes: it must have the exact same indexes. I'm going to use the same primary key, the same columns as the clustered index, and I need this in order to be able to switch partitions. I call this table PageViewsToDelete.
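A minimal sketch of such a staging table, assuming the PageViews columns described in the demo (the data types shown here are illustrative guesses):

-- Staging table for the partition to be deleted: same columns, same data types,
-- and the same clustered primary key as dbo.PageViews, but not partitioned.
CREATE TABLE dbo.PageViewsToDelete
(
    Id            BIGINT           NOT NULL,
    Url           NVARCHAR(400)    NOT NULL,
    ReferenceCode CHAR(10)         NULL,
    SessionId     UNIQUEIDENTIFIER NULL,
    DateAndTime   DATETIME2        NOT NULL,
    CONSTRAINT PK_PageViewsToDelete PRIMARY KEY CLUSTERED (Id, DateAndTime)
) ON [PRIMARY];
GO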
Then, what I want to do next is split the partition function: I want to add a new empty partition at the end of the table for the next day. So first I go to sys.partition_range_values to check what the highest boundary value currently is in the partition function; I'm going to show you this using this SELECT. Currently this is tomorrow's date, the highest boundary, or the starting date/time value of the last partition in our table. Then I'm going to add one day for the next partition, and I'm going to split. Before I split, I have to run this statement, ALTER PARTITION SCHEME with NEXT USED: I must declare what is going to be the next used filegroup for the next partition. Why? Because as soon as I create a new partition by using the SPLIT method, SQL Server has to know, if a new row now comes into the new partition, where it should be located, in which filegroup. So first I say the next used filegroup is PRIMARY again, and then I use ALTER PARTITION FUNCTION with SPLIT RANGE to add a new boundary, the one I just calculated. I'm going to run everything now, and if we run the metadata query again, we're going to see that we now have 183 partitions; we added another partition for the next day coming up after tomorrow. So that's for the future.

For the past, what I'm going to do is switch a partition. I'm going to switch the second partition, because the first partition is empty, it's the prehistory. The second partition is the oldest partition that contains data, the one I want to delete, and in order to do it efficiently I'm going to switch that partition into my ToDelete table. The ToDelete table doesn't have to be partitioned; it's just a single-partition table, and I'm going to switch the entire partition number two into it. This happens in no time, because it's just a metadata change, and then I can simply truncate the ToDelete table, and that's it: we deleted all the data, and now we are left with two empty partitions at the beginning. The first one was empty before, and now partition number two becomes empty as well. We don't want to leave all those zero-row partitions, so the next thing to do is merge the partitions. Again, I calculate the boundary value of the second partition and I run a MERGE RANGE operation on the partition function to merge the first two partitions into one. Because there is no data in them, there is no need to move any data around, so this is again a zero-time operation. Let's look at the result now. Sorry. Okay, you can see we now have a single partition again for the prehistory, which starts from the 29th or 28th, and just like before we have two empty partitions in the future. This is something I'm going to put inside a job and run daily. Of course we need to monitor the job to make sure it doesn't fail; we might want to create several partitions in advance for the next few days, and not just one at a time. There are all kinds of methods and approaches to that, but that's the general idea, and again, this has been around since 2005.

A new feature in SQL Server 2016 allows us to truncate a single partition. So instead of creating a table, switching the partition, and then truncating the other table, we don't need to do all that; we can simply truncate a single partition in a single operation. For example, here I'm going to truncate my table WITH PARTITIONS (2), so I truncate only the second partition, and this replaces everything that I did before, except for the merge, of course: I still need to merge the first two partitions. So now I did this, and that's it, it's done. I can show you again that we now have two empty partitions at the beginning, and I still need to merge, just like before, using the same code as before. And this is the end result of our partitioned table: we have a single empty partition at the beginning, and the second partition now begins at January 30. So this is how we should handle, sorry, very large tables: we should partition them. Of course this is an Enterprise feature, and we are now able to very easily perform archiving or deletes for single partitions, and to manage data, including doing storage tiering and things like that, like I mentioned before.
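Putting those pieces together, a daily sliding-window job might look roughly like this. This is a sketch only: the object names, the boundary-date logic, the use of partition 2 as the oldest populated partition, and the PageViewsToDelete staging table are assumptions based on the demo.

DECLARE @MaxBoundary DATETIME2, @NewBoundary DATETIME2, @OldestBoundary DATETIME2;

-- 1. Find the current highest boundary and add one day for the new, future partition.
SELECT TOP (1) @MaxBoundary = CAST(prv.value AS DATETIME2)
FROM sys.partition_range_values AS prv
JOIN sys.partition_functions AS pf ON pf.function_id = prv.function_id
WHERE pf.name = N'PF_PageViews_Daily'
ORDER BY prv.boundary_id DESC;

SET @NewBoundary = DATEADD(DAY, 1, @MaxBoundary);

ALTER PARTITION SCHEME PS_PageViews_Daily NEXT USED [PRIMARY];
ALTER PARTITION FUNCTION PF_PageViews_Daily() SPLIT RANGE (@NewBoundary);

-- 2. Remove the oldest populated partition (partition 2; partition 1 is the empty prehistory).
--    Pre-2016: switch it out to the staging table, then truncate the staging table.
ALTER TABLE dbo.PageViews SWITCH PARTITION 2 TO dbo.PageViewsToDelete;
TRUNCATE TABLE dbo.PageViewsToDelete;
--    SQL Server 2016 alternative: TRUNCATE TABLE dbo.PageViews WITH (PARTITIONS (2));

-- 3. Merge away the now-empty boundary between partitions 1 and 2.
SELECT TOP (1) @OldestBoundary = CAST(prv.value AS DATETIME2)
FROM sys.partition_range_values AS prv
JOIN sys.partition_functions AS pf ON pf.function_id = prv.function_id
WHERE pf.name = N'PF_PageViews_Daily'
ORDER BY prv.boundary_id;

ALTER PARTITION FUNCTION PF_PageViews_Daily() MERGE RANGE (@OldestBoundary);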
Let me just get back to the slides. So, table partitioning, like I said, facilitates data archiving scenarios. You can optionally spread partitions between different storage tiers, and it will help not only with archiving but also with all kinds of maintenance operations; we will show later, actually in the next part of this session, things like rebuilding indexes not for the entire table but partition by partition. And like I said before, it does not necessarily improve query performance, so if you want to employ partitioning because you want to improve the performance of your queries on the very large table, you might be disappointed; that's not the way to go. There are other ways to improve performance, and we're going to talk about some of them later on in this session.

Another feature that helps us manage the table, which is a new feature in SQL Server 2016, is Stretch Database, or stretch table. In 2016 we can stretch our table to the cloud. We have SQL Server on-premises and we have a local database here, and again this is our very large table, and we have hot data and cold data. Let's say we consider the last six months as hot data: we want the last six months to be present locally on the local server, and all the rest of the data, older than six months, is considered cold data. We only need to access that data once in a while, so we don't need to hold all of it locally on our own storage. What we can do is leverage Stretch Database to stretch the older, cold data into the cloud, into an Azure SQL Database. We're going to set up an Azure server and an Azure SQL Database, and then, by leveraging this feature, we can tell SQL Server to asynchronously, all the time, go and archive the data and move it to the cloud. The best thing about this is that it's completely transparent to the application: we don't need to change any code. As far as our application is concerned, we have a single table and it's all local, and we can retrieve data from the table whether it's hot or cold. If it's cold data, SQL Server will go behind the scenes, grab the data from the cloud, and bring it to us. Of course we're going to have more latency in that case, but the idea is that we do it just once in a while; as long as we query the hot data, it comes from our local on-premises server, so it's going to come with better performance.

I'm not going to demonstrate Stretch Database here, but I want to talk about how we set it up, so if you think this is appropriate for your environment you can go ahead and start playing with it, setting it up and testing it. The first thing I recommend is to run the Stretch Database Advisor; this is part of the Upgrade Advisor for SQL Server 2016. You can run this advisor and it will analyze your current database and identify candidate tables to be stretched, and also identify all kinds of blocking issues, some limitations that you might have that will not allow you to stretch your table. So this is the first thing you should do. Once you identify the appropriate tables and you know that you want to use this feature, the next thing to do is to enable Stretch Database, or as it's called, "remote data archive", at the instance level, using the sp_configure command. Once you enable that, you need to set up the Azure server: you need to go and choose an existing Azure server or create a new one.
Of course you need to have an Azure account for that. Once you set up the Azure server, you need to configure the firewall on the Azure side. Most of the things here can be done through the wizard in Management Studio; you don't really need to do everything yourself step by step, and of course everything can also be done using T-SQL. So you need to configure the firewall rule in Azure to allow your on-premises server to connect and access the Azure SQL Database.

The next thing you need to do is create a database master key in your local database; you need that in order to encrypt the credential that will be used to connect to the remote Azure server. You then need to create a database scoped credential, which tells SQL Server how to connect to the remote Azure server; again, this is going to be encrypted. The next step is to enable remote data archive at the database level, and this is where you tell SQL Server, for that database, which Azure SQL Database to access and which credential to use in order to access it.

Then you go to the table level. Now that you've identified a specific table in the database that you want to stretch to the cloud, you can optionally create a filter function. This is an inline table-valued function in which you instruct SQL Server which rows are going to be considered cold data, which should be archived. SQL Server will CROSS APPLY from the table to this function, and if the function returns a row for a given row in your table, that row will be archived and transferred to the cloud; if not, it's going to be considered hot data and will be left in your local storage. You don't have to create a filter function; you can decide not to have one, which means the entire table will be stretched to the cloud and stored only in the cloud. It's still going to be transparent: you'll be able to access the table as if it was a local table, it's just going to grab everything from the cloud each time. Maybe you have a scenario where you've already partitioned the data and you have one table containing only the hot data and another table containing the cold data, with some mechanism that moves data from the hot table to the cold table. Now you can take the cold table, stretch it entirely to the cloud, and have the Stretch Database feature do it for you, while the table still transparently appears to the application as if it was local.

The last step is to enable remote data archive at the table level, by telling it which filter function to use, if at all, and you also have the option to configure the migration direction. Outbound means data is going to be moved to the cloud; this is the normal operation. You can also pause data transfer, and you can also specify inbound, to move data back from the cloud to local storage. You would do this in situations where you need to set up things from scratch, or you want to do some troubleshooting or maintenance, stuff like that. The normal operation would be outbound.
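For orientation, the whole sequence in T-SQL looks roughly like this. This is a sketch based on the documented syntax; the database name, server name, credential, password, function, column, and cutoff date are all placeholders:

-- 1. Enable the feature at the instance level.
EXEC sp_configure 'remote data archive', 1;
RECONFIGURE;
GO

-- 2. Master key and database scoped credential used to connect to the Azure server.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password>';
CREATE DATABASE SCOPED CREDENTIAL StretchCredential
    WITH IDENTITY = '<azure_sql_login>', SECRET = '<azure_sql_password>';
GO

-- 3. Enable remote data archive at the database level.
ALTER DATABASE VeryLargeTablesDemo
    SET REMOTE_DATA_ARCHIVE = ON
        (SERVER = '<myserver>.database.windows.net', CREDENTIAL = StretchCredential);
GO

-- 4. Optional filter function: rows for which it returns a row are considered cold.
CREATE FUNCTION dbo.fn_PageViewsStretchPredicate (@DateAndTime DATETIME2)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN SELECT 1 AS is_eligible WHERE @DateAndTime < CONVERT(DATETIME2, '2017-01-01', 120);
GO

-- 5. Enable stretch on the table, with outbound migration.
ALTER TABLE dbo.PageViews
    SET (REMOTE_DATA_ARCHIVE = ON
            (FILTER_PREDICATE = dbo.fn_PageViewsStretchPredicate(DateAndTime),
             MIGRATION_STATE = OUTBOUND));
GO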
All right, so this was about managing the table. One option is partitioning; another option, instead of partitioning, is to use the Stretch Database feature in SQL Server 2016. Julie, do we have any questions so far?

Yes, we do. The question was: is Stretch Database an Enterprise Edition feature? And the same came up when you were discussing partitioning. Yes, partitioning is an Enterprise feature; actually, in SQL Server 2016 Service Pack 1 it became a Standard feature as well, so it's also supported in Standard, but other than that it's an Enterprise Edition feature.

Another question, really a comment, from Martin: you said that partitioning a table is not primarily meant to improve query performance; that's not his own working experience. He's seen many cases where partition elimination on large tables dramatically improves response time, especially since SQL Server 2016, where the query optimizer has become efficient at reducing response time using partition elimination. Okay, that's true. The concept of partition elimination is that when you have queries against the partitioned table with a predicate, a filter, on the partitioning column, SQL Server should be able to eliminate partitions and only access the relevant partitions to begin with, so it doesn't have to access the entire table, or all the partitions. This is true in most scenarios. Prior to SQL Server 2016 this didn't always work very well, and this is why in some cases you might even end up with worse performance than before: instead of accessing a single partition, which is the entire table, you are still going to access the entire table, but doing it across many partitions, and that can be even worse. In 2016 there are indeed some improvements, so partition elimination works better, and I agree that in some cases you can still improve performance; probably what this gentleman is saying is that in their case it was very meaningful, and that's great. But again, the goal of partitioning is not specifically to improve performance, but to improve the management and maintenance of the data. Even in 2016 there are many cases where you're still not going to see improvements; it depends on how you write your queries and what your predicates and workload are, and like I said before, you can still see situations in which performance gets worse than before.

Is there any index impact on the truncate of a single partition, or on a switch partition? The idea of truncating a single partition is, again, that it's a very fast operation; it's a minimally logged operation, not really deleting all the rows. It actually affects all the indexes on the table, and they all must be aligned, by the way; you cannot do this if you have a non-aligned index. With all indexes aligned, meaning all indexes on the table are partitioned in the same way, SQL Server only has to drop the specific partition from each one of the indexes, so this becomes a metadata operation and it's very fast. It does not impact performance or the indexes in any way; it just drops the relevant partition from all the indexes.

Is stretching tables a 2016 process only? Can you read it again? Is stretching tables a 2016 process? Oh, yes, it's a new feature in 2016; you cannot enable it in previous versions. Can we stretch a database to a different server on another cloud solution, e.g. AWS? Not at this point. I don't know if it's going to happen in the future, because of course Microsoft wants you to do it with Azure, but at this point you can only do it with Azure SQL Database specifically. Have you seen table stretching at work in a production environment yet? He loves the concept and just wanted to know how mature it is and how often other people are using it in the real world. Unfortunately the answer is no. I would really love to tell you yes, I saw it in production, but not yet, so I can't tell you how mature it is. It's a new feature.
Of course, we are working on some projects right now where we are considering using this feature, and we're testing it already. I know some others that are testing it already as well, but I still haven't seen it in a real production environment, not yet.

And: what should you take into consideration when you are trying to choose the range for a partition? Is having it daily, with over 180 partitions, really worth it? I'm not sure I understand the whole question, but I guess the first part is: what should you take into consideration when trying to choose a range for the partition, whether it should be daily or monthly, for example, if you're trying to plan that. Okay, so the main concern is how you want to archive the table: do you want to archive it day by day, or month by month, how do you want to move data around and manage the data. If you want to work with daily processes, it makes sense to consider daily partitions; otherwise it would be something else. Okay, that's all the questions for now, thank you.

Great. Okay, so let's move on. Let's talk about how we query this table, and I'm going to talk specifically about the ascending key problem. Let me set it up. For the rest of this demo, I want to simulate the ascending key problem by doing the following. First, I'm going to delete the last week from the table; let's go one week into the past, we just entered the time machine. So now, one week back, we deleted the last week, and let's say that we updated statistics, either manually or through an automatic update of statistics, again one week in the past. After the update of statistics, we go back to the present, so we insert the rows for the last week again, and I'll explain in a moment why I'm doing this. So this is the setup: the last time statistics were updated was one week ago, but we have data up until now.

Now I have this test query. Let's say this is a query in our application, and let's say that our hot data is the last week: most users, in most cases, query for the last week, although we have six months of data in this case, and maybe we have seven years, but we mainly need the last week. So here I'm asking for the rows from the last week from the PageViews table, and I'm joining it to some other table, the ReferenceCodes table, and I'm going to enable "Include Actual Execution Plan". We get approximately 192,000 rows, and if we look at the execution plan, what we can see is a clustered index seek on the PageViews table and a nested loops inner join into the ReferenceCodes table, where we have another clustered index seek. If we go to the clustered index seek on the PageViews table, what we see is that the estimated number of rows is one: the optimizer thought there was going to be only a single row, while the actual number of rows is what I showed you before. This estimate was completely wrong, and because it was completely wrong the optimizer said, okay, if we have a single row, let's use nested loops, and I'm going to perform the clustered index seek on the other table only once. But in reality it performs that clustered index seek 192,000 times, which means we're going to have that many logical reads. And if the table doesn't have five million rows but, say, five billion, you get the idea: it's going to be very, very poor performance. The reason this happens is that the last time statistics were updated was one week ago.
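The test query being described is something along these lines (a reconstruction with assumed table and column names, not the exact demo script):

-- "Hot data" query: the last week of page views, joined to the reference codes.
SELECT pv.Id, pv.Url, pv.DateAndTime, rc.Description
FROM dbo.PageViews AS pv
INNER JOIN dbo.ReferenceCodes AS rc
    ON rc.ReferenceCode = pv.ReferenceCode
WHERE pv.DateAndTime >= DATEADD(WEEK, -1, SYSDATETIME());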
According to the histogram, the last date in the table is July 19, so as far as the optimizer is concerned, there are no rows in the last week. This is why it estimates only a single row, while in reality we have a lot of rows in there; the statistics just weren't updated yet. This is the problem called the ascending key problem. It's very common: in most databases where we have such large tables with an ascending date and time column, we will have the same problem, the same phenomenon. The reason is that prior to SQL Server 2016, the threshold for automatically updating statistics is 20 percent, meaning if we have, for example, 1 billion rows and we have just updated statistics, we need to have 200 million rows updated or inserted before we get another automatic statistics update, and this might take a while, probably more than a week. So we're not going to have an updated histogram for quite a while, and the optimizer is going to get it wrong and produce poor plans. This is how it works by default prior to SQL Server 2016.

One thing we can do prior to SQL Server 2016 is to use trace flag 2371. This trace flag changes the behavior of this threshold: the algorithm is not a flat 20 percent anymore, it's a percentage that gets lower as the table gets larger. So with very large tables we're going to have a much smaller percentage, meaning the frequency of auto update statistics is going to be better; we don't have to wait a few weeks or maybe a few months until the next auto update statistics. It's going to improve things a little bit, and in SQL Server 2016 this has become the default behavior: if you use compatibility level 130, which is SQL Server 2016, you get this default behavior, where the threshold is not 20 percent but something that gets lower and lower as the table becomes larger.

Now, what I showed you: I'm using SQL Server 2016, but for the demo I used compatibility level 110, which is SQL Server 2012, on purpose, because what I showed you is how SQL Server 2012 behaves, and of course prior versions. If we change it to SQL Server 2014 or 2016, in this case 2016, one of the changes starting with SQL Server 2014 is that we now have a new cardinality estimator. This is part of the optimizer, the component inside the optimizer that estimates the number of rows using statistics, and prior to SQL Server 2014 it was the same component for many, many years. One of its problems is the ascending key problem, where this component estimates one row because it thinks there are no rows in the last week. There were some improvements starting with SQL Server 2014: we have a new cardinality estimator, and now, when I'm using the new compatibility level, if I run the same query without changing anything else, we hopefully get a much better plan. The same number of rows, of course, but now we have a hash match inner join, and if we look here we now see a much better estimation; it isn't accurate, but it's much, much better than before. It's not just one row anymore, and now the optimizer knows it should use a hash match join and not nested loops. We did this by just upgrading to SQL Server 2014 or 2016 and changing the compatibility level, not doing anything else, and now the optimizer does things much better.
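A minimal sketch of the two knobs mentioned here (the database name is illustrative; 2371 is a documented trace flag, typically enabled globally or as a startup parameter):

-- Pre-2016: make the auto-update-statistics threshold dynamic instead of a flat 20 percent.
DBCC TRACEON (2371, -1);

-- On SQL Server 2016: the dynamic threshold is the default under compatibility level 130.
ALTER DATABASE VeryLargeTablesDemo SET COMPATIBILITY_LEVEL = 130;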
Now, you have to be aware, and this is very important, that with the new cardinality estimator there are some drawbacks: there are some scenarios where you might get a worse plan than before, than with SQL Server 2012. It's quite common, actually, with our clients: they upgrade to SQL Server 2014 and suddenly encounter poor performance and don't understand why, and it's because of the new cardinality estimator. So you might get better plans in some cases and worse plans in other cases, and where you get a worse plan it can sometimes be dramatic. Of course, what you need to do is optimize those queries and make sure that under the new estimator you get a good plan. But as an intermediate measure, you can use trace flag 9481, which tells the optimizer, for a specific query, to use the old estimator. So although I'm using the new compatibility level, and with it the new cardinality estimator, I'm telling it to use the old estimator for this specific query, which produces the same nested loops plan as before. For those queries that you see do not behave as well with the new estimator, you can instruct the optimizer to use the old estimator, query by query.

Now, if we still have a problem with the frequency of the updates, what we can do, and in many cases this is what we should do, is create our own job that runs maybe once a day, or once every few days, and updates statistics manually. We can control it and do it with a full scan, for example. If we run it with a full scan, it means we scan the entire table and get a more accurate histogram, but this might take a while and consume resources; if it's really a very large table, this might not be practical. If I do that, we can now see the statistics are updated and the histogram contains the last value in the table, and if we run the query again, we get a more accurate estimate. We get the hash match again, sorry, and now we can see a much better value here. We can also use a sample rate; we don't have to use a full scan if it's too resource-intensive, but then with a sample rate it won't be as accurate. Maybe it's good enough, maybe not; you need to test it in your environment. Those are some options you can use as well.

Let me go back to compatibility level 110, which is SQL Server 2012, and let's simulate the same thing again: delete the last week, update statistics one week in the past, and then insert the last week again into the table. Again, since we're now using 2012 without trace flag 2371, we need to wait for 20 percent of changes, so we're not going to get another automatic statistics update on the table after we insert the last week. Let me prove it to you with DBCC SHOW_STATISTICS: again, the last value is July 19, a week ago. And this just shows that we go back to the estimation of one row, with the nested loops plan.

All right, another approach, and this is something we could do before, actually since 2008, is to use filtered statistics, and this works pretty well in some cases; it depends. What I can do here is create statistics on the date and time column and add a filter, a WHERE clause, only for the last week. So I want to create statistics only for the last week. They're going to be very, very small compared to the very large table, so it's going to be very fast to create the statistics; I can use a full scan without fear, and it's going to be even more accurate, because it's just the last week. The problem is you get an error: we cannot use a non-deterministic expression inside the filter of the statistics, and this is a non-deterministic expression. So what we need to do is dynamically calculate the last week, put it in there as a hard-coded value, and then dynamically create the statistics.
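A rough sketch of that dynamic creation (the statistics name, column name, and filter shape are assumptions; the point is that the filter must contain a hard-coded, deterministic value):

-- Build the "one week ago" boundary as a literal, because the filter of a
-- filtered statistic must be a deterministic expression.
DECLARE @since NVARCHAR(30) = CONVERT(NVARCHAR(30), DATEADD(WEEK, -1, SYSDATETIME()), 121);
DECLARE @sql NVARCHAR(MAX);

-- Re-create the statistics on a schedule: drop the previous version if it exists.
IF EXISTS (SELECT 1 FROM sys.stats
           WHERE object_id = OBJECT_ID(N'dbo.PageViews')
             AND name = N'ST_PageViews_LastWeek')
    DROP STATISTICS dbo.PageViews.ST_PageViews_LastWeek;

SET @sql = N'CREATE STATISTICS ST_PageViews_LastWeek
             ON dbo.PageViews (DateAndTime)
             WHERE DateAndTime >= ''' + @since + N'''
             WITH FULLSCAN;';
EXEC sys.sp_executesql @sql;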
This means that the filtered statistics are going to contain the last week at every point in time; well, it's the last week now, but as we move forward into the future they will contain more and more, because it's just all the data since a specific date. So now, if we look at the histogram, we have a much more accurate histogram containing only the last week.

Now I'm going to run the query again. I'm still at the SQL Server 2012 compatibility level at this point, and I want to show you that it's not using the filtered statistics; it still thinks it's going to get one row. Another thing I want to show: if I use dynamic SQL with parameters, or maybe a stored procedure with parameters, so we have a parameter for the date in the next query, then because we have parameters in there, SQL Server still can't use the filtered statistics. So in both cases SQL Server will not choose the filtered statistics. In the first case, with the non-deterministic expression, SQL Server can't use the filtered statistics because it has to have a deterministic expression in the query as well in order to use them, and this means we have to change our queries to use hard-coded values, built dynamically. In the second case, when we have parameters, SQL Server has to produce a plan and put it in the plan cache so it can reuse the plan for different parameter values, and since it cannot guarantee that for each parameter value it will be able to use the filtered statistics, it will never use the filtered statistics when parameters are involved.

So, if we change our code in the application, and in the query we use a hard-coded value which is contained in the last week, now SQL Server can identify that it has filtered statistics covering everything it needs, and it can use them and produce a very accurate, actually an exact, estimate.
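To make the distinction concrete, the two variants look roughly like this (a simplified sketch with assumed names and dates; the literal would be generated by the application or by dynamic SQL so that it stays inside the filtered range):

-- Parameterized: the plan must be reusable for any @since, so the optimizer
-- will not rely on the filtered statistics (prior to SQL Server 2016).
EXEC sys.sp_executesql
    N'SELECT COUNT(*) FROM dbo.PageViews WHERE DateAndTime >= @since;',
    N'@since DATETIME2',
    @since = '2017-07-19';

-- Hard-coded literal inside the filtered range: the optimizer can see that the
-- filtered statistics cover the predicate and use them for the estimate.
SELECT COUNT(*)
FROM dbo.PageViews
WHERE DateAndTime >= '2017-07-19';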
We can use trace flag 9204, which is undocumented. This trace flag shows us which statistics were used in the optimization process, so this way we can make sure that we are using the filtered statistics. If we look at the messages, here it is: we have the filtered statistics loaded, so we know they were used. And if we use the non-deterministic expression as before, with the same trace flag, we see that in that case, where we get the nested loops plan, we are not using the filtered statistics; they do not appear here.

Now, back to SQL Server 2016. In 2016 I run the same query, which has a non-deterministic expression, and one of the improvements in 2016 is that SQL Server can handle those situations: even if the expression is non-deterministic, SQL Server can identify that the actual value is contained within a set of statistics, and it can use the filtered statistics. This is something that is new in 2016. If I try to prove it, like before, with trace flag 9204, unfortunately nothing happens; this trace flag is not supported in 2016, and I guess this is a good demonstration of what "undocumented" means. But don't worry, we have another trace flag, 8666. We're going to enable this trace flag and run the query again. What this trace flag does is add debug information to the query plan, so now we can go to the SELECT operator, go to Properties, and here we have internal debug information. It's quite ugly, but somewhere in this text, if I just search for it, you can find the filtered statistics being used in this query. So this is how you can make sure it was used. Let's turn off this one. Also in 2016, if you use parameters, like in this example or in a stored procedure, this is again something SQL Server 2016 can handle: it will use the filtered statistics, and if you use a value that is not contained in them, it will recompile the plan and generate a new one. So again, we have a good estimation here, and we have a hash match in this case. Let's drop the filtered statistics, and let's summarize what we did today; we'll continue the rest of the demo in the next part of this session, on August 9.

So, to go back to the slides: this is the ascending key problem and some solutions we saw for it. We can update statistics manually with a job, once a day or whatever, either with a full scan or with a sample. We can use trace flag 2371 if we use a version before SQL Server 2016; in 2016 this is the default behavior. There's the new cardinality estimator, which can do magic in some cases and improve performance and the plan, like in the case of the ascending key problem, but beware: it can also make some query plans worse than before. You need to be aware of that and test your application when you upgrade to SQL Server 2014 or 2016. And we can also use filtered statistics, as I just showed you. Of course, with filtered statistics, since we can only use a deterministic expression, we need a maintenance job that runs daily, or maybe weekly, and replaces the filtered statistics with new ones for the last week, again and again. We can't just create them once, because then they will not contain just the last week, they will contain more, so we need to maintain this solution. Any questions?

There's just one more question, Guy. Auto create and auto update statistics is on by default on their system; he just wants to understand why we're updating the statistics again manually. Okay, so by default the database is configured to update statistics automatically, and like I said, before 2016 the threshold is around 20 percent of rows changed in the table, so it might take a long time until the next time statistics are updated automatically. Maybe we want to increase the frequency and control it, so we can do it manually, let's say once a day. Another reason is that by default the automatic update of statistics in SQL Server is done with a sample of the data and not a full scan, and if we want to update the statistics with a full scan, to have a more accurate histogram, we can do that with the manual operation. And that's it for questions.

Again, I'd like to remind everyone that this broadcast was recorded and will be posted on the DBA Virtual Chapter archive page, and please join us on August 9th at noon Mountain time for part two; I hope to see you there. Guy, we really appreciate your time on this. Excellent, thank you very much. Again, I'm going to upload all the materials to my blog; it's right here on the slide, so check it again in a few hours or maybe tomorrow, and you'll be able to download everything from there. Thanks again, Guy, and if you want to send me either the slide deck or this information, I'll try to post it to the website as well. Excellent. Right, thank you very much. Thank you, everybody, for attending, and we'll see you on August 9th.
Info
Channel: PASS Database Administration Virtual Group
Views: 12,933
Rating: 4.6862745 out of 5
Keywords:
Id: NKtwMOU_KnI
Length: 65min 3sec (3903 seconds)
Published: Wed Jul 26 2017