Azure SQL VM Reimagined: Measuring Performance & Monitoring Health (Ep.3) | Data Exposed Live

Captions
[Music]

Anna: Hi, I'm Anna Hoffman, and welcome to this episode of Data Exposed Live. We're super excited to have you joining us once again on this Wednesday, where we're continuing our Azure SQL Virtual Machine Reimagined series. So without further ado, I'm going to bring on our guests. Our first guest is David Pless, who's been on the show a few times now. David, thanks so much for joining us — can you tell us a little bit about what you do?

David: Well, let's see — I cover Azure SQL virtual machines, sizing and storage (the storage engine is a new area I'm focusing on), and performance tuning and optimization.

Anna: Awesome, cool. And we also have Pam joining us today. Hey Pam, can you tell us a little bit about what you do?

Pam: Sure. I'm the lead PM for SQL Server on Azure VM, and I work with David and a few other PMs on maximizing everyone's experience when running SQL in Azure VMs.

Anna: Awesome. So it seems like we have the two SMEs who can really give us an inside look at getting the most out of your VMs, so I think it's going to be a really good episode. The topic for today is measuring performance and monitoring health. I did mention this is part of an ongoing series I'm doing with the VM folks — Pam, for people who haven't watched the series, can you tell us a little about what happened in the previous episodes and what we're going to talk about today?

Pam: Sure, absolutely. One of the most common challenges we see in Azure VM is customers that move from on-prem and have some trouble choosing the right VM, choosing the right size, setting up storage — just making sure the VM is configured in a way that sets them up for success and gets them the best performance and the best price-performance ratio. We've had a few episodes so far: an overview of the experience of running SQL Server in an Azure VM; one on sizing and how to choose the right VM for your workload; and a deep dive on storage, which is so much different from what you do on-prem. Those were the first three, and today David will talk more about monitoring for sizing — again with that same theme of how you get the biggest bang for your buck running SQL in Azure VM.

Anna: Cool — I think that's something a lot of people want to know how to do, for obvious reasons. But to get into what we're covering today: what are we monitoring, and why are we monitoring?

David: That's a good segue to the core topic. When we get to PerfInsights we have a demo, and there we'll also be uncovering some of the storage topics — I just want to mention that while we're on the subject. So, on the question of what we're monitoring and why: we'll be monitoring metrics and performance at the core resource levels — CPU, memory, and definitely storage, because we want to size for storage. This starts from our source platform, but we also take that same monitoring, follow our deployment to Azure, and monitor there as well. The only thing that really changes, in some regards, is the methods we use to monitor.
We monitor the source because we first need to find a size that meets our needs. A large part of this is total cost of ownership — making sure we're within budget, but also making sure we don't need to resize, and that requires continuing to monitor. A lot of our ability in the cloud is to have a deployment that meets our performance levels but can also be resized to stay in line with our workloads. So the first and most important step for migrations is monitoring that source performance, and the other aspect is monitoring where we land — we have to do both. What we do on-prem for sizing carries over to monitoring how we're sized in Azure.

Continuing with our sizing methodology: when you size your application — we covered a lot of this in the other session, so I'll just touch on it here — we need to think about what we've purchased and sized on-prem, and why. This comes down to capex versus opex. On-prem we often have more hardware than we necessarily need: we size for three-to-five-year cycles where the hardware depreciates, and we want to use our budgets before we lose our budgets — we hear that all the time. We also often pool our budgets for larger, strategic purchases on-prem, like replacing the SAN, so we lean toward those five-year budget cycles. Because we purchase for longer cycles that way, it's a challenge when comparing on-prem to the cloud — we certainly don't want to go find a host sized the same way we had it on-prem. When you look at your page life expectancy counters or what's available to the machine, you're not trying to achieve the highest score like in an arcade game. You're trying to find a set of resources — memory, CPU — that meets the workload as closely as possible while still handling spikes when they occur. That's why we size, monitor, and then continue to monitor in the same way we did on the source platform. Anything to add on that, Pam? I know we've covered this before.

Pam: No, I think that was good. The main point you made is the one I was going to make: you oversize your hardware on-prem, but you don't want to oversize your hardware in Azure, because you're effectively renting it. You don't want to pay now for hardware you'll need five years from now — you want to pay now for what you need now. In five years you can resize; that's the benefit you get in Azure. When the CPU starts running hotter and hotter, you can size up. So it's best not to look at what you have on-prem for hardware, but at what your workload is actually using — and that's what this presentation is all about: how do you figure out what that workload is using?
David: Yeah, I describe it as flying close to the sun — and I mean that in terms of not leaving too much headroom: resources you're paying for but not using. You want to make sure you're doing that.

We've covered this before, so I'll touch on it lightly. We definitely want to use our memory-optimized series — compared to general purpose and confidential computing, and storage optimized; those are the three series we really pay attention to — and within that we target the M-series, the Edsv4, and the DSv2 11-15. Notice I'm starting with the Edsv4. Memory optimized is best for your enterprise scenarios; general purpose and confidential computing is, let's say, departmental workloads; and storage optimized is for your big data, ETL, and data warehousing. What's unique about storage optimized is its extremely powerful ephemeral storage, which gives you very strong I/O and throughput capabilities — however, it is ephemeral, so if you deallocate, you lose the data on those drives. It needs to be a scenario where you can recover the data from a source platform, such as another RDBMS or potentially an Azure Data Lake.

These are the machines I'm going to break down from a memory-optimized perspective — again, tagging the Edsv4. These are the memory-optimized series, the ones we definitely want to target. You have a consistent memory ratio — at least seven gigs of RAM for every core across the virtual machine — and all of these machines support premium storage and premium storage caching. The Edsv4 — we really like this series a lot. It has a very consistent mem-to-core ratio, and there are a couple of things I like about it: one is the scalability across the table, so if you're trying to scale up and down it meets your needs quite well; the other is that the IOPS and throughput at the storage level are really strong. As for the M-series, I'll say it this way: if you need the memory, look at the M-series — that's more your mission-critical and data warehouse scenarios. In summary, if you want deeper information on sizing, please see the previous Data Exposed episode on sizing. Pam, anything to add?

Pam: Only one point, and that's if you're very price conscious: we find the Edsv4 series has the best price-performance ratio for SQL Server workloads. Not only is it great for scaling — it's priced so much lower than the M-series, because it doesn't have quite as much memory, but for most SQL Server workloads it has what you need. We find that one-to-eight core-to-memory ratio is the sweet spot for most SQL Server workloads. So unless you're talking about a data warehouse or a really high-end OLTP system, the Edsv4 series is probably the right one, and it's the one I would start with if you're not sure where to go in Azure.

Anna: That's really helpful, especially because there are a gazillion different options — you've done a great job narrowing them down, and if you're still not sure, you've at least pointed us in the right direction of where to start.
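[Editor's note: for readers who want to pull the same family comparison programmatically, here is a minimal Az PowerShell sketch. The capability names (vCPUs, MemoryGB, UncachedDiskIOPS) are assumptions about what the compute SKU API exposes — verify them against your Az module version — and the region is a placeholder.]

```powershell
# Requires Az.Compute and an authenticated session (Connect-AzAccount).
# Capability names below are assumptions; inspect
# (Get-AzComputeResourceSku)[0].Capabilities to confirm on your subscription.
$region = "eastus"   # hypothetical region

Get-AzComputeResourceSku |
    Where-Object { $_.ResourceType -eq "virtualMachines" -and
                   $_.Locations   -contains $region -and
                   $_.Name        -like "Standard_E*ds_v4" } |
    ForEach-Object {
        # Flatten the name/value capability pairs into a lookup table.
        $caps = @{}
        foreach ($c in $_.Capabilities) { $caps[$c.Name] = $c.Value }
        [pscustomobject]@{
            Size         = $_.Name
            vCPUs        = [int]$caps["vCPUs"]
            MemoryGB     = [int]$caps["MemoryGB"]
            UncachedIOPS = $caps["UncachedDiskIOPS"]
        }
    } |
    Sort-Object vCPUs |
    Format-Table
```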
Anna: I know there are other considerations around your data, your log, your tempdb — it would be great if you could dive into how we think about making decisions there.

David: That's the next step — breaking down our storage sizing decision tree. What did we have before — was it 380-some-odd VMs that we broke down?

Pam: 375, I think — there are 375 different VMs.

David: So for those interested in how we broke those down, see the sizing session. Okay, let's talk about the storage sizing decision tree. These decision points — for data files, log, and tempdb — we leverage both for our tooling and for the approach we take to getting the storage identified. Let's talk about the data file first; we're going to walk through the methodology of sizing for our data files, the MDF files. Really it comes down to whether you're doing a current migration or not. If the answer is yes: what is the size of the data files? We size based on capacity, IOPS, and throughput — that's the way Azure managed disks are offered — and you can think of those as the knobs used to select disks, where any one of them can push you into the next tier. If the answer is no, follow the documented approach for SQL Server best practices — we've overhauled the documentation, so steer toward that — and take the recommendations from your vendor. Hopefully your vendor has IOPS, throughput, and capacity needs based on the number of users you have; every app will differ, but take those recommendations and apply them. At a high level, we recommend starting with your P30s and P40s, aiming for capacity, and leveraging our memory-optimized virtual machines.

The next step is applying your IOPS and throughput. If you already have the IOPS and throughput captured, you can match your capacity needs, the number of disks for your minimum VM option on the table, and your IOPS and throughput needs to a combination of those P30 and P40 disks in a storage pool. The storage configuration template makes this very easy to achieve for our marketplace image — Pam, I think we covered this one in the storage deep dive?

Pam: We did, in the storage deep dive, yep.

David: Now, if you don't have the IOPS and throughput, you'll need to obtain that information — that's a lot of what we're talking about in this series. The best bet is still leveraging your performance counters, and it's recommended to use a tool or process that can capture identified workloads — the nine-to-five, the nightly ETL, report hydration activities, end-of-week processing. PSSDiag, PerfInsights, Perfmon — they're all good methods to obtain these counters. I don't so much care which tool you use, as long as it's lightweight and you're capturing the identified workloads. Once you have your performance counter data, that maps you back to matching your IOPS and throughput needs to the P30 and P40 disks in the storage pool for your data files. This is with read-only caching, and we always recommend targeting the uncached IOPS and throughput of your VM from the Azure disk documentation. The cached IOPS and throughput can then be seen as a safety net.
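[Editor's note: a rough worked example of the "match IOPS and throughput to P30/P40 disks in a pool" step David just described. The per-disk figures are the published Premium SSD numbers (P30: 5,000 IOPS / 200 MBps; P40: 7,500 IOPS / 250 MBps); the workload numbers are invented for illustration.]

```powershell
# Hypothetical measured workload peaks, taken from Perfmon on the source server.
$requiredIops = 12000   # Disk Reads/sec + Disk Writes/sec at peak
$requiredMBps = 450     # (Disk Read Bytes/sec + Disk Write Bytes/sec) / 1MB at peak

# Published per-disk capabilities for Premium SSD.
$p30 = @{ Name = "P30"; Iops = 5000; MBps = 200; SizeGB = 1024 }
$p40 = @{ Name = "P40"; Iops = 7500; MBps = 250; SizeGB = 2048 }

foreach ($disk in @($p30, $p40)) {
    # A striped storage pool scales roughly linearly with disk count, so the
    # disk count is driven by whichever requirement (IOPS or MBps) is worse.
    $count = [math]::Max(
        [math]::Ceiling($requiredIops / $disk.Iops),
        [math]::Ceiling($requiredMBps / $disk.MBps))
    "{0}: {1} disks -> {2} IOPS, {3} MBps, {4} GB" -f $disk.Name, $count,
        ($count * $disk.Iops), ($count * $disk.MBps), ($count * $disk.SizeGB)
}
# Remember: the VM's own uncached limits must also cover these numbers.
```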
David: That safety net helps you handle any kind of bursting, so you can weather busy periods, and it helps prevent you from having to constantly resize to address them. You need to understand the customer workloads — again, know the workloads — and apply the counters to the Azure disk and throughput recommendations. We make this easier in the tools, and we'll be covering those soon. So you take these scenarios through the storage configuration tool, we get you into a storage pool with read-only caching enabled — and when we look at PerfInsights, you'll be able to see where our storage pool alignment and some of the storage best practices are applied.

That was data files. Now, for your log file — the transaction log — you'll have the same performance counter scenarios, but how the recommendations are applied is different. Again, it comes down to whether you're doing a current migration. If the answer is yes: does your workload require single-digit latencies for the transaction log? Most production environments do not have this requirement, but those that do — such as high-end, public-facing OLTP environments like ticketing, gaming, and mainstream retail systems — have unpredictable, large traffic patterns, and there can be storage configurations or considerations that need to be employed to address that. If you do require single-digit latencies, you have a couple of options: you can use Write Accelerator if you're on an M-series, or you can use Ultra Disk. I don't see most customers being in this situation, but when you need it, you need it, and you know that you need that low-latency scenario. With Ultra Disk, you have a single large disk that meets your space needs, with separate knobs for IOPS, throughput, and capacity. It's a little different from your premium disks: you don't need multiple disks, and often you don't want multiple disks, because you're trying to keep your disk count low to stay on the table — too many disks could force you into a larger VM than you'd otherwise be targeting. Something to keep in mind. Write Accelerator is only for the M-series, so if you're on the M-series and need single-digit latencies, use Write Accelerator; otherwise use Ultra Disk. And otherwise, this branch pushes you into the storage pool again — we don't need caching for log files, and you can go from a P30 all the way up to a P80. Why would you go to a larger drive for larger capacity scenarios? To reduce the drive count.

Anna: Quick question, David — for the M-series, why isn't there a branch that lands you in a storage pool?

David: That would be the same — it's the same for the M-series as well. I'm glad you brought that up: Write Acceleration is not a disk technology. It's a capability specific to the virtual machine type, and it's only for the M-series. You're not going to go find a Write Accelerator disk — it's tied to the capabilities of the virtual machine.
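[Editor's note: a minimal sketch of the "separate knobs" on an ultra disk that David mentions — capacity, IOPS, and throughput set independently. Resource names and numbers are placeholders, and prerequisites such as zone support and enabling ultra disks on the VM are omitted for brevity.]

```powershell
# Requires Az.Compute; all names and sizing numbers are hypothetical.
# Unlike fixed-tier Premium SSDs, ultra disks expose capacity, IOPS, and
# throughput as independent settings -- the three knobs discussed above.
$diskParams = @{
    Location          = "eastus"
    Zone              = "1"
    SkuName           = "UltraSSD_LRS"
    DiskSizeGB        = 512      # capacity knob
    DiskIOPSReadWrite = 20000    # IOPS knob
    DiskMBpsReadWrite = 400      # throughput knob
    CreateOption      = "Empty"
}
$diskConfig = New-AzDiskConfig @diskParams
New-AzDisk -ResourceGroupName "sqlvm-rg" -DiskName "sql-log-ultra" -Disk $diskConfig
```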
David: Pam, anything you'd want to add here before I move on to tempdb?

Pam: No, I think you've covered everything. There's just one thing I want to note on latency, actually. Latency for storage is definitely going to be higher in Azure than it is on-prem — it's just the nature of using remote storage. So try not to fixate too much on "I need to match the latency I have on-prem," because the hardware you have on-prem is probably super fast and a lot bigger than you need. You'll see a little higher latency in Azure; however, that's not necessarily going to impact the performance of your application. We've had customers say, "our latency is so much higher here" — they're using I/O tools to test, not SQL Server — but then when they actually test their application, they're not seeing any problem with it. Sometimes that disk latency isn't a big deal, because of the caching available in SQL Server — if you're leveraging read caching and all of that, and even caching in your application — that latency is very often buffered. So fixate on the performance numbers from your application: make sure you have good load tests in place and that you're measuring latency there, and don't worry so much about the direct latency to the storage.

David: Right — and also think about application latency overall; you can address that in different ways in the cloud as well. It's more than just storage latency.

Pam: Exactly.

David: So, last and certainly not least, tempdb — because if tempdb is sick, SQL is sick, right? That's one of our heuristics. For tempdb it really comes down to size, and it all depends on whether tempdb can fit on the ephemeral disk — because we're going to tell you, and we've done plenty of testing on this, to target the ephemeral disk for tempdb. I always recommend getting tempdb placed on that ephemeral disk if possible. The only reason you might consider placing tempdb on remote storage along with the data files is if your virtual machine doesn't have the capacity to store it on the ephemeral disk, and it's counterproductive to increase the size of the VM just to get a larger ephemeral disk. There are a number of ways you can track that — I really like using the Management Studio Disk Usage report — and just know that tempdb will typically grow right back to its working size after a recycle, so this is something you should be monitoring, watching the size over time. Otherwise, there are a couple of options to choose from. The best scenario is choosing a VM that has enough space to put tempdb on that ephemeral disk; if not, you're going to place tempdb on Azure Premium SSDs — your P30 and P40.
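[Editor's note: a minimal sketch of the tempdb placement David describes, assuming the default logical file names and a single data file; in practice you would enumerate all of tempdb's files in sys.database_files and move each one. The new paths only take effect after a service restart.]

```powershell
# Requires the SqlServer module (Invoke-Sqlcmd). Paths are hypothetical; the
# D: ephemeral disk must exist and the SQL Server service account needs rights
# to the target folder. Marketplace images handle recreating the folder after
# reprovisioning; on a self-installed VM you'd typically script that at startup.
$moveTempdb = @"
ALTER DATABASE tempdb MODIFY FILE (NAME = tempdev, FILENAME = 'D:\tempdb\tempdb.mdf');
ALTER DATABASE tempdb MODIFY FILE (NAME = templog, FILENAME = 'D:\tempdb\templog.ldf');
"@
Invoke-Sqlcmd -ServerInstance "localhost" -Query $moveTempdb

# Restart the SQL Server service for the new file locations to take effect.
Restart-Service -Name "MSSQLSERVER" -Force
```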
David: And for completeness — this is not necessarily what you'd see in most environments — if you're a hosting company where there's, say, a database for every user and you're using a virtual machine, the Lsv2 series may make sense because of the large ephemeral capabilities it has. Just be aware that this VM series does not support premium disk caching. Okay — is there anything else we want to add here, or should I continue?

Pam: I would go forward — just to say that the tempdb configuration is the same in a VM as it is on-prem. The difference is just where you put it, and the D: drive is perfect: it's there, it's free, and it doesn't hit your remote IOPS cap. That's the key point.

David: That's the point I didn't mention — it's not going to count against your uncached IOPS, which is really exciting.

Pam: Yeah — it saves that bandwidth for your data files, where it's important.

David: So let's talk about metrics for performance — the monitoring methods to capture. I'm going to touch on this lightly because we've covered it before. Just note that IOPS is the number of requests your application sends to storage in a second, and you map this to Disk Reads/sec and Disk Writes/sec in Perfmon — we have a slide that covers the counters, so no need to scramble to write these down. Throughput is the amount of data read from or written to storage per second — this is the data we're actually moving — and you map this to Disk Read Bytes/sec and Disk Write Bytes/sec. The I/O size — SQL Server controls the I/O size, but it is something you can monitor, with Avg. Disk Bytes/Read and Avg. Disk Bytes/Write in Perfmon. You can also get the latency, with Avg. Disk sec/Read and Avg. Disk sec/Write, and monitor that. The key thing here is to monitor both the IOPS and the throughput — and fortunately, in the PerfInsights diagnostics we're going to look at, we map this to your Azure disk scenario as well, so we'll see that example.

For your key counters — going back to the point I mentioned before — you're monitoring the working set: the resources SQL Server is actually consuming. I'm not interested in taking what you have on-prem and dragging it into the cloud; that's counterproductive. What I want is to find the best possible fit, whether platform as a service or a virtual machine, and be as close to that resource configuration as possible — that way you get the best value. And seriously, Pam, there are hundreds of counters, right? Even this list goes a little far — there are about 20 counters you really care about. If you get those, that's all you need.

Pam: Yeah, I agree. One thing I wanted to mention too: measure your data files, tempdb files, and log files separately. If they're on separate drives, make sure you monitor them separately, because they're going to be targeted to three different places and sized three different ways. It's important to measure them separately so you know the read throughput and the write throughput for each type of file and can size each area appropriately. Don't just dump everything together and measure one number for SQL — make sure it's separated.
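[Editor's note: one lightweight way to capture exactly these counters across an identified workload window is the built-in logman tool; the collector name and output path below are made up. LogicalDisk(*) keeps each volume separate, which also covers Pam's point about measuring data, log, and tempdb drives individually.]

```powershell
# logman ships with Windows; this creates a binary (.blg) collector sampling
# every 15 seconds, which PAL or SQL Nexus can analyze afterwards.
logman create counter SqlStorageBaseline -f bin -si 15 `
    -o "C:\PerfLogs\SqlStorageBaseline" -c `
    "\LogicalDisk(*)\Disk Reads/sec" `
    "\LogicalDisk(*)\Disk Writes/sec" `
    "\LogicalDisk(*)\Disk Read Bytes/sec" `
    "\LogicalDisk(*)\Disk Write Bytes/sec" `
    "\LogicalDisk(*)\Avg. Disk Bytes/Read" `
    "\LogicalDisk(*)\Avg. Disk Bytes/Write" `
    "\LogicalDisk(*)\Avg. Disk sec/Read" `
    "\LogicalDisk(*)\Avg. Disk sec/Write"

# Run the collector across an identified workload window (9-to-5, nightly ETL, ...).
logman start SqlStorageBaseline
logman stop  SqlStorageBaseline
```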
David: Good call. So, how do we get this data? Many of you may have had a CSS engineer run PSSDiag in your environment — it's really just a different template for SQLdiag, and it's publicly available. That will grab your Perfmon counters for you, and you can use SQL Nexus to analyze the output. You can also set up Perfmon on Windows to create that BLG file locally, on a schedule, and then use PAL — Performance Analysis of Logs — to analyze it; that's another analysis tool. Your Data Migration Assistant, DMA, will capture an aggregate of the counters — we'll look at an example of that at the end. PerfInsights will grab the BLG file as well and do some analysis on it in the source output. And like I said before, I don't really care how you get it, as long as it's relatively silent and not intensive on the machine being tested. Don't forget about SQL Insights — Anna, I think you have a session on SQL Insights recorded in the past — which can be leveraged for post-migration scenarios. And I mentioned Query Store here: it's really good supporting information to capture for query performance over time. Think of it as capturing your plan health — and it follows the database no matter where you move it. This is essentially what it's designed for: you enable it on-prem, and when you do your migration you can continue monitoring and see whether there have been any changes to your plan health post-migration.

Okay, so — tune first, or migrate? This is a classic trade-off, and it really comes down to skill, risk, control, time, and money. Do you have the skills to complete the work, or do you need to bring someone in? What's the risk of making changes while the application is in flight? How much control over the application do you have to make changes? If it's a vendor app, you may not have as much control as you'd like — something to keep in mind; you always want to work with your vendor. Do you have time to complete the work? We always recommend pre- and post-migration maintenance, but what we're talking about here is the time it may take to investigate a performance issue, deal with change control, dev test, and things like that. And then money: it costs money to tune, and time is money — you may have the skills, but the DBA's time is often monetized to some degree, and there's certainly a cost to consulting work if you need to bring in a resource. You may find that it's cheaper, in terms of Azure Hybrid Benefit and reservations, just to use the solution as it fits in your target platform. These are some considerations to think about: do you tune first, or do you migrate? You know my feeling on that, with the dirty garage slide, right?

Pam: Oh wow — don't take your junk with you when you move. Don't move your mess; clean up first.

David: Yeah — if you can, that's the preferred method.
Pam: Because you would be surprised: tuning five or six queries, maybe you can go a whole size smaller on your VM. Maybe you can reduce your IOPS and reduce your costs by a huge amount just by tuning your top 10 queries, for example. So it's worth spending a week or so asking, "can I just add a couple of indexes?" — and suddenly you can get a much smaller VM. It's worth looking into, at least a little bit.

David: I would just echo that. I always remember, a few years ago, attending one of Joe Sack's presentations at PASS Summit, and he said there are a lot of you — people watching the show — who can save a lot of people a lot of money in Azure. I always remember that when I think about how important this is.

Anna: I love that analogy, because I've seen people move, and that's the first time they've ever cleaned their garage. [Laughter]

David: Okay, so we're going to talk about Perfmon and PerfInsights. A lot of folks will start Perfmon this way: they just go into Perfmon, the first counter you get is % Processor Time, and then you go through the activity of right-clicking and searching through the counters. Even Pam and I — and we've done this a thousand times — are still hunting and pecking for counters. If someone says "hey, something's going on," whether from a troubleshooting perspective or a monitoring perspective — and a lot of what you all will be doing is baselining and monitoring, "how does this look right now?" — I don't want to be hunting for counters. So instead of doing it that way, open a command prompt — some of you may know this already — and launch Perfmon with /sys. When you do that, you'll notice I suddenly get a different view, this report view, and I already have counters populated. Why? perfmon /sys gives you the last view you had open — but more importantly, it allows you to save the set: you can click Save Settings As and save a template. So what I've done is build several templates of the counters I care about — for CPU, I/O and storage, memory, SQL statistics, tempdb, and my waits — each set up on the view I care about. For wait statistics I like the line chart view; for I/O and storage I actually prefer the report view; for CPU I may have a different view. It populates everything quickly — you can see I have workloads running down here making my box go crazy. That's what I'd tell you to do to make this much easier to deal with, for a quick at-a-glance view. Now, this won't save your BLG files, but it gives you a nice template approach: as you're looking at workloads — you may run a workload for five minutes, you may have these smaller tests — this helps you along your sizing activities with the counters we've been discussing. There are other switches as well, like /rel for Reliability Monitor and /res to look at your resources, but /sys is what allows you to save the local template file.
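[Editor's note: the launch switches David walks through, collected in one place; these run the same from cmd or PowerShell.]

```powershell
# Standard Perfmon launch switches:
perfmon /sys    # standalone counter view; reopens the last view you had and
                # enables "Save Settings As..." so you can keep per-scenario
                # templates (CPU, IO/storage, memory, SQL stats, tempdb, waits)
perfmon /rel    # Reliability Monitor
perfmon /res    # Resource Monitor view
# Note: /sys templates give you reusable at-a-glance counter sets; they do not
# record .blg files -- use a data collector (e.g., logman, above) for that.
```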
David: Okay, the other thing I wanted you to look at — actually, I may already have it open. I do. This is the PerfInsights output — and Anna, I think we have the documentation on PerfInsights linked. It's really easy to run: you run it through a command console, and the first thing you see when you open the output is your key findings from a performance perspective — what's going on, what's your health — in low, medium, and high severities. Here there are two highs, two mediums for CPU, and one low for Windows. I'm not going to focus on that too much, but I'll show you where it is: if you click SQL, you can see some key findings, like "databases are located on a physical disk that contains a mixture of data and log files," or "SQL Server instances are configured to use unlimited maximum SQL Server memory" — we don't recommend that. What I'm going to focus on — because I'm looking at this from a storage perspective; I told you earlier we'd hone in on this — is Storage. I click Storage and look at my high-resolution disk usage. You can see it breaks this down into I/O, throughput, and latency, and I have my disks plotted on this line chart. I can remove all the queue-depth series — you don't need to see that; it just makes the chart busier — and I can see where my workload was going just above this line here. That orange line is my P20 disk, so if I choose a P30 for this workload, I'm perfectly within that range. PerfInsights is a really good way of seeing how your virtual machine fits into your disk configuration. You can do the same thing for throughput — I'll remove the queue-depth series again to focus on throughput — and I do have a spike. I'm not going to tell you to tune for spikes; we can handle the spike, troubleshoot, and figure out what's going on with it. But again, you can see I'm within this range, aside from a bit of a spike, and you can see the throughput — 200 megabytes per second right here, which lines up with my P30, if you remember from the previous sessions Pam and I did. And for those really honed in on latency: this is your latency chart at the bottom. We said not to focus on this as much, but if you need it, you need it — and this is where to analyze it. Go ahead, Anna, sorry.

Anna: I was just going to make sure I'm reading this right. All the horizontal lines — those are the different thresholds you get with a certain storage level. And then I see blue, purple — different things being monitored down at the bottom. Are those just different workloads? How do I think about those?

David: Those are the different performance workloads popping up there — different processes — and what you're seeing in the horizontal lines is the fixed limits for these particular disks.

Anna: Okay, so those are separate disks on your host machine, right?

David: Right — each one of those is a separate disk, and I have workloads targeted at those separate disks.

Anna: Got it. So what you were saying earlier: under the P30 horizontal line, most of my workload is within that range, so I can take this and say a P30 would be fine?
David: Right, exactly. And I'm using this speed to hit these separate physical disks.

Anna: Cool. This is a useful tool.

David: Absolutely. And on top of that, from a configuration perspective — I'm just cherry-picking some specific scenarios, like the storage pool: normally you'd have to go run some PowerShell or go through the GUI, and I really like that this brings everything together. From a storage pool perspective, I can see what my configuration is for this particular E-series machine. I'm zoomed in a little here, but you can see the interleave is 64K, the disks that are part of this particular storage pool, and the size — when you zoom out, the columns won't be squeezed together like this. From a volume-map perspective, you can see what is tempdb, what is OS, what is a data disk, which pools they're tied to, the number of disks within each pool, and the disks that are attached. Down here, this is my transaction log, and this is my data files — you can see the number of disks involved: eight disks for my data files, two for my transaction log. And here's how the disk map is tied — physical disk to logical disk, whether it's a managed disk or not, and — this is probably the most important part, Pam; this is really good, right?

Pam: Yeah — the host caching. That's great.

David: Without this — I'm not saying there aren't other methods — but I really like PerfInsights for this reason: not just the overall health check, but also the picture it gives you. Findings-wise I'm perfectly healthy, but it's really nice to know my high-resolution mapping to disks, how my storage pool is laid out, how my volumes are mapped, and how my disks are mapped to those volumes.

Pam: So David, if you were seeing any sort of disk-level or VM-level throttling, would that show up here in the findings?

David: You would see that pop up in the findings, yeah. And it would also stick out like a sore thumb on your high-resolution disk usage report, because it would just be pegged at the P30 line or something like that.

Anna: Right, okay, got it. One final question on this: I see the Azure icon up there — is this only for things already in a VM, or can I use this before I migrate?

David: I actually ran this on-prem — if we had time I'd show you. You can run it on-prem, and I ran it on a VM as well. Obviously, if I'm running this on-prem and you don't have anything in a storage pool, that part will be completely empty. But you can run it on-prem to get an idea, and to me, on-prem, this high-resolution disk usage tab becomes really, really important.

Anna: Oh yeah. Cool — this is super handy, definitely something I would hone in on.
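[Editor's note: David mentions that building this storage pool yourself takes some PowerShell; here is a minimal sketch matching the settings PerfInsights shows — simple striping, 64 KB interleave, one column per disk. Pool, disk, and volume names are made up, and this runs inside the VM as administrator.]

```powershell
# Requires the built-in Storage module. Gathers all poolable data disks.
$disks = Get-PhysicalDisk -CanPool $true

New-StoragePool -FriendlyName "SqlDataPool" `
    -StorageSubSystemFriendlyName "Windows Storage*" `
    -PhysicalDisks $disks

# 64 KB interleave and one column per disk, matching the layout shown
# in the PerfInsights configuration tab.
New-VirtualDisk -StoragePoolFriendlyName "SqlDataPool" -FriendlyName "SqlData" `
    -ResiliencySettingName Simple -UseMaximumSize `
    -Interleave 65536 -NumberOfColumns $disks.Count

# Initialize, partition, and format with a 64 KB allocation unit size.
Get-VirtualDisk -FriendlyName "SqlData" | Get-Disk |
    Initialize-Disk -PassThru |
    New-Partition -UseMaximumSize -AssignDriveLetter |
    Format-Volume -FileSystem NTFS -AllocationUnitSize 65536 -NewFileSystemLabel "SqlData"
```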
David: So I'm going to go through this next piece pretty quickly. Azure Monitor is really big, but what I want to do is spend some time mapping our data elements to a table, touch on the metrics, show you the storage configuration, and then wrap up some other topics. Here is that Edsv4 that Pam and I absolutely love, so I'm going to map what you'd see in the documentation to some key scenarios. When you're sizing and leveraging our documentation, you want to match, like I said, the source performance metrics to the documented resource capabilities for your VM and disks. You match your on-prem IOPS and throughput to the max uncached disk performance — matching those metrics to the maximum uncached disk IOPS and throughput per second. These are the primary resources we want to size against: IOPS and throughput against the uncached disk performance. The next step is to map data file reads plus tempdb reads and writes to the cache limits, by matching them to the maximum cached and temp storage throughput. This is only available when read caching is enabled, and essentially it gives you roughly a 30 percent buffer of additional performance, reducing what counts against the uncached throughput — again helping you weather the storm, as we talked about before.

We'll also tell you to monitor % Processor Time. In the documentation you'll see the number of processors available per machine, plus the constrained-cores options that can help from a SQL licensing perspective — we covered that concept in the sizing video. It's recommended to size based on IOPS and throughput and not let CPU licensing drive your testing or the discussion. Monitor available memory for resource-usage efficiency — it's important to make sure SQL Server has the memory it needs to perform well. You can match the memory needs of the on-prem environment using Available MBytes, and look at the SQL Server memory counters — Target Server Memory and Total Server Memory: basically, how much does SQL Server want to consume versus how much is it consuming — that's the gap you want to monitor in the counters. Also check the error log to see whether a lot of memory has been paged out for any reason, and you can use Management Studio to look at memory health over time for a view of your ideal memory needs and overall stability. Ensure the number of disks does not exceed the maximum stated for your VM size — it's kind of a watermark for how you can scale against the VM table. Monitor your network bandwidth — we know drops and disconnects are bad; I'm not saying you need to be a network expert, but if you see that, work with your server and network teams. And of course, ensure that tempdb can fit on the ephemeral disk — that's really a capacity discussion.

Now that we've understood this — and again, this series builds on itself — let's look at your metrics and monitoring. You can think of Azure Monitor metrics a lot like Perfmon: metrics are numerical values that describe some aspect of the system at a particular point in time, while logs contain different types of data organized into record sets — telemetry such as events and traces are stored as logs, in addition to performance data, and they can be combined for analysis. What we're focusing on is the Azure Monitor metrics component that drives that piece of Azure Monitor.
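[Editor's note: a minimal sketch of watching the Total vs. Target Server Memory gap David describes, read from inside SQL Server; the counter names are the standard Memory Manager counters, and the server name is a placeholder.]

```powershell
# Requires the SqlServer module. On a named instance the counter object is
# 'MSSQL$<Instance>:Memory Manager' rather than 'SQLServer:Memory Manager',
# which the LIKE predicate below covers.
$q = @"
SELECT counter_name, cntr_value / 1024 AS value_mb
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Memory Manager%'
  AND counter_name IN ('Total Server Memory (KB)', 'Target Server Memory (KB)');
"@
Invoke-Sqlcmd -ServerInstance "localhost" -Query $q
# A persistent gap where Total stays well below Target suggests SQL Server
# wants more memory than it is getting; pair this with Available MBytes in Perfmon.
```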
David: Now, in particular — think about everything we've talked about. We've moved our workload to a virtual machine in the cloud, and now we're testing the cached scenario for both VM IOPS and throughput and disk IOPS and throughput. So you have your virtual-machine-level performance metrics — cached IOPS and bandwidth, and uncached IOPS and bandwidth, at the VM level — and you also have disk IOPS and bandwidth for the data disks as well as the OS disk. For either of these, VM level or disk level, if you reach 100 percent, that's basically your capping level — if you hit 100, you're capped. We covered this in the storage deep dive, but that's the core concept, and what I'm going to show you quickly is what it looks like from a performance perspective. I've been running this workload, and what I want to hone in on: my VM uncached bandwidth consumed percentage is right at a hundred percent, and this right here is my VM cached bandwidth consumed percentage — it was at 100 previously. So in this case my uncached bandwidth is completely capped and continues to be; my cached bandwidth, not so much. Now look at the disk level — my data disk bandwidth consumed percentage and IOPS consumed percentage. I've got a spike, I think, up to 65 percent, but otherwise at the disk level I'm not coming close to being consumed. My consumption is at the VM level, and that's a scenario where you either tune or resize the VM. If I had this at the disk level, that would be a little more difficult to address: I'd tune that scenario, look at my caching — a small change I can make — and if not, I'm basically looking at more of an architectural overhaul of my storage. Storage is definitely going to be more intensive to address than simply resizing a VM. Any thoughts, anything to add, Pam? Any questions, Anna?

Pam: One thing I wanted to comment on — the capping (sorry, capped, not throttled) — as you've mentioned in a couple of past presentations, David: just because you're being capped periodically doesn't necessarily mean you have a performance problem. If you occasionally see these metrics hit a hundred percent, stay there a few seconds, and come back down, you're probably okay — you probably don't need to resize; depending on how sensitive you are, the system can usually handle a few spikes like that. But something like David is showing here, where you're pegged at 100 — that's probably an indication you need to upsize. And in this case it's the uncached bandwidth that's capped, which means you need to go to something with higher remote throughput — the VM needs higher remote throughput. One thing you want to make sure of is that all of your drives that can have read caching have it turned on — because you see you've got plenty of cached bandwidth left there; you're not getting anywhere near your cached bandwidth limits.
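[Editor's note: a minimal sketch of pulling the capping metrics from this demo with Az PowerShell. The metric names below are the VM-level storage utilization metrics as commonly published; verify them with Get-AzMetricDefinition on your VM, and treat the resource names as placeholders.]

```powershell
# Requires Az.Monitor and Az.Compute. Resource group / VM names are hypothetical.
$vm = Get-AzVM -ResourceGroupName "sqlvm-rg" -Name "sqlvm01"

$names = @(
    "VM Cached Bandwidth Consumed Percentage",
    "VM Uncached Bandwidth Consumed Percentage",
    "Data Disk Bandwidth Consumed Percentage",
    "Data Disk IOPS Consumed Percentage"
)

foreach ($n in $names) {
    $m = Get-AzMetric -ResourceId $vm.Id -MetricName $n `
        -TimeGrain 00:01:00 -StartTime (Get-Date).AddHours(-1) `
        -AggregationType Maximum -WarningAction SilentlyContinue
    $peak = ($m.Data | Measure-Object -Property Maximum -Maximum).Maximum
    "{0}: peak {1:N0}% over the last hour" -f $n, $peak
}
# Sustained 100% on the uncached VM metrics = capped at the VM level (tune or resize);
# sustained 100% on the disk metrics = capped at the disk level (storage redesign).
```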
David: Right — so maybe you'd check and make sure that all of the data drives hosting data files have read caching turned on.

Pam: That'd be one thing you'd check. Another: make sure your tempdb is on the ephemeral drive, so it's not using those remote IOPS — the ephemeral drive goes against the cached limit, not the uncached limit. So there are a few configuration things to look at first if you're seeing this capping. Make sure you've followed the best practices — we have all of that listed in our documentation under the best-practices checklist. Follow everything in that checklist, and you may be able to address this capping without even changing the VM. Check that first, and if not, then yes, upsizing the VM is probably where you're at.

David: That's a really good point, Pam — and I think this is the example we gave in the storage discussion. If the application is requesting 10,000 IOPS but only getting back 1,500, that's such a ridiculous difference that the application user is going to notice it and suffer from it. But I think the limit for a D8s v3, for example, is 12,800 — so say the application is asking for 15,000 IOPS and the D8s v3 delivers 12,800. That difference isn't that much — is the application user even seeing it?

Pam: If they're not seeing it, then just because you're being capped doesn't necessarily mean it's a massive issue where you have to rush to resize or re-architect something. It's a very good point, because what your metrics show you is whether capping exists — yes or no.

Anna: And then you're set up to use all the things you've been learning to check. That's why I think this is so important: it's not only the monitoring — you have to understand the nuances you all talk about, like "because this is happening, you should check this," instead of just scaling up or resizing and either not addressing the problem or making a more expensive solution.

Pam: Yeah, exactly. And most of the monitoring in a VM — if you're already doing things on-prem, you can probably do the same kinds of things running in the VM. That's the great thing about running SQL in a VM: it's the same thing. Management Studio has built-in reports, like the performance dashboard reports — if you're already using those, keep using them; that stuff is all the same. If you were using Perfmon before, keep using it — you still have Perfmon in your Azure VM. Whatever you were doing for monitoring before, you can still use, but now you have these additional tools on top — the Azure metrics and PerfInsights — that give you more of the Azure perspective. Because the thing that's different in Azure is the infrastructure: SQL Server is the same, everything inside the VM is the same as on-prem, but the infrastructure is different, and how you address the issues is different — maybe you bump up the size of the VM, maybe you add more disks to your storage pool. Those are the ways you address it, but the ways you find the problems are very similar to what you're already doing on-prem.
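[Editor's note: a minimal sketch of the read-caching check Pam describes. Disk names are placeholders; note that changing host caching on an attached disk briefly detaches and reattaches it, so do this in a maintenance window.]

```powershell
# Requires Az.Compute; resource names are hypothetical.
$vm = Get-AzVM -ResourceGroupName "sqlvm-rg" -Name "sqlvm01"

# 1. Review current host caching per data disk (None / ReadOnly / ReadWrite):
$vm.StorageProfile.DataDisks |
    Select-Object Lun, Name, Caching, DiskSizeGB | Format-Table

# 2. Per the best-practices checklist: data-file disks should be ReadOnly,
#    log disks stay None.
Set-AzVMDataDisk -VM $vm -Name "sqlvm01-data0" -Caching ReadOnly | Out-Null
Update-AzVM -ResourceGroupName "sqlvm-rg" -VM $vm
```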
Anna: You may have a couple of things that are specific to an Azure virtual machine — like SQL Insights, or the Azure metrics, for example — but the core methodology does not change from what you're doing on-prem to the cloud. That's what I hear you saying.

David: Yes. And I would also stress: these are your source monitoring, shaping, and sizing activities, but those same methods should carry along with you once you've done your migration — one, to make sure "am I fitting the way I thought I would fit in the cloud," and two, to keep looking at resizing based on the workload. Some of the assumptions and conclusions you made previously may have a missed variable — or variables, as we all know, change — and if something changes, you need to be able to respond to it. So again, what we're looking at is these VM-level metrics.

Pam: And by the way, the reason we're spending all this time on I/O and throughput and not talking about many other metrics is because this is where customers get tripped up in VMs. That's why this whole series is focused around I/O and throughput: more than 50 percent of our performance cases are due to I/O and throughput issues in VMs. It's not that monitoring CPU and memory isn't important — those are just a lot easier than storage.

David: So, very quickly, I'm going to show the sizing tool. It's a console app that's part of Data Migration Assistant 4.5 — this is the methodology used by the console app in the Data Migration Assistant; you'll see it in the SQL assessments folder, and you can run that console app. It grabs an aggregate of performance every 30 seconds for as long as the workload is running, takes all of the candidate scenarios and matches them — think about the sizing tree I walked through, the memory consumption, CPU, all those kinds of scenarios — against the resource requirements and capabilities, and then ranks them based on price. It gives you that price ranking for SQL DB, Managed Instance, or SQL VM. Let's take a quick look — I'll bring this over and zoom in for clarity. Having already run the performance capture, if I run this you can see it's recommending an E8as v4: P30s with read-only caching for data; caching none for my log; and use the local ephemeral SSD — and it breaks down the recommendation reasons fully for this particular machine. I can also use the same performance capture and run this separately for Managed Instance or SQL DB — you don't need to do anything special — and this one is Managed Instance Business Critical. So it's just something really quick to look at that will also help you with sizing.
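[Editor's note: for reference, a sketch of driving that DMA SKU-recommendation console run from the command line. The flag names follow the DMA documentation of that era but may differ by version, and all paths are placeholders.]

```powershell
# Run from the Data Migration Assistant install folder. The counters file is
# the performance aggregate the tool collects while your workload runs.
# Flag names below may vary by DMA version -- check DmaCmd.exe /? first.
.\DmaCmd.exe /Action=SkuRecommendation `
    /SkuRecommendationInputDataFilePath="C:\dma\counters.csv" `
    /SkuRecommendationTsvOutputResultsFilePath="C:\dma\prices.tsv" `
    /SkuRecommendationHtmlResultsFilePath="C:\dma\prices.html" `
    /SkuRecommendationPreventPriceRefresh=true
```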
David: And again, remember our best practices: get the Perfmon source metrics, choose the right disk, isolate your files, enable host caching, and follow your I/O best practices — data compression, columnstore indexing; it's still SQL Server, after all. Target your ephemeral disk on the D: drive, and use the storage configuration wizard, which in your Azure Marketplace images will address 80 to 90 percent of the key things we recommend. And of course, continue monitoring with Azure Monitor — monitor, monitor, monitor. We said "always be tuning" before; now we're saying "always be monitoring." And take a look at our references — all the documentation has been overhauled, and we continue to work on it to make sure we're giving you the best advice.

Anna: Awesome. This has been a jam-packed show, and we've learned so much — I want to thank you, David and Pam, so much for coming on. To our viewers: all those references David just showed are in the description of the video on our YouTube channel, so head over to our YouTube channel and subscribe, check out the links, and like this video. And with that, we hope to see you next time on Data Exposed.

[Music]
Info
Channel: Azure SQL
Views: 364
Keywords: microsoft, developer, azure sql vm, azure sql virtual machine, azure sql virtual machines reimagined series, sql server, azure sql database
Id: Yo156-eb920
Length: 60min 32sec (3632 seconds)
Published: Wed Jul 14 2021