Introduction to SQL Server Failover Clustering

Video Statistics and Information

Captions
Welcome to A Gentle Introduction to SQL Server Clustering. This webcast is sponsored by Idera Software. My name is Kendra Little, and I work with Brent Ozar Unlimited. I'm a Microsoft Certified Master in SQL Server 2008, and I've been awarded the Microsoft MVP award, which means that I like to help people understand how to get the most out of their SQL Servers. I talk a lot about SQL Server; we do a webcast every Tuesday, and you can learn more about that at BrentOzar.com. I also like to draw, so I publish posters on different topics in SQL Server, and I'm excited that I've just launched a new poster on SQL Server failover clustering. So if you like what you see in this webcast today, go on over to brentozar.com/go/cluster, read more about SQL Server failover clusters, and download a poster to help you remember the concepts that you learn in this video.

Today you're going to learn about what pain points failover clustering addresses for people using SQL Server and how failover clustering can help you out, what it will do for you. We'll also talk about the problems that clustering doesn't solve and where you're still vulnerable even if you're using failover clustering, and also why failover clustering is pretty critical to learn, in terms of making sure that you understand some foundational tools that are used in new and upcoming versions of SQL Server for high availability and disaster recovery.

First, we are going to keep it simple. In today's talk I'm really talking about failover clustering used in a single site with a single subnet. Know that with recent versions of SQL Server you can make the concepts I'm talking about today even more complex, and once you learn the foundation and get the introductory stuff down, you can then expand on these things. We're going to start pretty darn simple. To get started, let's just figure out: what does failover clustering help us with? There are some really basic core pain points to keeping SQL Server available, healthy, and functioning properly that failover clustering addresses. Our data in our environment is very important, and SQL Server holds databases that hold this critical data. When we think about databases, we think of them in the context of a SQL Server instance. A SQL Server instance has user databases as well as system databases, and these system databases, which are master, model, msdb, and tempdb, are used to hold things that you define: who can access the SQL Server, using logins; what linked servers might be defined, in case distributed transactions are being used between SQL Server instances; and what maintenance jobs are set up. All of that is contained within the instance container.

If we don't have failover clustering installed, if we just install an instance in the normal way, we have a SQL Server instance that's sitting on a Windows operating system that's on a physical server. So here you see SQLNODE01 is a server, and there's an instance, just one instance, that's sitting on that server. Pretty normal, right? The databases are available on the instance, customers are accessing the data, everybody's happy, things are working, applications are moving along. Oh, but we've had some security updates come out, and we realize, okay, we should apply these security updates to our Windows installation, because that's a good practice. We've done our due diligence, we tested these updates in non-production environments, and we know that they're important; we want to make sure that our servers are healthy and protected. Hmm, well, we have a problem, because after applying security patches I always want to reboot the server, to make sure that if any DLLs have been updated or anything like that, Windows shuts down, comes back up, all of the changes are made, and nothing is left in a pending installation state. So even if a reboot isn't required for a set of updates, you can bet that I actually want to do it, just so I don't have any doubts about the state of things. But the problem is, if I've just got this one instance on the server and I'm rebooting it, my application is offline while things are down. My users are no longer able to get to the database, things aren't active, my applications are offline. That's kind of a problem. And I don't know if you've ever had this happen to you, but sometimes when you reboot a server it doesn't come back right away. Sometimes you're like, hey, did you even shut down all the way? Hopefully you've got something like a remote access card so you can see the console of the server and say, hey, what's up with you? But sometimes you're sitting there waiting for it to come back up, and it can take a long time. Meanwhile, your applications are offline.

This leads us to doing things like planning maintenance on the weekend, but in many situations, even on the weekend, things can take longer than expected, and many of our databases do have users around the world and processing needs where downtime on the weekend isn't even okay. So this can get really, really uncomfortable, not only for you, who's now burning lots of time on the weekend, but for customers. We also have the possibility that our server could go offline when we aren't planning it. There could be problems at the hardware level, where things are going bad and causing crashes. There could be problems at the operating system level, like registry corruption issues. Or we could just have something we want to update in planned maintenance. All of these things can cause outages, and a lot of times failures don't come when we want them to.

The pain point that we have is: wouldn't it be nice if our SQL Server instance wasn't so harnessed to a given physical piece of hardware, and wasn't even so harnessed to a given operating system? Wouldn't it be nice if our SQL Server instance could be independent and could move around, so that if there's a problem on that hardware or a problem on that operating system, our SQL Server instance can be free and can roam about the environment? Yeah, that would be really, really nice, and this is the pain point that failover clustering addresses for you. It helps the SQL Server instance move around a cluster of servers, so if there's a problem on one operating system or one piece of physical hardware, your SQL Server instance can be a little bit free, depending on how you have your cluster set up.

Here's a really basic drawing of a simple implementation of Windows failover clustering. This picture doesn't even have SQL Server in it yet. We've got storage, we've got a networking environment represented by the box labeled "switches," and we have two physical servers. We're going to call these nodes: one of them is named SQLCLU01NODE01 and the other is named SQLCLU01NODE02. Each of those servers has an operating system running on it, and you can log into Windows on those servers; they each have a name and an IP address. They are both accessing that storage, and this is critical: they can both talk to the storage. This is key to the concept of the first flavor of cluster we're going to look at.

To install SQL Server on the cluster, you have to first configure the cluster in the operating systems on these guys. It's a very specific Windows feature: you enable the feature in the operating system, and then you configure and create a Windows failover cluster in there. That sounds like a big deal, but honestly, on recent versions of Windows, even on 2008 and 2008 R2, it's actually not super hard. There are things like the cluster validation wizard and report that will take a look at your cluster, run all sorts of tests on it, and say: here's what looks good, here's what looks like a problem, and here's what looks really, really bad.
So there's lots of help from Windows in setting up the failover cluster itself. Then you install a SQL Server instance on the cluster, which SQL Server setup guides you through: you tell it, hey, I want to install on this cluster, you've got to put it on all the nodes, and it says, hey, I see your cluster, here we go, and it steps you through it. We here have a clustered instance, and it's named SQLCLU01A\SQLA. Now, that seems like a mouthful; I'll explain it in a minute. But the important thing to realize first is that it's that whole instance that's installed on the cluster, and it's going to stay together as an instance. It may leave Windows behind, it may leave a physical server node behind, but that instance stays together. This is actually really handy. There are some technologies, like database mirroring, that take individual databases, and you don't get things like the Agent jobs or the linked servers moving around. In this case the whole instance, the whole family, stays together, and if it needs to move, it will all move together.

When you talk to a clustered SQL Server instance, it's a little bit different than talking to a standalone instance. You connect with what's called a network name, or virtual name, and this is a name and IP address that's specific to that instance of SQL Server. You also have what's called an instance name, and this gets a little confusing because I've said the word "instance" multiple times. The instance name is just like a named instance in SQL Server even when there's no cluster: you can have a default instance or multiple named instances on a server. You actually do need both of these, and this confuses a lot of people at first; they're like, why do I need both? The network name is the name that you assign that goes with an IP address that will move around with the instance, so that it isn't just tied to the individual IP address and name of a single physical node, right? Because our instance can move, this is its virtual name or network name. And the instance name, which could be the default (you can have one default instance per cluster), is really how you identify this instance within its SQL Server family on the cluster, and it does need to be unique within the cluster. It goes with a port number, and you can actually have multiple instances on a cluster all using port 1433, even if they aren't the default instances. That's actually okay, since the port is tied to the IP address, and all the network names get unique IP addresses, so you can have them all use 1433 if you want. So we can access our instance in a way that's independent of physical hardware. That's the most critical thing to realize about this slide: the instance has a way you can talk to it that isn't just a physical node, and that's key.

So let's look at an example. It's time to install Windows updates. Now I've got my SQL Server instance on a cluster; how is this different than before? Let's say that I'm a little naive. If I were planful, I might actually plan my patching in a careful way: what I might want to do first is go ahead and install my patches on SQLCLU01NODE02, reboot it first, and let it sit there for a little while, at least an hour, to make sure nothing weird appears in the log. Hopefully I've tested the patches elsewhere, but maybe I don't have identical hardware elsewhere, so if I'm really smart, I'm going to specify my patching order in a specific way. But even if I don't do that, even if I'm kind of naive, one of my first steps is just, hey, I'm going to reboot node 01 to get the instance off of it so I can patch it, right? And maybe it's not you who's doing the patching; maybe it's somebody else whose job is to work on the weekends while you're taking the time off.
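To make the naming concrete, here's a small sketch (plain Python with a hypothetical helper, not a real database driver) of why a client-side server specification stays tied to the virtual network name rather than to any physical node name:

```python
def connection_string(network_name, instance_name=None, port=None):
    """Build a SQL Server style server spec from the cluster's virtual
    network name. Clients target the network name (plus optional instance
    name or port), never a physical node name, so the string survives
    a failover unchanged."""
    server = network_name
    if instance_name:
        server += "\\" + instance_name   # named instance: NETNAME\INSTANCE
    if port:
        server += "," + str(port)        # explicit port: NETNAME,1433
    return "Server=%s;Database=MyAppDb;Trusted_Connection=yes;" % server

# Whether the instance currently lives on NODE01 or NODE02,
# this string never changes:
print(connection_string("SQLCLU01A", "SQLA"))
# -> Server=SQLCLU01A\SQLA;Database=MyAppDb;Trusted_Connection=yes;
```

The database name and helper are made up for illustration; the point is only that nothing in the string identifies a physical node.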
So I send a reboot command to node 01, and Windows says, hey, okay, I'm going to shut down. It notifies our failover cluster: you know what, I've got to reboot, I've got to go down, you should really take this instance and do something with it. The failover cluster looks around and figures out, hey, node 02 is up over there, I'm going to go ahead and move this SQL Server instance to it so that people can continue to access it. That SQL Server instance starts to prepare to move, and it does this by shutting down the SQL Server services that are on node 01. The reboot starts happening, my SQL Server services shut down, and during this period you can't access the SQL Server, right? There's a transition period where my SQL Server instance is no longer on physical node 01 but isn't yet up on node 02, and during that time my applications can't talk to the SQL Server. If a transaction was open and wasn't completed at the time SQL Server needed to shut down, then that application needs to say, oh, this transaction wasn't successful, I'm going to need to retry it.

Once services get started up on node 02 (my SQL Server services have come up because the failover cluster manager on node 02 said, hey, okay, start up the SQL Server services here), the system databases come online: master, model, msdb, tempdb. Even after that happens, I still need to recover my user databases. My user databases have to come online, and for each of them SQL Server has to say, okay, I'm going to look through your transaction log and figure out what transactions need to be rolled forward and what transactions need to be rolled back, so that I can bring these databases online in a consistent state. Transactional consistency is super important in SQL Server. So we go through a period of recovery, and during this period your application can't do writes to a database that's in recovery; it's in a state where it may not yet be fully online.
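Because an in-flight transaction is rolled back during that transition, applications typically wrap their work in retry logic. Here's a minimal sketch (hypothetical code; the outage is simulated with a flaky function rather than a real connection):

```python
import time

def run_with_retry(work, attempts=5, base_delay=0.01):
    """Retry a unit of work that may fail while the instance is failing
    over. The whole transaction is retried, since an interrupted
    transaction gets rolled back during recovery on the new node."""
    for attempt in range(attempts):
        try:
            return work()
        except ConnectionError:
            if attempt == attempts - 1:
                raise                                # out of attempts
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff

# Simulate a transaction that fails twice mid-failover, then succeeds:
state = {"calls": 0}
def flaky_transaction():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("instance not yet online on the new node")
    return "committed"

print(run_with_retry(flaky_transaction))  # -> committed
```

Real drivers raise their own transient-error types, but the retry-the-whole-transaction shape is the same.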
One of the questions we frequently get is: how long will this failover period take? How long will it take things to fail over? The duration is going to vary depending on how long this redo/undo process takes, which in part is going to be due to the nature of what your application does, how heavy the workload is at the time, potentially the configuration of your transaction log's virtual log files, how frequently you do transaction log backups, and when the last checkpoint was. Lots of factors play into that. So if you do this at a really busy time, failover could actually take longer, and you might still want to plan maintenance to happen on the weekend so that this downtime and recovery happen at a lighter-volume time.

After recovery completes, hey, cool, we are back online. Our instance is now on node 02, and note that my network name is still the same. I can still talk to my instance as SQLCLU01A\SQLA; that hasn't changed. My network name and instance name moved across the cluster with the instance, so all my connection strings stay the same. My application is like, oh, there you are over there, I'm going to keep on going, and my customers are happy again. This is a really, really helpful feature. If I have a sudden hardware failure, or if I suddenly realize that I need to do something to one of the nodes in that cluster, or even both nodes, I now have a way that I can strategize: okay, I'm going to test this first over here, then do these steps to validate that it worked, then move things around, and then do it over there. There are even tools that you can use to put a variety of things into either a maintenance mode or a frozen mode to make them safer as well. So there are lots of things you can do to make changes and unexpected events easier.

When you get to know failover clustering, one of the things you're going to learn about is quorum. Quorum has to do with part of the logic about how different components in the cluster get to vote.
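The failover we just walked through, an instance leaving its node behind and restarting on a survivor, can be sketched as a toy model (pure Python, with made-up names matching the examples above; not a real cluster API):

```python
# Toy model: a clustered instance runs on exactly one node at a time,
# and the cluster restarts it on a surviving node when its host goes down.
class ToyCluster:
    def __init__(self, nodes):
        self.online = set(nodes)        # nodes currently up
        self.instance_host = nodes[0]   # node hosting the SQL Server instance

    def node_down(self, node):
        """Simulate a node going offline; move the instance if it lived there."""
        self.online.discard(node)
        if self.instance_host == node:
            survivors = sorted(self.online)
            if not survivors:
                raise RuntimeError("no surviving node: instance is offline")
            # The cluster starts the services on a surviving node; the
            # network name follows the instance, so clients are unaffected.
            self.instance_host = survivors[0]
        return self.instance_host

cluster = ToyCluster(["SQLCLU01NODE01", "SQLCLU01NODE02"])
print(cluster.node_down("SQLCLU01NODE01"))  # -> SQLCLU01NODE02
```

The model skips everything that makes real failover take time (service shutdown, redo/undo recovery); it only captures the placement logic.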
The cluster gets to decide: hey, where should this resource go, and should I stay online? In our current configuration we might have a Node and Disk Majority quorum. Each of these nodes, the physical servers (I have two of them here), gets one vote, and then my disk witness gets a vote too. That's a total of three votes, a nice odd number. If one of these members is down, like node 01 there, I've still got two voting members: hey, I see you, hey, I see you, we're good. As you look at doing more complex clusters, this starts getting more complicated. You can bring in things like file shares to start voting; you can do all sorts of fancy things depending on the number of nodes you have. The good news is the cluster validation wizard will give you some advice about what makes sense, and SQL Server actually gives you some advice for some of its features about what makes sense. Just know that you should always research quorum, and that you should look at the version of Windows you're using as well as your version of SQL Server and the features you're using, whether that's just simple failover clustering or some of the availability groups features we're going to talk about later on.

One of the really cool things that's happening is that current versions of Windows, meaning Windows Server 2012 and later versions like the upcoming Windows Server 2012 R2, are introducing new features for how quorum is done and how votes are counted when certain nodes are offline. These especially make a huge difference if you have larger clusters, so very, very cool changes are coming. Whenever you set up a cluster, thinking about how you're going to set up quorum is key to understanding what happens if a component fails, because what the cluster does will vary depending on how many members you have, how many are online, and how your quorum is set up.

There are many, many design options for clustering.
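The vote-counting idea can be sketched in a few lines (a simplified majority check, not Windows' actual quorum engine, which adds refinements like dynamic quorum in newer versions):

```python
def has_quorum(votes_online, votes_total):
    """Majority quorum: strictly more than half of all configured votes
    must be online for the cluster to keep resources running."""
    return votes_online > votes_total // 2

# Node and Disk Majority with two nodes and one disk witness: three votes.
total = 2 + 1
print(has_quorum(2, total))  # one node down: 2 of 3 votes online -> True
print(has_quorum(1, total))  # a node and the witness down: 1 of 3 -> False
```

This is also why an odd total is nice: with three votes, losing any single member still leaves a majority, whereas a two-vote cluster with no witness loses quorum the moment either member disappears.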
And you can put more than one SQL Server instance on a cluster, too. So here we have a two-node cluster (node meaning physical server): SQLCLU01NODE01 and SQLCLU01NODE02 are still our servers, we still have the switching and network environment, and they're both still using the same shared storage, so both of these guys can access that storage. But I have two SQL Server instances here. Hmm, what is this like? I tend to get a lot of questions about this, because people will ask: well, if I install two SQL Server instances, can they somehow use the same database? What people are usually trying to ask there is: is there a way to kind of scale out reads and access the same database on two physically different servers of the cluster? The answer to that is no. What you see in this picture is two instances. We have SQLCLU01A, and it has some user databases on it. We also have SQLCLU01B at the top, and it has totally different user databases on it. Each of these instances will use its own LUN, or chunk of storage. They are not using the same parts of that storage. They may well be using the same storage array, or SAN, or storage device to manage that storage, but they each have different chunks of the storage sitting under their databases, and their databases are unique. We aren't reading the same data on those instances at all. So we can have more than one instance, but they have totally separate databases, and they operate independently of each other. Each instance has its own logins, its own linked servers, its own Agent jobs; you've got to configure all of that on each instance.

The reason you would really want to do this, and the reason I find people are interested in this, is that people want to use the hardware. In our previous example we had two physical nodes, two physical servers, and we were only ever using one at a time. Sometimes people look at that and say, you know, I'd really like to use that server over there. Well, there are some trade-offs to doing that. Yes, we can do this and try to get the most out of our hardware, but in this case we do have to buy an extra license, right? Each instance that we actively use, we need to license. The general big-picture deal with licensing is that you can get one idle node for free. So if I just have one active instance on this failover cluster, I just need to pay for one license, and that license kind of moves with the instance to one other node, so I get one idle node for my instance. Now, if I have two active instances, yeah, I've got to actually pay for those. You've got to pay for what you use; you only get kind of one freebie per instance, right? If you want more details on that, you need to look at the licensing white papers, because it does get a little bit complicated in there. So now I've got two instances, I've got to license each of them, and often licensing is pretty expensive.

Now, the good news is you can actually use Standard Edition if it's the right fit for you. One of the myths out there is that failover clustering only works with Enterprise Edition SQL Server, and the good news is that's just not true. SQL Server Standard Edition does allow you to cluster. You can only use a two-node cluster, right, I can't put it on an eight-node cluster, but to be honest, I might not really want to. When I design clusters, many, many nodes aren't necessarily fun to manage, and the most common cluster implementations I see are actually clusters with two physical nodes; that's one of the most common ones out there. So Standard Edition can be clustered. Of course, there are other limitations that come with Standard Edition, so you need to make sure those work for you.

But licensing is one of the more expensive things in my environment, so then you get into a situation where people are like, okay, my hardware, that's expensive, and I had to pay for all this licensing, so I want to get a lot of use out of these instances, and they end up being really busy. Well, the problem with that is, let's say I do have a problem on one of my nodes and I need to keep it offline. Maybe it's an accidental failover, maybe it's planned maintenance; it doesn't really matter. For the period where both of my instances need to run on a single node, if those instances are busy instances, this may not be so good for my customers. Maybe I'm online, but if performance gets really, really bad, my customers may feel like I'm offline, right? If both of those instances use more CPU resources and more memory than node 02 can provide to both of them, it can get kind of ugly. So this situation of multiple instances on a failover cluster can be okay, but running all of my instances on one node's hardware maybe isn't really an option in all situations. You've got to really be careful about how much risk you can tolerate in failure scenarios and what limits of performance you can reach: will my customers really be happy if we have to run like this for a day, for a week, for longer? It could be a big deal.

Clustering also doesn't solve every single problem that you have. It doesn't prevent failure in all sorts of ways. It will help you with many kinds of failure, but even with clustering we still have a single point of failure. When we have a failover cluster, your instance can move from one piece of hardware to another piece of hardware, but it's accessing the same data on that shared storage resource. The individual servers don't have a copy of the data on them, right? We have that shared storage, and wherever our instance moves, it's just starting up a SQL Server instance that goes and says, hey, there are my databases over there. It's the exact same data no matter which node you access it through. That's part of why failover can be so fast: it only has to update one copy of the data. It's not like database mirroring or availability group replicas, where it's syncing a second copy of the data.
We have a single copy of the data, and therefore we have a single point of failure. So if something terrible happens, which we never ever want to happen, like oops, a user accidentally deleted data in a critical table, or we had an attacker who compromised our application and changed a bunch of data or dropped a bunch of stuff, or we had data corruption, we've just got one copy of the data. We've got to have backups that we can restore from, and operational techniques that can help us address these problems when they happen to us. Clustering doesn't keep these bad things from happening. If I have data corruption and I just fail my instance over to another node, I've effectively just restarted my SQL Server, and in theory it could make my problems worse, depending on what type of corruption it was and where it was; it's certainly not going to get any better, right? It doesn't help us at all there.

Clustering is really critical to the future of high availability and fast recovery. It's a really exciting thing to learn about, because Windows failover clustering is pretty key to lots of technologies that SQL Server has introduced and is building on. Here's an image of an AlwaysOn availability group. AlwaysOn availability groups are built on a foundation of Windows failover clustering, but they don't necessarily work in the exact same way as the failover clusters we've been looking at. So this image is a little different than what I've shown you so far, but it's still related to Windows failover clustering. Here we have two physical servers, and I did draw these to be a little bit bigger, as if they're 4U servers instead of 2U servers, just to represent, hey, these have a lot of drive bays in them, so they have a lot of local storage. I don't have that storage element drawn separately here; we don't have separate shared storage.

Here in our AlwaysOn availability group, what I have is node 01, which has three user databases, two of them defined in an availability group. These databases have their storage on node 01; the data resides on node 01, and the availability group is transferring data over the network to the storage on node 02. The databases that are in the availability group have their data flowing: as data goes into their transaction logs, it gets streamed over the network, and a second copy of the data lands over on node 02. Now, you might say, hey, this sounds like database mirroring. It kind of does; it's like database mirroring on steroids. Each physical node is a member of a Windows failover cluster that's been built, and the cool thing is that the Windows failover cluster isn't doing the same thing with shared storage here, but it is providing some other features, like a listener, and like its quorum features. It has improvements over technologies like database mirroring, whose quorum model is pretty limited. So we have lots of new options with failover clustering as the groundwork here.

In our availability group here, I have my primary with the two databases in the availability group. I can access them through that listener, who's represented here by the unicorn, and my listener can move around as things fail over. I can also do things like make some replica databases read-only, I can have multiple replicas, I can make some of them synchronous, I can make some of them asynchronous, and I can do all sorts of really cool things, all built on that foundation of failover clustering. I haven't given you a full explanation of everything you can do with availability groups here; the main things to know are that it's all built on a groundwork of failover clustering, and that it's really critical to learn failover clustering and all of its different permutations, because this stuff does get complicated.
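The data flow just described, log records hardened locally, streamed to replicas, with synchronous replicas acknowledging before a commit completes, can be sketched as a toy model (pure Python with made-up node names; not the real AlwaysOn protocol):

```python
# Toy sketch of availability-group-style log shipping: the primary hardens
# each log record locally, streams it to every replica, and a synchronous
# replica must acknowledge before the commit is reported complete.
class ToyReplica:
    def __init__(self, name, synchronous):
        self.name = name
        self.synchronous = synchronous
        self.log = []

    def receive(self, record):
        self.log.append(record)
        return self.synchronous   # only sync replicas send a blocking ack

class ToyPrimary:
    def __init__(self, replicas):
        self.log = []
        self.replicas = replicas

    def commit(self, record):
        self.log.append(record)   # harden locally first
        acks = [r.receive(record) for r in self.replicas]
        # Commit completes once every synchronous replica has acknowledged;
        # asynchronous replicas receive the record but never hold up commits.
        return all(ack for r, ack in zip(self.replicas, acks) if r.synchronous)

sync_replica = ToyReplica("NODE02", synchronous=True)
async_replica = ToyReplica("NODE03", synchronous=False)
primary = ToyPrimary([sync_replica, async_replica])
print(primary.commit("INSERT ..."))  # -> True (sync replica acknowledged)
print(len(async_replica.log))        # -> 1 (async replica got the record too)
```

The asymmetry is the point: a synchronous replica trades some commit latency for zero data loss on failover, while an asynchronous replica can lag behind.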
And you can make it complicated. Here's a picture; I don't want you to start here, but it's kind of interesting to know the places you can go. In this drawing we have a Windows failover cluster. It's got two nodes, and in this case they've been named with short abbreviations like SQLAG01CLU01, and you start to see that using short abbreviations for consolidated names can be really helpful, because if you were to use long versions, your names would be a mile long. We've got two physical nodes that are in a failover cluster, and they're using that shared storage. So for the instance that's in blue, if that physical node goes down, it's going to automatically fail over to the other node. Yep, it'll be offline while it moves over there, but it'll have an automatic failover, just like the scenario we described at the beginning. But at the same time that that stuff is happening, I can also have that guy at the bottom, the green instance down there. It's an availability group replica, and in my case it's asynchronous and read-only, and it's handling read-only requests for scale-out reads. It can just be down there receiving changes as they come through and presenting them for reads while the primary instance is maybe failing over to another node in the cluster.

I say don't start here because as things get more complicated, the room for things to go wrong increases. Your high availability solution could be the main cause of your outages if you're not careful. If we lose quorum, if there's a bug, if there's a race condition, if there's a problem, our instance may decide, oh hey, I should shut down, right? If the cluster decides everything has gone bad, maybe because of a configuration we made because we didn't know exactly how to configure quorum in the best way, or maybe because of a change that we planned in an incorrect way, or maybe just because of things beyond our control, then things can go offline. So we want to start in a simple way and not introduce complexity that we don't need. You want to look at: what is the environment that I really, really need, and where can I start to make sure that I can support it, so that it's not more complex than I and my team can handle? Because your team needs to be able to handle this if you happen to win the lottery, right? I don't like to say "hit by a bus"; I like to say "if you win the lottery."

So we learned a lot today about the foundation of clustering, plain failover clustering. I'm a huge fan of failover clustering. It saves your bacon when things go wrong at the individual server level or at the operating system level. It really can make your life better when it comes to planned maintenance and just responding to those little things that can go wrong in physical hardware; it's great for that. We do still have storage as a single point of failure, so how we plan our backups, how frequently we take them, how resilient our backups are, and how fast we can respond to data being deleted or data being corrupted, that stuff is still super critical.

Do know also that it's tricky to pair failover clustering with some other technologies. I often get the question: what if I have a failover cluster that's all virtualized? The answer is, well, technically that is supported, and technically it will work, but if you have problems with network communication between the nodes, things can start getting really, really wacky. If there's latency between the servers and the storage, things can get really, really wacky. When you do this, you've combined a hypervisor layer with Windows failover clustering with SQL Server, so if you start having things like unexplained failovers, sorting out exactly what the reason is for those failovers gets really, really hairy, and you've got network administrators and SAN administrators and database administrators and sysadmins all arguing with each other. That's a pretty complex world when you're merging these things in with other complex technologies, so do tread with care; don't start there either.

We've also learned that the future of high availability and disaster recovery is grounded in failover clustering, so it's a super critical technology for SQL Server people to learn. One thing I didn't mention is that with many of these technologies, like availability group replicas, each replica does have a separate copy of the storage, and certain things get a little easier. If SQL Server detects a corrupt page, it can say, hey, I've got a replica who has another copy of the data, I'm going to go out and see if I can grab the page from there and fix it up. Database mirroring has that too. So you have to be using a replica with separate storage, but if you're doing that, you actually get some protection there in terms of corruption. Now, obviously, if somebody deletes a bunch of data from a table, that's just going to replicate over to the replica, so it doesn't help you in every situation.

We've also got evolving settings for quorum in new versions of Windows Server that help us out in multi-node clusters, clusters with more and more nodes, and help do things like recalculate quorum if some nodes are offline and reconfigure the population of possible voters, which is very, very cool.

If you'd like to learn more about clustering, go to brentozar.com/go/cluster. At this address you can get a link to an article that covers lots of stuff about failover clustering with tons of Q&A at the end, the poster that I've just created on failover clustering, more details on why you shouldn't necessarily virtualize your cluster, as well as a great post that Brent wrote on why everything you know about failover clustering is wrong, which I think is a great post. Thanks for watching today's video, and join us again for a webcast soon!
Info
Channel: Brent Ozar Unlimited
Views: 81,417
Keywords: Clustering, SQL Server Cluster, Unlimited, AlwaysOn, Availability Groups, SQL Server, Brent, Ozar, Active Active Cluster, MSSQL, Failover Cluster, Brent Ozar Unlimited
Id: zHDk7f90Atw
Length: 36min 32sec (2192 seconds)
Published: Wed Aug 21 2013