Proxmox VE How To Set Up High Availability

Captions
One of the many reasons for running a hypervisor cluster is to reduce downtime. In other words, if you've got a standalone hypervisor and it fails, then all of the virtual machines that were running on it go out of service. Fortunately, hypervisors like Proxmox VE can help to reduce that downtime because they provide a high availability (HA) service. Provided that the hypervisors all have access to the same shared storage and you're running virtual machines from there, then if one of your nodes were to fail, another node can start up those virtual machines. So how do you configure high availability in Proxmox VE? If that's something you're interested in finding out, then stick around and watch this video, because that's what we'll be going over.

Because this video is specifically about setting up high availability, I'm going to have to make some assumptions, otherwise the video is just going to get too long. The first is that you either already have a cluster or you know how to create one. In my case, my cluster is made up of three servers, which is really the minimum you should have. You can have larger clusters; it's just best to have an odd number of servers in the cluster. In any case, if you don't know how to create a cluster, I do have another video you can check out which shows you how.

Another assumption is that your hypervisors all have access to the same shared storage, because that's essential for high availability. In other words, if you're running a virtual machine on a hypervisor but its files are stored on the local hard drive of that server, then if that server stops working, none of the other hypervisors can access those files, so they can't start the virtual machine. There are different types of shared storage available; in my case I'm using an NFS share, but what you use is entirely up to you, as long as all of your servers have access to the files of your virtual machines. If you're not familiar with how to set up shared storage on Proxmox VE, I do have a video that shows you how to connect to an NFS share.

My last assumption is that all of the servers in your cluster are keeping the correct time, because for HA to work properly the clocks on these servers need to be kept in sync. By default, a server should be trying to connect to the internet to get its time from an NTP server, so unless you're blocking that access you should be fine. However, it's still worth double checking. The easiest way is to navigate to Datacenter and then to HA, where you can see the timestamps reported by your servers and check whether they're all in sync.
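If you'd rather check time synchronisation from a shell, you can do that over SSH on each node. A minimal sketch, assuming the nodes use chrony, which recent Proxmox VE releases install by default (older setups may use systemd-timesyncd instead):

    # Run on every node; look for "System clock synchronized: yes"
    timedatectl status

    # If chrony is the NTP client, this shows the offset from its sources
    chronyc tracking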
Now, the first thing to do when it comes to setting up high availability in Proxmox VE is to create a high availability group, because this determines which of the nodes in the cluster a virtual machine should be running on. To do that, go to Datacenter, then down to HA, and click on Groups, then click Create to create a new group. You do need to give it an ID, so make it something meaningful, something that makes it easy to understand what the group is meant to provide. For example, I'm going to call this one "all-nodes", because I just want to set up basic redundancy: I want my virtual machine to be able to run on any of the nodes in the cluster. You can pick out individual nodes from the cluster, or you can select them all in one go, which is useful if you've got a large number of servers. You can be very selective about which nodes your virtual machines can run on, but I'm going to keep things simple for now; the other settings I'll cover later. To create the group, we then just click Create, and we've now got a group that we can assign virtual machines to.

Now that we've got a group created, we can start setting up high availability for our virtual machines, and this is done on a per-virtual-machine basis. There are actually two different ways to do this in Proxmox VE. One option is to pick a virtual machine, click on the More drop-down menu, and select Manage HA. Another option is to go to Datacenter, then HA, and under Resources click Add, then pick out a virtual machine from the list. As you can see, we get exactly the same dialog box no matter which of these two options we choose.

There are a few different settings for high availability for this virtual machine. Obviously we've got the virtual machine itself picked out, but we need to set which group it's going to belong to, so click on the drop-down menu; I've only got one group to choose from, so I'll select that. Over here we've got Max Restart. This is the maximum number of attempts to restart a virtual machine on a node if it has stopped. For example, let's say we've got a virtual machine running on one of the nodes and it just stops working, so it goes into a stopped state; this determines the number of attempts the resource manager will make to try to get that virtual machine back up and running on that same node. You can set this to whatever you want; by default it's 1. I'll change it to 3, just to give it a bit more of a fair chance. Then we've got Max Relocate; I'll set that to 3 as well. That determines the number of attempts that will be made to migrate the virtual machine to another hypervisor in order to get it back up and running.

Now, when I say get a virtual machine back up and running, it really depends on what the requested state is. By default it's set to "started", but if you click on the drop-down menu there are other states you can target. In most cases you'd want the virtual machine to be running, but if you want to know what the other states are, you can click on the Help button; it opens a tab that gives you more details. They may be useful if, say, you're decommissioning things or in the middle of maintenance, but typically you'd want the virtual machine in a started state, in other words running all the time. The final field we've got here is the comment, which is useful for giving the virtual machine a more meaningful name within this pane. Let's say, for example, this is my DNS server. When I click the Add button, you'll see we've now got a more meaningful description of what it is we're doing here: we've got vm:101, and we know it's called zorin1, but the comment is a bit more meaningful, saying that this is DNS server number one.
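For reference, the same group and resource can be set up from a shell on any cluster node with the ha-manager tool. This is just a sketch; the node names node1 to node3 and the VM ID 101 are the examples from this lab:

    # Create an unrestricted HA group covering all three nodes
    ha-manager groupadd all-nodes --nodes "node1,node2,node3"

    # Put VM 101 under HA control: keep it started, allow up to
    # 3 restart attempts on the same node and 3 relocations
    ha-manager add vm:101 --group all-nodes --state started \
        --max_restart 3 --max_relocate 3 --comment "DNS Server 1"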
As you can see, it's already kicked off, and it's in the process of starting up that virtual machine on node 1. That's basically because we've got a group set up where we weren't very specific about which node we want that virtual machine to run on. But you can reassign these: you can set them with different groups and then start to be selective about which node you want a virtual machine to run on.

I've since added two more virtual machines for the resource manager to look after. They all belong to this "all-nodes" group, which means they can run on any of these three nodes. They were all originally assigned to node 1, so when they started they were all left on node 1; in other words, there wasn't a particular reason to migrate any of them to node 2 or node 3. But you can influence which node a virtual machine runs on by creating groups and setting different priorities for the nodes within that group.

Let's say, for example, we want zorin1 here to run on node 3 during normal operations. In that case, we go to Groups, click Create, and create a new group. We'll call this one "prefer-node3", select all of our nodes, and give node 3 the highest priority, say 3. What we do with the other two nodes really depends. Maybe we set node 2 to priority 2 and node 1 to priority 1, so that under normal circumstances the virtual machines will run on node 3; if not, we want them running on node 2; and failing that, on node 1. In other words, 3 has a higher priority than 2, which has a higher priority than 1. On the other hand, maybe you're not too bothered about which of the other two nodes gets used, in which case you could leave their priorities the same. We'll just click Create to create the new group.

If we go back to HA, pick out this zorin1 VM, and click Edit, we can now change the group to "prefer-node3". One thing to bear in mind: as soon as I click OK, we're probably going to end up with a potentially disruptive change, because by changing the group we're instructing the resource manager that we prefer this virtual machine to be running on node 3.
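Node priorities can be expressed on the command line as well, as node:priority pairs in the node list, where a higher number is preferred. A sketch with the same hypothetical names:

    # Prefer node3, then node2, then node1
    ha-manager groupadd prefer-node3 --nodes "node3:3,node2:2,node1:1"

    # Move the HA resource into the new group; note that this can
    # trigger a live migration towards the preferred node
    ha-manager set vm:101 --group prefer-node3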
As soon as the resource manager picks up the change, it's going to migrate this virtual machine over to node 3, and that can actually be disruptive, so you do have to be careful when changing the group. I'm going to click OK, and we just need to leave it a short amount of time; at some point the state should change to say it's going to get migrated. And there you go: the state has now changed to "migrate". Shortly afterwards you'll see this start to change; it tells us the virtual machine is being locked (the config is now locked), and we've got a little arrow showing that it's in the process of being migrated. Shortly after that, you'll see it pop up on node 3. In other words, because we've changed the group and set a preference within that group for node 3, it's now migrating that virtual machine across to node 3. This is much like other hypervisors, where you migrate a virtual machine from one node to another: usually it's not that disruptive, but it can be all the same. So bear that in mind; when you change the group and that influences where the machine runs, it can cause problems, so it's probably best to do something like that out of hours.

Well, now that we've got high availability set up for our virtual machines, it's time to test it. We've got zorin1 here running on node 3, so if we go over to its console to see what state it's in (I need to log back in), it's polling away, pinging the default gateway. Bear in mind this is a virtual machine within a virtual machine: the Proxmox hypervisors I've got are actually nested virtual machines within ESXi, so it's not that fast, I must admit. But it does work pretty well, and it's actually quite quick to detect failures, especially compared to a manual process where a server falls over in the middle of the night, for example, and you don't find out until the next day. So what we're going to do is simulate node 3 losing power, and I can do that through ESXi just by powering it off. I'm going to click the power off button for node 3.
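While you wait, you can also follow what the cluster and the HA stack are doing from a shell on one of the surviving nodes. A quick sketch using standard Proxmox VE commands:

    # Cluster membership and quorum state
    pvecm status

    # Per-resource HA state (started, migrate, fence, error, ...)
    # and the node each resource is currently on
    ha-manager status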
What we'll see at some point (oh, there you go) is that we've basically lost access to the console, and at some point node 3 is going to disappear. If you go back to Datacenter and then HA, after a short while it should no longer be detected by the cluster, and, there you go, it's now reporting an old timestamp: the server is effectively dead. Now, it is going to take a few minutes to notice, typically about two or three minutes, and again, bear in mind this is a completely virtualised environment I'm running here. But when it does detect that this server is no longer available, in other words that the node is down, high availability will kick in, and this virtual machine should get migrated to one of the two other nodes, depending on the preferences within our group. We've got node 3 set up as the preferred node, and if we go into that group, the other two have equal priority, so it could end up on either of them. We've just got to give it time to detect the outage and then start taking action.

Well, as you can see, zorin1 has now been assigned to node 2. The group we've got could have put it on either node 1 or node 2, but the resource manager chose node 2. If I click on that and go back to the console to see what state it's in, I've actually got a login prompt, so it's back up and running. And I stress that because it underwent a cold start. The virtual machine had been running on node 3, but what doesn't happen is that the state of that virtual machine, in other words all of its memory, ends up in files on the shared storage; that virtual machine was running within the memory of node 3.
So when node 3 went down, whatever it was up to at the time was lost. When HA kicked in, it had to do a cold start of this virtual machine, starting it up from scratch, so whatever it was in the middle of doing, as I said, has been lost. Now, that's not really a big thing in this particular case, as it was just pinging the default gateway, but typically you'd be running servers with services that start automatically, so you'd lose them for a short amount of time and then they'd kick back in again. Maybe you'd lose a web server, say, so you'd have one less web server available in your web farm for a while, but it doesn't take that long to recover. It really comes down to how quickly the outage is detected and then how long it takes to bring that virtual machine up. In this case it took quite a few minutes, because this is all a virtualised environment; even the NAS it's using is really a virtualised TrueNAS server. It would be a lot quicker if this were a bare-metal solution and these were all physical hypervisors. Even so, it was pretty quick, especially when you compare it to a manual recovery. Granted, within a business you'll have teams working out of hours, for example, who'll get notified and can take manual action to get a virtual machine back up and running, but this leaves you with very little downtime, just a few minutes, because it's quick to detect an outage and quick to get a virtual machine back up and running. It really just depends on what that virtual machine is doing as to whether that's good enough. Like I say, if it's a service, it'll automatically start back up; if it's an application running in the foreground, within a console session, then it would be lost. But it's still very, very useful.

Now, compare that with other systems that have dedicated HA solutions, typically where you've got a pair of computers constantly monitoring each other. That's a different story: those work at the application layer, a level where they can detect whether the application on the computer has stopped working. They're much more sophisticated in what they do, whereas this is basically just checking whether the virtual machine has stopped working and then trying to get it back up and running again. All the same, it's still a very useful feature to have within a virtualised environment.

Now, if you're going to be using priorities within high availability groups, then there is one setting you need to pay close attention to, particularly if you want stability, and that is the "nofailback" setting, because by default it's not enabled. The situation we've got at the moment is a group with three servers where node 3 is the preferred server, with the highest priority; the only thing is, node 3 is currently offline, so my virtual machine zorin1, which belongs to that group, is currently running on node 2. Once node 3 comes back online and is available for use, the resource manager will try to migrate this virtual machine back to node 3; in other words, it's going to carry out a failback. Now, the concern with that behaviour is that if you've got a lot of virtual machines, that's a lot of virtual machines
to be migrating back to that node, and if that node comes back up during normal working hours, that could be an issue. Although the migrations don't necessarily cause that much disruption, it's still technically a change carried out within normal working hours, which the business you're working for may not allow, for instance. Another concern I'd have is if the server had intermittent problems. Say it encounters a hardware or software problem that results in the server restarting; the system has just enough time to start moving some of the virtual machines off onto other nodes, by which time the server is back up and running again, so it then has to start moving those virtual machines back, only for the server to go offline again. That's not really a stable situation to be in. There is no maintenance mode per se, at least not that I'm aware of at the time of recording, for Proxmox, so this default behaviour means that as soon as the server is back up and running, the resource manager will try to put these virtual machines back on it. It's not going to come up in some sort of maintenance mode, as you'd get on ESXi, which would prevent that from happening. So that is something you need to be aware of, and personally, what I'd prefer to do is tick that box and select "nofailback".

But just to demonstrate the default, I'm going to start that server back up again, and what we should find is that the virtual machine sure enough does get migrated back. Well, just as I was saying: once node 3 came back online and became available, the virtual machine that had been moved to node 2 was migrated over to node 3. Personally, I don't see that as a particularly stable strategy, so I think the better option is to go to Edit and select "nofailback" for this group. What I'm going to do now is shut that node back down again; eventually the cluster will detect that the server is no longer available and move the virtual machine off to one of the other two nodes. Then I'll repeat the process, and what should happen this time is that the virtual machine stays on that specific node. So: this time the virtual machine has been moved back to node 2, and I've started node 3 back up again. What we should find is that once that server becomes available, because we've got "nofailback" selected, although it's the preferred server in the group, the system won't try to move this virtual machine back to node 3. As you can see, node 3 is now back up and running, and I deliberately left it plenty of time just to make sure the resource manager didn't try to migrate this virtual machine back to node 3, but sure enough, it hasn't. Now, if I go back and edit that group again and deselect that option, putting us back to the default behaviour, and click OK, then if I leave it long enough, there you go: it's automatically trying to move that virtual machine back to node 3. So it is something you can always change your mind on if you particularly want to.
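The same flag can be toggled from a shell with the ha-manager groupset subcommand, again using the hypothetical group name from earlier:

    # Keep resources where they are when a preferred node returns
    ha-manager groupset prefer-node3 --nofailback 1

    # Revert to the default failback behaviour
    ha-manager groupset prefer-node3 --nofailback 0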
Now, one more setting that you can use within high availability groups is this one here: "restricted". This group we've got here is made up of three servers, and I'll deselect nodes 1 and 2. What this means is that, by default, even though nodes 1 and 2 aren't part of the group, if node 3 were to go offline, the resource manager can still use other nodes within the cluster that aren't part of the group. So we'll test that out; I'm going to click OK. Technically, node 3 is now the only node the group allows this virtual machine to use, but if I power this server off, what should happen is that the resource manager will detect the server is no longer available and migrate the virtual machine over to one of these two servers, most likely node 2 because, at the moment, it doesn't have any virtual machines running on it. We'll leave that for a while and see what happens. Well, as you can see, our virtual machine is running on node 2, and that's just because we didn't place any restrictions on which nodes are available to the resource manager. Ideally we'd want this running on node 3, but node 3 is no longer available and there are no other nodes available within the group, so it's just gone with whatever other nodes were available within the cluster. But if you do want to be selective about which nodes a virtual machine can run on, particularly if you want to keep certain virtual machines apart and don't want to end up with two of them running on the same node, then that "restricted" option makes a lot of sense.

The last thing to point out about the "restricted" option is that it only affects the resource manager for high availability; in other words, it's not going to stop me from manually migrating a virtual machine to another node. So if I edit this group and turn on "restricted", as far as the group is concerned only node 3 is a member. If I click OK, that means that if node 3 were to go offline, this virtual machine is going to stay offline as well until node 3 comes back up. But if I manually select that virtual machine and tell the cluster to migrate it over to node 2, for example, it's still going to migrate, even though we've placed the restriction on the group and the only member of that group is node 3.
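From a shell, the equivalent would look something like the sketch below; the group name, VM ID and node names are still this lab's hypothetical examples:

    # Shrink the group to a single member and enforce membership strictly
    ha-manager groupset prefer-node3 --nodes "node3" --restricted 1

    # A manual migration still works regardless of the restriction
    ha-manager migrate vm:101 node2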
Now, one setting that you might want to think about changing for high availability is the shutdown policy. If you go to Datacenter, then select Options, click on HA Settings and click Edit, the shutdown policy by default is set to "conditional". There is a drop-down menu where you can choose other options, and if you click the Help button you'll get a tab with more details about what those refer to. What this default policy means is that if I go to a node and click the Shutdown button, it'll go through a graceful shutdown: it'll shut down the virtual machines, shut down the containers, and then shut down the node itself. At some point, high availability will kick in, realise that any virtual machines which were running on this node, and are now shut down, should be up and running, and start them back up again on another node within the group or the cluster. The trouble with that is that you're not really maximising uptime for the virtual machine, and that's what we're really using high availability for: we want to minimise downtime, yet we'll have shut the virtual machine down, waited a while, and then started it up from cold.

So you can change that setting if you want to. Go back to the shutdown policy, change it to "migrate", and click OK. Now what happens is that if I click the Shutdown button, it will still shut down the virtual machines and containers which aren't protected by high availability, but the virtual machines we've set high availability up for will be migrated to other nodes first. So it does maximise uptime for the virtual machines we want to keep up and running for as long as possible. And there you go: you can see it's locking the virtual machine so that it can migrate it across to another node. One thing to point out, though, is that this doesn't affect the Reboot button. If, for example, we were patching a node and then needed to reboot it, then when you click the Reboot button it still goes through that same shutdown process: all of the virtual machines will be shut down, and then you've got to wait for the high availability resource manager to detect that they're down and start them up from cold on another node. So really, that policy is more for planned maintenance: maybe you're going to replace part of the hardware, for example, or there's some work being done in the area and you have to shut the servers down for whatever reason. That's all the policy affects, and it doesn't affect non-HA virtual machines, just the virtual machines you've got covered by high availability. But it is still a useful feature.

Now, version 7.3 of Proxmox VE introduced a new feature that you might want to consider for high availability. If you go to Datacenter, then click on Options, here we've got Cluster Resource Scheduling; select that and click Edit. By default it's set to "basic", but there is another option, "static". If you click on the Help button, it brings up a tab giving a bit more information about this alternative. What the static mode does is offer a more sophisticated way of working out the load on all the nodes to decide where to migrate virtual machines to. The basic mode basically just counts services, whereas the static mode checks things like the amount of CPU and memory in use to see whether it's possible to spin up a virtual machine on another node. So it is a more sophisticated and probably a better way, but as they say, it's a technology preview, so you might want to try it in a test lab; in a production environment you probably want to be a bit more careful.
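Both of these are datacenter-wide options, and they're stored in /etc/pve/datacenter.cfg, so you can also set them there. A sketch of the relevant lines, assuming the choices made above:

    # /etc/pve/datacenter.cfg (excerpt)
    ha: shutdown_policy=migrate
    crs: ha=static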
Well, thanks for making it to the end of this video. I really do hope you found it useful; if so, then do click the like button and share, as that will help get the video out to more people who might find it useful as well. If you've got any comments or suggestions, please post them in the comments section below, and if you're new to the channel and you'd like to see more content like this, then do subscribe. Just remember to set the bell icon to send you notifications when new content gets released, although I also post to Twitter as well as Facebook. If you'd like to help the channel and support it, you can make contributions through PayPal and Buy Me a Coffee. I've also got links to Patreon, and there's also the Join membership option for YouTube itself; Patreon and YouTube members do have the option to benefit from early access as well. But above all, many thanks for watching this video. I'll see you in the next one.
Info
Channel: Tech Tutorials - David McKone
Views: 5,039
Keywords: proxmox high availability, proxmox ve, proxmox vm high availability, proxmox high availability vm, proxmox high availability shared storage, proxmox, proxmox tutorial
Id: hWNm4hYejqU
Length: 30min 37sec (1837 seconds)
Published: Fri Dec 16 2022