What to do with a Degraded ZFS Pool

Captions
So there you are, minding your own business, when you discover your Proxmox server has a failed disk. Luckily, you were smart enough to run your disks in a mirror, so your data's not been lost and the server will still be up and running. But since RAID is not a backup, time is now a factor in making sure it stays that way. Today we're going to simulate a disk failure and show you what steps you should take to get your ZFS mirror back up and running.

"All right, we've got some actionable items from the board, so listen up. We've been working on collaboratively disseminating creative thinking so we can implement multifunctional customized paradigms to revolutionize our compute processes. Now, I have no idea what any of that means, but we need an answer to them by the end of the day. So, suggestions?"

"Uh, what about Vultr?"

"Well, that depends. Does it include corporate buzzwords?"

"Well, Vultr is the world's largest privately owned cloud provider."

"Yes, the cloud, I love it. They say the cloud is good for synergistically transforming enterprise-wide core competencies."

"I don't know about all that, but with Vultr we can instantly roll out high-performance cloud servers with their one-click deployment tool. They have plans for virtualization or bare metal instances, object storage, even GPU compute for AI-accelerated workloads."

"Hm, I like what you're saying, but could you translate it into a language that I understand?"

"With Vultr, we can quickly revolutionize our compelling processes with customized virtualization platforms, increasing ROI through cloud-based resources."

"All right, well, I think that's lunch."

With Vultr, you can skip the corporate talk and get right down to business. With 32 data centers around the world, they'll have an instance near you and your customers, whether you need a single VM or a full global rollout. Visit getvultr.com/craft and get a $250 free trial for your first 30 days. Again, that's getvultr.com/craft, and a huge thanks to Vultr for sponsoring today's video.

Welcome back to Craft
Computing, everyone. As always, I'm Jeff. Today we're going to talk about something that's not often discussed, as part three of my Proxmox 8.0 series: what do you do in case of hardware failure? Let's face it, home labbers and, more often than not, a large majority of small and medium-sized businesses have used or inexpensive server gear running all of their operations, which can lead to a higher rate of failures. So what do you do about them when they happen?

My current recommendation for running Proxmox is to install it on a ZFS mirror or a ZFS RAID to store both your OS and your storage drives on. Now, this video is also making the assumption that you're going to be using ZFS and not the older LVM for your storage pools. For today's example, we're going to be using the DIY $1,000 home lab server, this guy right over here, that I built a couple of months ago. Inside it are two 1 TB Silicon Power A60 NVMe SSDs in a ZFS mirror that are responsible for booting the OS as well as hosting all of my VM disk storage.

So why are we doing this video right now? Did I use a cheap NVMe drive, and now I'm suffering the fruits of my labor? Actually, it's nothing like that. Last week I needed a 1 TB drive for a PC I was building for my nephew, and Amazon decided to deliver the disk that I needed for that PC two days later than I needed it to do the video shoot. So I did what anyone would do: I ripped the 1 TB NVMe drive out of the server, knowing I could replace it in a couple of days when the new disk came in. Plus, I could film the entire process for all of you to see. Win-win.

So let's go ahead and dive into Proxmox and see what exactly a failed drive looks like. You'll notice right off the bat that one of the shocking things about Proxmox is there's no current drive status anywhere to be found. If we go to the Datacenter tab and click on Summary, you can see that this is a view of all of our current servers in this Proxmox cluster, and the server is reporting as online with no errors. In fact, the only error
we see on this entire page is in the subscription tab, because I don't pay for the Proxmox Enterprise subscription repository. By the way, I will be doing a quick blurb in a future video about how to get rid of that warning and get on the mainline repository with all of the updates that you need, so make sure you're subscribed.

Still on that Datacenter tab, if we go down and click on Storage, you can see that a ZFS pool is present and usable on this server cluster, but it doesn't show any errors either, although I know this ZFS pool is degraded. Even further, we can go down and click on our home lab DIY server, which is what I named that server over there, and it's the same thing: on the main Summary page there's no explanation or even indication that we might have a problem here. It's only once we go down to Disks and then click on ZFS pools that we can see that the health status for this ZFS pool is set as degraded. Double-clicking on that, it's pretty obvious to see why: we have one NVMe drive online and one NVMe drive reporting as unavailable, because, well, it just doesn't exist anymore.

This right here is why it's important to set up monitoring systems to both check and report to you about system health. Now, I've done a quick tutorial on Observium before here on the channel, but I've not explained Proxmox integration in that. Maybe that'll also be a topic for a future video.

So now that we know we have a degraded RAID, there are a couple of options to move forward. Just because a RAID is degraded doesn't mean you have a failed disk, and this is especially important if you have a home lab or run a small or medium business where you're not going to have a service plan or necessarily warranties for all of your hardware all the time. One important thing to take a look at is the read, write, and checksum errors, which are all reported both for the entire pool and then per disk down here below. Now, as you can see, in my instance there are zero errors on the entire pool, which means my problem is not that a
drive is failed (although in this case it is, because the drive is just gone), but that there exists some other issue with RAID integrity. Even as few as two or three read or write errors can cause your RAID to become degraded. That doesn't necessarily mean that a drive had a failure; it just means that a particular read or write operation didn't succeed the first time. If you have just a couple of failures (my over/under is usually double digits), so nine or fewer read or write errors, you can probably try to recover that data, or at least do a verification and data integrity check on your RAID to make sure that everything is in good working order. If you have over 10 errors, it's probably a good time to start looking at physical drive failure, as they tend to stack up rather quickly.

If you have zero or a very low number of errors, you can check integrity across all of the blocks on those drives by running the command zpool scrub -v and then the name of your ZFS pool, which in the case of Proxmox is going to be named rpool. This will compare all the parity bits between the storage blocks and determine if the checksums between them are all correct. If the scrub comes back without errors, you can type in zpool clear to clear all the errors and set the status of this zpool as clean again.

An indicator that a drive may have failed is its SMART status, which you can check by going to the Disks tab under your local server. Now, as you can see, my home lab server does have quite a few disks on it, and if we pay attention right up here to my NVMe drive, you can see that the SMART test has passed, which means that according to SMART and the hardware monitoring inside the drive, it doesn't think a failure is imminent. In my particular case, I am missing my second NVMe drive entirely, which is a clear indicator that data can no longer be written to it. For the sake of this video, let's pretend I already determined that my particular disk had failed. Now let's go
ahead and replace it. Now, if you have a disk that is failing and it is a hot-swappable drive, you can set the status of that disk to offline by typing in zpool offline rpool, which is the name of your pool, and then the disk ID of the disk you would like to remove. If the drive is hot-swappable, you can then remove the disk from your server and replace it with a new one. In my case, since we're going to be replacing an NVMe drive, we will need to shut the server down entirely to swap out the disk before we move forward, so let me go ahead and do that right now.

All right, so we have the drive replaced, the server is booted up, and now we can get to actually reconfiguring this zpool mirror. First up, I'm going to type in zpool status one more time, because we're going to need this drive path right here. This is the path of the disk that is no longer present on this server. Long file path aside, replacing the drive is fairly simple. I'm going to type in zpool replace, then the name of my zpool, which is rpool, followed by the disk that I'm going to remove, so that is this very long path right here (we're just going to copy and paste that), followed by the disk ID that we're going to replace it with. For that we can use the much simpler path inside of Proxmox, which is /dev/nvme0n1, so /dev/nvme0n1. When I hit Enter, it should automatically remove the old disk, add the new disk, and then begin a resilvering process, which is synchronizing both disks together in a brand new mirror. To check on that, I can type in zpool status one more time.

One thing I love about NVMe drives is just how freakishly fast they are: it resilvered 13.2 GB in 12 seconds, zero errors were found, both drives are now online, and our zpool has been repaired. Now, obviously, spinning disks or any kind of pool with much larger capacities is going to take a much longer time to resilver, but in this case all I have on here is my Proxmox OS itself plus the OS VM disks for a couple of other VMs, so it's not very much data at all.
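The replacement steps described above can be sketched as a small dry-run helper. This is only an illustration under stated assumptions, not the exact commands typed in the video: the function name `replace_commands` and the `OLD_DISK_ID` placeholder are hypothetical (the real value is the long path that zpool status shows for the missing disk), and the helper prints the commands rather than running them. It follows the whole-disk form used in the video; a bootable Proxmox mirror may also need partitioning and bootloader steps that are not covered here.

```python
import shlex


def replace_commands(pool, old_device, new_device, hot_swap=False):
    """Return the zpool commands from the walkthrough as argv lists.

    Nothing is executed here; the caller decides whether to run them.
    """
    # Look up the identifier of the missing or failing disk first.
    commands = [["zpool", "status", pool]]
    if hot_swap:
        # Only for hot-swappable drives: take the disk offline before
        # physically pulling it. Otherwise, power down and swap instead.
        commands.append(["zpool", "offline", pool, old_device])
    # Swap the old device for the new one; this starts the resilver.
    commands.append(["zpool", "replace", pool, old_device, new_device])
    # Check resilver progress and the final pool state.
    commands.append(["zpool", "status", pool])
    return commands


if __name__ == "__main__":
    # OLD_DISK_ID is a placeholder, not a real device path.
    for argv in replace_commands("rpool", "OLD_DISK_ID", "/dev/nvme0n1"):
        print(shlex.join(argv))
```

Printing the commands first makes it easy to confirm the pool name and both device paths before anything touches the pool.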
This process isn't just good for ZFS mirrors. It also works for any type of ZFS disk that uses parity, so all of your different types of ZFS RAID, like RAID-Z2 and the like. It is the exact same process and the exact same steps. All you need to do is make sure that you have a drive that is large enough to support your RAID. In my case, I had a 1 TB NVMe drive, so I need to replace that disk with at least a 1 TB NVMe drive. It can be larger, but it can't be smaller.

And that's pretty much the basics of it. It's why I like ZFS so much: the commands are incredibly intuitive, they're very simple, and the entire process is basically seamless. The user experience of replacing a drive in one of these pools is so night and day from the old hardware RAID configurations that I've used in the past, and honestly, software RAID really should be the way of the future. There's no hardware requirement or hardware-level acceleration or RAID controlling happening that prevents you from migrating this ZFS pool to another system or swapping out parts that aren't necessarily identical matches for what you already have. It's an open architecture, it's an open file system, and it's lightning fast to repair any issues if and when they arise.

But like I said, sometimes the first step to solving a problem is knowing that you have a problem at all, so keeping tabs on your system health and having good monitoring systems in place is paramount if you want to keep your systems problem free. As I mentioned, I have done a tutorial with Observium before, and you can click right up here if you want to see that. I will be updating that for Proxmox integration later on, so make sure you're subscribed so you don't miss that future content. That's going to do it for me in this one. Thank you all so much for watching, and as always, I will see you in the next video.

Cheers, guys. Beer for today is from Ex Novo Brewing out of Portland, Oregon, or at least it will be until they move all operations to their New Mexico brewery.
This is the Proper Fence Pilsner lager with New Zealand hops, clocking in at 4.8%. I don't drink a lot of pilsners and lagers on the channel, and it's not because I don't like them. It's just that cost-wise, I feel like I get more for my money buying a double IPA or an Irish red, or even a stout or a porter or something like that. It is just as expensive for a craft pilsner as it is for one of those other style beers, especially from a brewery like Ex Novo. And at 4.8% and being a pilsner, it's not exactly the greatest vessel to give you a lot of different flavor profiles, and unfortunately that's very much the case here. This is very much a drinkable beer, but I don't think it's all that much better than a Rainier or a 10 Barrel Pub Beer or something like that. It's a very standard pilsner with very standard malts, and it's perfectly drinkable. It's very good. I just don't think it was worth $12 for a four-pack, but to each their own. Cheers.
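The monitoring point and the error-count rule of thumb from the video can be sketched together as follows. This is a minimal sketch under stated assumptions: the function names `pools_healthy` and `triage` and the returned labels are hypothetical, the threshold of 10 is the speaker's rough "double digits" cutoff rather than anything defined by ZFS, and treating exactly 10 errors as suspect is an interpretation, since the video only mentions "nine or fewer" and "over 10". The health check relies on `zpool status -x`, which prints "all pools are healthy" when nothing needs attention.

```python
import shutil
import subprocess

# The video's rough cutoff for read/write errors; not a ZFS constant.
ERROR_THRESHOLD = 10


def pools_healthy(status_x_output):
    """Return True when `zpool status -x` output reports no problems."""
    return status_x_output.strip() == "all pools are healthy"


def triage(read_errors, write_errors, device_present=True):
    """Apply the video's rule of thumb to one device's error counters."""
    if not device_present:
        # A device that is missing entirely has to be replaced.
        return "replace"
    if read_errors + write_errors >= ERROR_THRESHOLD:
        # Double-digit errors: start suspecting the physical drive.
        return "suspect-drive"
    # Zero or few errors: scrub, and clear the errors if it comes back clean.
    return "scrub-then-clear"


if __name__ == "__main__":
    if shutil.which("zpool") is None:
        print("zpool not found; nothing to check")
    else:
        result = subprocess.run(
            ["zpool", "status", "-x"], capture_output=True, text=True
        )
        if pools_healthy(result.stdout):
            print("healthy")
        else:
            # Hand the raw report to whatever alerting is in place.
            print(result.stdout)
```

Run on a schedule, a check like this would surface the degraded pool that the Datacenter summary page in the video did not show.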
Info
Channel: Craft Computing
Views: 36,166
Id: IQA7aTezrVE
Length: 13min 27sec (807 seconds)
Published: Wed Nov 15 2023