Is Your Hard Drive DYING? - How to tell

Captions
All right, how's it going, y'all? Today we're going to be talking about something tough, and that is dying drives. We're going to be talking about how to tell when your hard drive is dying, when you need to replace it, and some issues associated with that. This is going to be kind of focused around Synology NAS, though realistically this is pretty easy to tell with anything, because it's going to be based off of SMART data as well as just kind of finding things out and seeing.

So first off, let's talk about how hard drives die. Hard drives die in a very different way than SSDs die. Hard drives are mechanical components, and so as time goes on the likelihood of them dying just slowly gets larger and larger, until one day they just die. You can have a hard drive that lasts for two years, and I've seen drives that are totally fine running for seven or eight years and have not spit out a single issue. There's no real rhyme or reason; it just kind of happens. It's truly random, and because of that it's very hard to predict when a drive is dying, though you can generally tell with some indicators.

SSDs, on the other hand, have a pretty calculated lifespan, because the way SSDs die, almost always, is they just run out of writes. SSDs can only be written to so many times, though in this day and age that number is very large, so it's a lot less likely to occur. Eventually SSDs can just kind of die, but it's very predictable. In general, if you have five SSDs in a RAID, they're all going to die at about the exact same time, because they're all being written to in the RAID in the exact same way. Synology actually has a custom RAID that they wrote just to get around this, called RAID F1, which essentially sends more writes to some drives than others (I'm not actually sure how they balance it), with the goal of killing some drives earlier than others so not everything just dies at once. It's really not something you should be too concerned about, but I just wanted to 
put it out there. This video is really going to be talking about hard drives, though.

So hard drives have a great piece on them called SMART data that helps you learn what is going on, but it is not perfect. SMART data is the hard drive telling your NAS or computer or whatever, "Hey, I have an issue, I have this many issues, these are my stats." We want to talk about the key things to look at whenever you're trying to tell if a drive is about to die, and I'm going to dive into a great post by Backblaze here, because they have more data than anybody. Right here, this is just my post on the forum that pulls up the two really useful pieces, and by far the one we're going to actually look at is what hard drive errors actually tell us for SMART tests.

Backblaze has a bunch of drives. I mean a bunch; I think they've got about 200,000 hard drives actively in use. If you don't know, Backblaze is a backup company and they also sell storage, so basically they are a storage company. They have a lot of drives, and they are a really cool company because they actually publish their metrics about drive failure rates and things like that. Right here, this is the actual SMART data they use to tell when a drive is dying. These are the key ones that they actually look at and what they have found to be the best indicator of a dying drive.

It is really interesting, and I will leave a link to this article because it's great, but they talk about all the different SMART tests and how likely they are to show up in a failed drive. Here we can see the percent of their drives that have at least one nonzero value in the SMART tests, both for drives that are still operational and for the ones that have failed, and you can see they have a high correlation with failed drives. If you look at all of Backblaze's drives that have one or more of the tests with a stat greater than zero, only four percent of the operational ones have them, but for failed drives 76 percent of them have it, and so it's 
clearly a great indicator of a dying drive. Note: not all drives support all the SMART tests, so you might not have all of them. Another really interesting one is the combination between multiple. If you actually have multiple of those five SMART tests that are failing, then you're almost certainly dying, and the correlation gets higher and higher. It is very interesting, and they are big statistics guys, so I'll leave a link to this article, but what we're really going to learn from this is: hey, these are the five SMART tests that you should look at.

So this is kind of the anecdotal information for when a drive is going. I have a drive that I believe is dying. Recently I was running a scrub of my first NAS and I got some errors, and I want to talk about my process for figuring out what these errors are and what the issue is. I'm going to go ahead and show them right here. If we go into Storage Manager logs, this right here is my I/O errors. You can see I've got tons of them in a very short time span while the scrub is running, so right there, there's clearly an issue. Now, it is not a guaranteed issue. That's because this could just be a single section of the drive that had some junk data written to it, and the scrub was reading it for the first time and was really figuring it out. But generally this is a big indicator of a dying drive.

All right, so now after I've seen these logs, what's the next thing that I did? Well, the very first thing I'm going to do is make sure the scrub actually finished. So that's Storage Pool 1, and we're going to see that it did successfully complete. That means that not all was lost, not everything was messed up, so that is a good sign overall. At least it was able to complete, and that means that the pool is still operational, still working well enough, but that drive may be dying. So the next thing I'm going to go ahead and do is I'm actually going to look at that 
specific drive. That is drive number seven, so I'm going to go into it, select it, and run a SMART test on it. First off, a quick SMART test: we just go into SMART and run a quick SMART test right here. Quick SMART tests take a very short amount of time, so we can actually do it live right here. An extended SMART test takes a much longer time, and believe it or not, I've really never had an extended SMART test tell me something that a quick test has not. I still schedule extended SMART tests maybe every three months, but quick SMART tests are generally good enough and should be your first thing. Though if you do find a drive like this and everything is still coming up good on the quick SMART test, leave it overnight and run the extended SMART test, just because it can tell you some more information. And as you can see, we came back healthy. You can also see that right after that error came out, I also did an extended SMART test, and it was also healthy. Okay, so now we're saying, all right, the disk says it's okay. I also ran IronWolf Health and it said healthy as well.

All right, so quick introduction here from Will from the future. If you've not already seen my video on the WD Red WDDA issue: basically, Western Digital has added WDDA analytics. Regardless of what NAS you have, you may have a tab for Western Digital Device Analytics. This is on Synology, though it is getting removed, and it is now coming to QNAP. If you do see this test, I would highly recommend disregarding it and removing it; just essentially disable it and do not listen to any of the data out of it. I'll leave a link to a video where I go over it. Essentially what it is doing is it will flag a drive as "consider replacing" if it has been powered on for more than three years continuously. This is not useful, but you may see that and it may freak you out. If you see that, disregard it, and I would highly recommend disabling WDDA altogether. All right, back to the video.

All right, every check was coming up green. The next 
thing I would do is go into the actual SMART attributes, and we are going to look at 5, 187, 188, 197, and 198. So first off, 5: all right, that looks good, zero. Now we go on over to 187, and 187 is where you see that we actually do have some questionable issues. We have 145 uncorrected errors. Those are times that the drive was told to read information, failed to read it properly, and had to say, "Hey, I can't do this." 145 is also a lot. This is not just one or two; anything over 100 for this is a huge red flag and more likely than not means the drive is dying. If you just see one of them here, it really tends to be just a fluke; maybe there was a small timeout or something internal that just did not read it right. But when you've got this many, it tends to mean that there is some issue going on. I'm also going to check 188: nothing. And then 197 and 198: both zero. Because I've only got one of those SMART tests throwing that warning, I'm not going to be too concerned about it, because it is just one. If you have two, that's when you basically just go in and replace the drive ASAP, but with one you can kind of ride it out. Just know that this drive is probably dying.

So, the next thing I'm going to test to tell if a drive is dying. Say we still don't really see much of anything here and you're worried you've got a dying drive. The next really easy thing to do to test it is to dump a bunch of data to it and see how long it is taking. I'm just going to go on my computer over here and quickly run a Blackmagic Disk Speed Test. This guy's hooked up via 10 gig, so we can get the full thing, and we're going to look at our disk reports. I'll be back in a second after doing that.

So a Blackmagic Disk Speed Test is just an easy way of testing the throughput, and all I'm doing this for is basically forcing a load on the drives. You could also just copy and paste large files to and from the NAS to do this, but 
what we want to do is force all the drives to be hit as hard as possible. The next thing we're going to go ahead and open up is Resource Monitor, and we are going to look at our disks. We're going to go into this view details right here, and we're going to look for disk number seven, drive number seven in this case, to be off compared to the other ones. See how everything's right about the same? That is an indication that everything's probably okay. What would be an indication of a failing drive is that, say, all of these that are in the same RAID group are hovering around 30 percent utilization during this hit, except for drive seven. I've seen it where one drive that's failing is at like 95 percent utilization. That's because it is struggling to read data; it's having a bunch of errors internally and a ton of trouble reading data from the disks. That is a great way of telling which drive is dying.

I've even had users who have had such bad issues that DSM completely locked up, because the drive itself is saying, "No, I'm good, I can do this," but it is unable to read the data, so DSM just keeps asking it for the data. When you actually pull the drive out, DSM immediately becomes responsive. So if you ever notice DSM completely locking up like that, you may have a dying drive, and if you know which one, pulling it out can instantly restore it to usability. Though only do this if you've got a redundant RAID, you know the drive is dying, and you've got a good backup, because you don't want to pull it out and have it immediately desync, so that now if anything else fails you lose your pool. You want to be careful with that, but in times where it's completely locked up, I've had to do that before.

So here we can see that overall my tests have shown that it's probably dying. Because it only has a single one of our SMART tests that is having the issue, only 187, I'm not going to worry about it. I'm going to buy a new drive, but I'm not 
going to replace it just yet, just because, hey, it may keep working. But right there, that is a great indicator that this drive may be on its last legs, just because it did have the errors, though they were all corrected and everything did continue working. I also would run another scrub and see if the errors came back up. In my case, when I re-ran the scrub the second time, there were actually no more errors, and that makes me believe that it was probably fixed. So in this case I'm going to leave it running for a little while, but that's just because I have a backup going every single day to another NAS, all in the same rack, so it's not that big of a deal if I do have to restore from backups. I definitely am going to be buying another drive, though.

All right, well, I hope this was helpful for people in understanding what I do whenever I'm telling if a drive is dying on a Synology NAS, the kind of forensics you go through, and really what information is useful in determining whether a drive is going to die or not. If you want to hire me for a project, I've got a link for that down in the description below, and if you have any other questions, go ahead and throw a post on the forums at forums.spacerocks.com. All right, have a good one. Bye. Thank you. [Music]
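
The two checks described in the captions above (counting how many of SMART attributes 5, 187, 188, 197, and 198 are nonzero, and spotting one drive whose utilization is far above the rest of its RAID group under load) can be sketched roughly in Python. This is only an illustration, not anything shown in the video: the function names, the input dictionaries, and the 30-point `margin` are assumptions made up for the example, and the attribute names in the comments are the commonly used labels, which vendors sometimes word differently.

```python
from statistics import median

# The five SMART attribute IDs named in the video (from the Backblaze article).
# Names are the commonly used labels; some vendors word them differently.
WATCHED_ATTRIBUTES = {
    5: "Reallocated Sectors Count",
    187: "Reported Uncorrectable Errors",
    188: "Command Timeout",
    197: "Current Pending Sector Count",
    198: "Offline Uncorrectable Sector Count",
}


def triage_smart(raw_values: dict[int, int]) -> tuple[str, list[int]]:
    """Return (verdict, flagged_ids) for one drive (hypothetical helper).

    raw_values maps SMART attribute ID -> raw value. Attributes the drive
    does not report are treated as zero, since not all drives support all five.
    """
    flagged = [attr for attr in WATCHED_ATTRIBUTES if raw_values.get(attr, 0) > 0]
    if len(flagged) >= 2:
        verdict = "replace"  # two or more nonzero: replace ASAP, per the video
    elif len(flagged) == 1:
        verdict = "watch"    # one nonzero: probably dying, buy a spare
    else:
        verdict = "ok"
    return verdict, flagged


def utilization_outliers(utilization: dict[str, float], margin: float = 30.0) -> list[str]:
    """Return drives whose utilization is more than `margin` percentage
    points above the group median. The margin is an assumed threshold."""
    typical = median(utilization.values())
    return [name for name, pct in utilization.items() if pct - typical > margin]


if __name__ == "__main__":
    # Values for drive 7 as read out in the video: only 187 is nonzero.
    print(triage_smart({5: 0, 187: 145, 188: 0, 197: 0, 198: 0}))
    # Made-up utilization numbers matching the "30 percent vs 95 percent" example.
    print(utilization_outliers({"Drive 5": 30, "Drive 6": 31, "Drive 7": 95, "Drive 8": 29}))
```

In practice the raw values would have to come from somewhere, such as the SMART attributes page in DSM or a tool like smartctl; how that data is collected is left out here on purpose.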
Info
Channel: SpaceRex
Views: 8,001
Id: NYNNyWUhzmU
Length: 13min 48sec (828 seconds)
Published: Thu Jun 15 2023