A Chat about Linus' DATA Recovery w/ Allan Jude

Video Statistics and Information

Captions
[Music] Everybody had a lot of questions about how I did the Linus Tech Tips data recovery. Well, spoiler alert: it wasn't super crazy, but there were parts of it that were super crazy, parts that were so crazy that I felt like I wanted somebody to at least check my work, check what I was doing, because I was seeing insane stuff. Enter Allan Jude of Klara Systems. If you look on GitHub at the work being done, Allan's doing the work. Welcome.

Hi, thanks for having me.

Allan and his team and everybody else, and so on and so forth. It's a lot of fun, so I figured we could talk about the recovery, what I did in the bowels of the recovery, and the state of things when I started, because it was an imperfect starting position and there were a couple of missteps along the way, but also to talk about ZFS in general, and what to do if you find yourself in this situation, where you've already messed up before now. It's pretty good news for Linus, because even though this is months and months later, the reason it's months and months later is that Linus let me play with his pool in this messed-up state, even though we already got most of the data back months ago. There's always more data to get back, but as you figure out each new dimension of how it's broken, you have to solve the puzzle, and then you have to build machinery to automate over the data set to fix the puzzle in an automated kind of way. So it's a two-step problem, and I thought it would be interesting to talk about ZFS internals a little bit, and the tools for recovery and that sort of thing, because zdb is hugely more useful now than the last time I had to do this.

Yeah, and that's been really good. If you're in this situation, using zdb is really good because it's outside the kernel. If you're hacking on the ZFS driver in the kernel and you get something wrong and the kernel crashes, then you've got to deal with a reboot, or worse. When you're using zdb, it can just read everything directly, and if it crashes, oops, that was a userland thing, not a big deal, you can still do recovery.

Yeah, zdb basically compiles all of the ZFS code as a userspace program, so that if it crashes, you crash the program like a desktop application rather than crashing the whole computer. And zdb has a bunch of flags where, for the types of things where normally ZFS would stop your system dead because something's really wrong and it doesn't want to corrupt all your data, you can say: when you run into these errors, don't crash, keep going, maybe I can still get something useful. Because zdb is mostly read-only as well, it becomes safe: you're doing it all in userland, not with the kernel driver, you're not going to be writing to the pool, you're just kind of groping around on disk, finding what's there and seeing what you can make out of it.

Yeah, and that was hugely useful just because things were so broken. The geometry of the setup was a whole bunch of drives in groups of RAID-Z2, so a whole bunch of RAID-Z2 vdevs, but there hadn't really been any ZFS scrubs in years, so errors wouldn't be detected, and there were failing hard drives, which you'd expect after four or five years of continuous operation. Which is actually pretty impressive: this thing basically operated with zero maintenance for many years on an ancient version of ZFS, I think it was 0.6-point-something.
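To make the "don't crash, keep going" idea above a bit more concrete, here is a minimal sketch of driving zdb read-only from a script. It is an illustration under assumptions, not the exact commands used in the recovery: the pool name "tank" is hypothetical, and only long-standing zdb options appear (-e to examine an exported or unimportable pool, -AAA to press on past failed assertions, -C for the pool configuration, -d for a dataset summary).

```python
#!/usr/bin/env python3
"""Minimal sketch of driving zdb from userland, as discussed above.

Assumptions (not from the video): the pool is exported and named "tank",
and zdb is on PATH.
"""
import subprocess

POOL = "tank"  # hypothetical pool name

def zdb(*args: str) -> str:
    """Run zdb read-only against the exported pool and capture its output."""
    cmd = ["zdb", "-e", "-AAA", *args, POOL]
    result = subprocess.run(cmd, capture_output=True, text=True)
    # On a damaged pool zdb may exit non-zero and still print useful output,
    # so return stdout either way instead of raising.
    return result.stdout

if __name__ == "__main__":
    print(zdb("-C"))   # pool configuration
    print(zdb("-d"))   # one summary line per dataset
```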
Yeah. And there was a backplane failure which was introducing corruption in such a way that the more heavily loaded the disks were, the more it seemed to introduce corruption, which ZFS was catching, but then it started ejecting disks: oh, this disk had a lot of I/O errors. And it's like, well, that disk didn't really have a lot of I/O errors, the controller, or rather the backplane, crapped out at that particular moment because there was a lot of traffic across the backplane. So all of those factors working in concert, along with drives that actually were dying, made the pool not importable at one point. So that was fun.

Yeah. Or, if it is imported and you start kicking out disks because of too many I/O errors, it won't kick out the last disk, that would switch the vdev to a faulted state, but suddenly you're missing a bunch of the other disks that might have been the bits you needed to reconstruct a certain block.

Yes. So you don't want it to always be kicking things out. Under normal operations you do want that, because one slow disk can drag the whole vdev down if it's constantly taking a long time to reply, but in a recovery situation things are a lot different from a production situation.

Ah, it was fun. What we ended up doing was basically taking the whole pool offline and then stepping through each disk, one at a time, to do basic diagnostics on each disk, and as we encountered a disk that had either an excessive number of bad sectors or I/Os that were taking a long time, we would just clone that disk to a new disk, a literal block-for-block clone, and then we would remove that disk and set it aside. That was sufficient to get the pool back into a state where it would import, and things seemed like they were mostly working okay, without any other consternation to really have to worry about. In doing that across a bunch of drives, there were actually two drives with mechanical problems that did not finish their clone, and it's like, well, that's two drives in two different vdevs, I guess we can insert them and see if we're able to resilver. So we did that, and the resilvering was mostly successful, but the original two drives still had sectors on them that would have been useful in the recovery, because there were a couple of instances where other failures meant we couldn't reconstruct some of the missing stuff. And that's actually where we are now: some of the stuff that is still kind of unrecoverable is probably because we needed to surgically move individual LBA sectors. The record size was 128K, so we're moving these in 128K sets at a time, but zdb works well enough to tell you: okay, here are the actual LBA blocks that you're going to need on these devices in order to do this. So I can do dd; worst case scenario, you don't want to use dd, but you can use dd to just go get that 128K and surgically put it at the exact same LBA numbers, so when the ZFS machinery comes along to look at it, it can figure it out, because figuring it out from the programming side of ZFS is less fun than that.

Yeah. In any RAID-Z recovery, ZFS is like: well, I only have a checksum for the whole block, so I know that this version I've reconstructed is wrong, but I don't know which of the children of the vdev contributed to the error.
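Here is a rough sketch of that "surgical" dd step, copying one 128K record's worth of sectors from the partially failed original to the same LBA range on its clone. Everything concrete in it is an assumption: 512-byte logical sectors, the hypothetical device paths /dev/sdx and /dev/sdy, and placeholder LBAs standing in for the ones zdb reports.

```python
#!/usr/bin/env python3
"""Sketch of the surgical 128K record copy described above."""
import subprocess

SECTOR = 512
RECORD = 128 * 1024                  # 128K recordsize
SRC, DST = "/dev/sdx", "/dev/sdy"    # hypothetical old drive and its clone

def copy_record(lba: int) -> None:
    """Copy one 128K record, sector-aligned, from SRC to DST at the same LBA."""
    count = RECORD // SECTOR         # 256 sectors per record
    subprocess.run([
        "dd", f"if={SRC}", f"of={DST}", f"bs={SECTOR}",
        f"skip={lba}", f"seek={lba}", f"count={count}",
        "conv=noerror,sync",         # keep going past read errors and pad them,
                                     # so later sectors stay aligned
    ], check=False)

if __name__ == "__main__":
    for lba in (123456, 789012):     # placeholder LBAs reported by zdb
        copy_record(lba)
```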
So it basically tries every combination of ignoring one of them and then using the parity to find a version that does match the checksum, but sometimes none of them do.

Yes, and that's when you can run into problems, and especially with RAID-Z2 and Z3 it gets to be a lot of combinations, so it takes a while to find the right one. And depending on what was wrong, there might be no version that actually matches. That was one of the extensions we did to zdb for you: just return your best guess instead of nothing. Normally in ZFS you would never do that, right, I give back the right data or I give nothing, but in a recovery scenario, if this is just one bit off or something in a video, the video is still going to be mostly usable.

Yeah. So what we mean is, you've got a 30-gigabyte video file and 128K is messed up inside the video file. You can just tell it to ignore that, and then instead of losing the whole file you're only out that one 128K block. But sometimes we were getting corruption on things that contain references to directories, and it's like, well, let's try to do something with that anyway, because it would be nice to be able to recover the files. Sometimes, when the file is not recoverable from the directory view, you can still access the file by object number in a dataset, or by block number, and so you can just dump everything that you can get from your list of objects in a dataset and go: oh, there's my file, there's the specific file that I'm looking for, and you can cherry-pick that out. We did that with a few files, but I never did build a tool to automatically try to do that, because with the ZFS recovery option to ignore the fact that this file doesn't match its checksum, if we turn that safety off and I still get a 30-gigabyte video file, who cares if it's got a little corruption in the middle of it?

Yeah, because normally ZFS would just stop reading at that point, like, hey, that file's a no, no access, and then you lose 30 gigabytes of data.

Yeah, or everything after the corruption: you can read the first bit of the file, but then it's like, nope, that's corrupt. And in production that's what you want, you never want bad data, but in recovery you want to take some of those seatbelts off. You're talking about the automation of that, and that's what led me originally to start working on zdb, at the OpenZFS Developer Summit hackathon, I guess in 2021. I'd seen people writing Python scripts to drive zdb, where they would print out the indirect blocks for a file and then iterate through those in Python, like, all right, now read this LBA and put it over here, and try to reconstruct the file. I'm like, well, zdb has all the bits, somebody just didn't write the loop over that code where you could say: here's a dataset and an object number, please extract it and write it over here. So I built that, and it allowed you to recover a file and it worked nicely, but I had this issue when I was building it: well, I don't have a pool that's corrupt in strange ways to test this on, to make it do the right thing when some block isn't readable.

You do now.

Yeah. And then Wendell calls up and he's like, hey, I have this pool and it's broken, and we can give access to it, we can try some of these tools. I'm like, that'll be perfect.
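As a hedged illustration of the "go straight to the object number" approach, the sketch below only uses zdb's long-standing -d option at high verbosity to dump an object's block pointers, which is the information you would then feed to a block-level copy; the newer recovery and extraction features Allan describes adding are not shown, and the dataset name and object number are hypothetical.

```python
#!/usr/bin/env python3
"""Dump one object's details and block pointer tree, as discussed above."""
import subprocess

DATASET = "tank/videos"   # hypothetical dataset
OBJECT = "4242"           # hypothetical object number from a directory dump

def dump_object(dataset: str, obj: str) -> str:
    """At -ddddd zdb prints the object's indirect block pointers (DVAs),
    i.e. where each 128K record lives on which vdev."""
    out = subprocess.run(
        ["zdb", "-AAA", "-ddddd", dataset, obj],
        capture_output=True, text=True,
    )
    return out.stdout

if __name__ == "__main__":
    print(dump_object(DATASET, OBJECT))
```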
It has been really interesting to look at ZFS at point-blank range, but also to attempt to do recovery in these kinds of situations. And again, to be clear, Linus already got all the stuff that he cared about back a long time ago, so it's been sort of my desire to do a little bit more forensics at point-blank range. Even though we have some of these blocks and there's really minor corruption, because the backplane introduced corruption or because we lost a few bits here and there in the 128K records, it was actually useful to do things like dump two of the copies of the metadata that was supposed to be for a particular directory, or indirect block list, or whatever, and then compare them and say: okay, neither one of these will decompress, but theoretically these blocks should be the same, so how do they differ, on what bits do they differ? And then, if that's a thousand possible combinations, let's just build a little script to try all thousand possible combinations and see if we get one that will actually decompress correctly, because that might be an option. And then it's like, boom, we've unlocked a few more files. It's tedious and a huge pain, but those kinds of things are possible if you are really desperate, or insane, as the case turns out.

Well, in this particular case it was just such a good opportunity, where this fairly rare but realistic failure mode happened and we had access to the pool for enough time to actually do something other than just get the recovery done and get it back online. [Laughter] It really gave us this opportunity to help flesh out the tools that are available when something like this happens, so that when it is a dire situation you're not having to build all your tools from scratch.

Yeah, if it were a dire situation, around hour number four people are freaking out and breathing down your neck, and we did not have that in this scenario, so big thanks to Linus for that. But there was also just the curiosity of, okay, let's see how deep the rabbit hole goes on this and let's figure this out. In terms of ZFS and ZFS internals this could not have been a better failure mode. Looking at the post-mortem on this, we had multiple drive failures, physical drive failures, in a pool, plus a backplane where the harder you push it, the more corruption it introduces. If you're just doing light I/O, which is restricted by the network in a lot of cases, like "I'm just going to copy files from here to there", you'd never see any corruption, but if you needed to do a scrub, that's sort of the first place where things started to go wrong. It's like, well, if you're not 100% sure about your hardware, don't scrub, because that led to some problems, or to ZFS reacting maybe overly aggressively to an issue.

Yeah. And kind of getting back to what you said earlier, the first thing is: if you're in a situation where things are getting corrupted and so on, stop touching it, and if you can, work off, like you said, a clone of the disks, so that you have the originals and you can still send them away for more forensic recovery or something. Now, that wasn't really an option in a one-petabyte-pool situation, cloning every one of these disks and just having a spare petabyte laying around, but to the degree you can, you want to avoid making changes to the pool.
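The "try all thousand combinations until one decompresses" trick could look roughly like the sketch below. It leans on assumptions that are mine, not the video's: that the block is lz4-compressed with ZFS's 4-byte big-endian length prefix, that the two damaged copies differ in only a few byte positions, and that the python lz4 package is installed.

```python
#!/usr/bin/env python3
"""Brute-force two damaged copies of the same metadata block, as described above."""
import itertools
import struct
import lz4.block

def try_decompress(block: bytes, logical_size: int) -> bytes | None:
    """Attempt to decompress one candidate; return None if it is still bad."""
    (clen,) = struct.unpack(">I", block[:4])   # assumed ZFS lz4 length prefix
    try:
        return lz4.block.decompress(block[4:4 + clen],
                                    uncompressed_size=logical_size)
    except lz4.block.LZ4BlockError:
        return None

def brute_force(copy_a: bytes, copy_b: bytes, logical_size: int) -> bytes | None:
    """Try every mix of the differing bytes between the two damaged copies."""
    diffs = [i for i, (a, b) in enumerate(zip(copy_a, copy_b)) if a != b]
    assert len(diffs) <= 16, "too many differences to brute force sensibly"
    for choice in itertools.product((0, 1), repeat=len(diffs)):
        candidate = bytearray(copy_a)
        for pick, pos in zip(choice, diffs):
            candidate[pos] = copy_b[pos] if pick else copy_a[pos]
        result = try_decompress(bytes(candidate), logical_size)
        if result is not None:
            return result   # first candidate that decompresses cleanly
    return None
```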
So sometimes that even extends to the point of: when you're trying to do imports, make sure you're doing the import with the read-only flag on the pool, so that ZFS won't keep adding data, writing stuff to the disks, and you can just see how much you can get back read-only, without changing things.

Yeah. When I first got hands on the pool, they had tried some things, and it was necessary to roll back the transaction log farther than ZFS wanted to, because some stuff had already been added to the transaction log that I felt had moved us farther away from a working system. Once everything was put together it was like, all right, let's just unwind all that stuff that happened, back to here, and that was a much more recoverable situation than what I started with.

Yeah, for sure. And that's one of the slightly interesting things about the way ZFS works with the transaction groups. Because it was originally designed when disks were 512-byte sectors, it would have the last 128 transaction groups in the uberblock ring, but because disks have 4K sectors now, we only get 32. We're looking at ways to maybe fix that, just so that in the case where you do need to rewind, you can rewind a bit further. Another idea that was proposed at a previous developer summit is that, while we want to keep the last X transaction groups, maybe the tail end of that ring should have a grandfather policy or something, where we keep one from, say, a thousand transactions ago, so that the last couple of entries in the ring are varying degrees of much older. That way, if the last 30 transaction groups aren't enough to get you back before the problem, because depending on how busy the system is you can go through your 30 transaction groups in a couple of minutes, being able to at least step back half an hour might be the difference between being able to import the pool and not.

That would have been handy here.

Yes. So we're looking at making sure that ring gets bigger at some point, using a couple of different mechanisms, and maybe having some policy where the very tail end of that ring has some kind of age-out system, so that we do keep a couple of slightly older uberblocks and can find where that metadata is.

Yeah, that kind of thing would definitely be super handy.

The main reason it doesn't do that right now is that it wants to make sure that when it writes an uberblock, that's an atomic write: we want to always write a whole sector. We could do something that involved more of a read-modify-write approach, but we'd want to stagger it, so that if we do have a short write, say we were in the middle of writing an uberblock in the ring when the power went out or the system crashed or whatever, the other uberblocks we took out aren't the neighboring ones. We took out every 31st or 32nd uberblock in that list, not a bunch in a row where now your best recovery chances all got blown up.

Oh yeah. I have noticed certain models of hard drives sometimes don't really like the way we do the uberblock ring, especially spinning disks, because we're writing to the same offset on the disk all the time. The ZFS labels are in the first 512K and the last 512K of the disk, and in each of those 256K labels is a 128K uberblock ring, and ZFS writes to that same spot on the disk at least every couple of seconds, constantly, the whole time the pool is there.
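A quick back-of-the-envelope check of the 128-versus-32 figure above, under the assumption (mine) that each label carries a 128K uberblock ring and that each uberblock slot is padded to the larger of 1K and the pool's sector size (2^ashift):

```python
#!/usr/bin/env python3
"""Why 512-byte-sector pools get 128 rewind targets and 4K pools get 32."""
RING_BYTES = 128 * 1024          # uberblock ring inside each 256K label

def uberblock_slots(ashift: int) -> int:
    slot = max(1024, 1 << ashift)   # one padded uberblock per slot
    return RING_BYTES // slot

print(uberblock_slots(9))    # 512-byte sectors -> 128 uberblocks to rewind to
print(uberblock_slots(12))   # 4K sectors       -> 32 uberblocks to rewind to
```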
And I had a disk where the other side of that was an unused swap partition, and eventually those disks developed read errors in that unused swap partition, partly, I think, because those sectors had never been written, ever, while the ones beside them were getting written to constantly. Eventually they just never got refreshed or something, and the SMART test would be like, I can't read that sector anymore. I thought that was kind of interesting to see, because spinning disks don't have something like wear leveling where they're going to replace the sector; eventually they do if it fails, but it was interesting to see that all the uberblock sectors still worked fine, because they'd been written to constantly and they're always busy, while sectors near them that were never used were suddenly having issues.

Interesting. That is very, very interesting.

I only saw it with one model. It could have been a firmware bug, or just because of how tight the sectors are together and those ones just never got written to, whereas if they'd actually gotten written to once a year or something they might have been fine. I don't know.

You've got to do the patrol read and rewrite constantly. But, you know, mechanical hard drives now have a really surprisingly low write-endurance lifetime. Like, 20-terabyte drives are only rated to be rewritten, what is it, about 100 times over their entire lifetime?

I guess it depends. I think part of that is just trying to force you to buy the pro versions of the drives, I don't know.

Yeah, so I did the [unintelligible] version of the drive. But I did think it was a little weird to see a spinning hard drive where they'd say, yeah, its rated workload is like 180 terabytes a year. I'm like, that seems low. I'm a little worried about that, but maybe it'll be okay. I'm also worried about the transfer rate of this, too, because one of the recovery tricks was just using zfs send with all the safeties turned off, because then it's a lot easier to read from the data set to the copy of the data set that's been moved from A to B. If you had errors in the first data set, well, it copies everything that it possibly can, but the copy of the data set is more recoverable, even if it's in the same pool. You don't do that unless you're at a point where you're sure it's not going to cause any other problems, but copying the data set, even within the same pool, was a really interesting recovery situation as well, because the act of just moving the data set means, oh, this is actually now more readable than it was before, right? All the metadata is now either correct or missing; there isn't some damaged stuff in here that I'm going to trip over.

Yeah, and it's on hard drives that aren't going to spuriously return errors and so on. It's kind of like you said with the other ones, when you were cloning the drive: the same data, even if some of it's missing, on a working drive is always more usable than maybe slightly more of the data on a drive that's randomly timing out, taking a long time, or returning errors.

Yeah.
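A hedged sketch of that "copy the dataset so the copy is cleaner than the original" trick, on Linux with OpenZFS. The zfs_send_corrupt_data module tunable shown here is an existing knob that substitutes a recognizable pattern for unreadable blocks instead of aborting the send; the additional "keep going" switches Allan mentions adding are not shown, and the pool, dataset, and snapshot names are hypothetical.

```python
#!/usr/bin/env python3
"""Copy a damaged dataset within the same pool, tolerating unreadable blocks."""
import subprocess

SRC = "tank/projects@rescue"      # hypothetical snapshot of the damaged dataset
DST = "tank/projects_recovered"   # hypothetical destination in the same pool

def set_tunable() -> None:
    # Have zfs send emit placeholder data for blocks it cannot read.
    with open("/sys/module/zfs/parameters/zfs_send_corrupt_data", "w") as f:
        f.write("1")

def copy_dataset() -> None:
    # zfs send ... | zfs receive ... within the same pool.
    send = subprocess.Popen(["zfs", "send", SRC], stdout=subprocess.PIPE)
    subprocess.run(["zfs", "receive", DST], stdin=send.stdout, check=False)
    send.stdout.close()
    send.wait()

if __name__ == "__main__":
    set_tunable()
    copy_dataset()
```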
A couple of times, when looking at ZFS at point-blank range, because I was trying to get metadata blocks that were identical, I would pull a 128K metadata block from a damaged drive and the entire block was there except for 512 bytes in the middle that were just all zeros, and it's like, well, obviously the drive had a read error here. But that made it really easy to spot in a hex editor, because there was a lot of information there, almost random, I guess because of the compression.

Yeah. And then, if you have a second copy from a different drive and it has a different hole in it, maybe you can stitch them together.

Yeah, that's where we ended up. Don't be like that: do your backups, do your scrubs, and actually replace the failed disks when they fail, instead of just leaving them there to rot in the back room forever. Have some spares, you know, just a few. That was another thing, too: the chassis were all completely full, and so it's like, okay, I would like to not remove the failing disk. Would you agree? Because I always tell people, yeah, don't take the disk out.

Yeah. Well, always make sure you have a little bit of slack in your chassis, because when a disk is dodgy, what you really want to do is an online replacement. You're doing zpool replace, but you have the old disk still working while the new disk is getting a copy of it, because especially in the case where you have multiple disks failing this way, you're still maybe able to read from the drive you're in the process of removing. Until you have a perfect copy of it, it doesn't get removed, and it just lowers the chance that another drive clunking out is going to cause your pool to get suspended or faulted.

This is incredibly sage advice, everybody. You should definitely listen to this, because it is very hard-won advice.

Very good advice. I would say you probably also want some cold spares or something. If you're buying a system and you're building with lots of hard drives, order a couple more drives than you need, so that when it's time to do a replacement you have a spare, because as we've seen with supply shortages it can be really hard to get the disks that you need in a hurry, and that model might not be available anymore. ZFS will tolerate you using different replacements or whatever, but just order a couple more disks than you need and have them laying around; it's really helpful. And leave some slack in your chassis, because it might also come up, like with another pool we were helping somebody design, that in the future we might want to stick some SSDs in here for a metadata vdev.

Oh yeah.

So, having a couple of spare slots, we can always add more hard drives later if we end up needing the space and don't use it for SSDs, but you can't remove a RAID-Z vdev once it's in there, so maybe don't fill up every single slot in your chassis as part of the pool, and leave yourself a little room for who knows what's going to happen down the road.
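The online-replacement flow described above, sketched with hypothetical pool and device names; zpool replace keeps the old disk attached and readable until the resilver onto the new disk completes, which is the whole point of leaving slack in the chassis.

```python
#!/usr/bin/env python3
"""Online replacement of a dodgy disk, keeping it in service during the resilver."""
import subprocess

POOL = "tank"   # hypothetical pool
OLD = "/dev/disk/by-id/ata-OLD_SERIAL"   # hypothetical failing disk
NEW = "/dev/disk/by-id/ata-NEW_SERIAL"   # hypothetical spare in a free slot

# zpool replace leaves the old disk attached until the resilver finishes,
# which is exactly the "don't pull the failing disk first" advice above.
subprocess.run(["zpool", "replace", POOL, OLD, NEW], check=True)

# Watch the resilver; the old vdev is only detached once this completes.
subprocess.run(["zpool", "status", "-v", POOL], check=True)
```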
That's also another piece of advice that I kind of wanted to get into. Well, the metadata special device, I love that, I did run that myself, but more importantly: metadata copies equals two. It's not a default; it probably should be at this point.

So, most metadata is copies=2 automatically, and the pool-level stuff, like the stuff about datasets, is copies=3. Or, actually, it's copies plus one and copies plus two, so if you set copies=2, then any data that's normally stored twice would actually be stored three times, because it does the base plus the extra copies for the metadata. You can't control the metadata-only copies separately, but there is a dataset setting that goes the other way, which you probably don't want to use for this stuff, which is redundant_metadata, most versus all. That specifically means that for the indirect blocks, the blocks that just point to more blocks, above a certain level it will write fewer copies of those, and that's specifically designed for databases, where you're making an 8K change and you don't want to have to write four levels of metadata multiple times. So you can reduce it, but most of the time you don't want that, other than the special case of "this is a database and we're backing it up differently" and so on. But yeah, you definitely want to be careful with your metadata, and that leads to the question about those metadata special vdev devices. You definitely want mirrors for that, and I usually suggest at least three-way, and then I try to make it not all three the same SSD, because it's a mirror: you're going to write all the data to all the SSDs, the same, all the time, and that means they're all going to wear out at the same time, and that's not good. So you probably want a three-way mirror, and at least one of those SSDs should be a different model or a different brand or something, so it's hopefully not going to wear out at the same time, because that can happen. There are firmware bugs; let's not forget the 65,000 hours of power-on and then it powers off.

Yeah. I recently saw one where a mirror of, like, Samsung 970 or 870 EVO SSDs in somebody's home media pool, both SSDs started having read errors, luckily not in the same sectors, but both of them were basically dying at the same time. We actually did the same thing of basically cloning those to two different SSDs and replacing them, and that stopped the pool from continually suspending itself, because it was like: too many errors from that SSD, kick it out of the pool, then too many errors from the other SSD, and now we have no metadata.

I have a striped mirror in one of my systems of two-terabyte SSDs, and there's one spare two-terabyte SSD, so as soon as you start seeing errors in those SSDs, use it. But they were all staggered; I didn't have the luxury of using different brands, but they had different levels of wear on them when I started.

Exactly. That's basically the other option: add the third one after a couple of months or something, so that it's not at the same wear point.

And that's also really hard-won advice. Just do this tiny little thing; future you will look back and thank past you.

Yeah, after a year, maybe replace one of the SSDs pre-emptively, and then you can keep the old one as a spare and swap it in later, because it still has some endurance left on it. Or set up some monitoring with SMART and watch the endurance on those; when it's getting low, you're going to want to pre-emptively replace at least one of the two in time. And I've seen the same thing where people build a pool with a SLOG on a relatively cheap SSD, wear out its entire endurance, and then suddenly it goes read-only, and ZFS is like, well, a read-only SLOG isn't very helpful. And then it's like, yeah, you should do something about that.
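A sketch of the settings discussed above: extra copies, the redundant_metadata knob mentioned for databases, and a three-way mirrored special vdev built from non-identical SSDs. Pool, dataset, and device names are hypothetical, and whether each setting is right for a given pool is exactly the judgment call being discussed.

```python
#!/usr/bin/env python3
"""Dataset copies, redundant_metadata, and a mirrored special vdev (hypothetical names)."""
import subprocess

def run(*cmd: str) -> None:
    subprocess.run(cmd, check=True)

# Store user data twice; metadata then gets one more copy than the data.
run("zfs", "set", "copies=2", "tank/important")

# The other direction, intended for databases only: fewer copies of indirect blocks.
# run("zfs", "set", "redundant_metadata=most", "tank/db")

# A three-way mirrored special vdev for metadata, ideally not three identical
# SSDs, so they don't all wear out at the same time.
run("zpool", "add", "tank", "special", "mirror",
    "/dev/disk/by-id/nvme-BRAND_A_1",
    "/dev/disk/by-id/nvme-BRAND_A_2",
    "/dev/disk/by-id/nvme-BRAND_B_1")
```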
That might involve just removing the SLOG temporarily, and losing that performance benefit, but the system's not very happy when your SLOG wears out.

Well, I'm trying to think if there was anything else you can think of that was interesting about the recovery. In the end we got, by percentages, something like 99.8 or 99.9 percent of the data back, and we probably could get the rest of it, it's just tedious LBA block surgery, and Linus doesn't care at this point. But we have some time, and we have some stuff that we can do that maybe is of wider benefit to the community for ZFS recovery. I was trying to think if there was anything else interesting about the recovery.

Yeah, I think mostly it's looking at the features we would add to zdb based on it. We have the improved version of my recovery mode for individual objects, we have the "return the data even if the checksum is wrong" option, and we're looking at expanding that to let you selectively, or iteratively, try each of the RAID-Z reconstructions to see which one looks most likely to have been the right one, and making sure that that's accessible.

I want a tool that will give me all of the metadata blocks so I can compare them and see: okay, here's how this one was, and here's how this one was, and here's how this one was. Although I never did find a pattern in the way that the backplane was failing, just in the checksums, because one of the thoughts that went through my head was: if the backplane was failing in such a way that this packet of data that's in flight was coming from bad RAM or something, then these two bits are always bad, or these three bits are always bad. But that seemed not to be the case.

Yeah, it sounded like maybe they were just all zeros a lot of the time, which says maybe it was more of a problem at the transport layer, probably electrical or something, and it just went bad or couldn't be error-corrected or something.

Yeah, and there is supposed to be in-flight data protection there. That's another dimension of this: these were all SATA disks on a SAS controller, and SATA actually, if you really get super into the weeds here, operates at a lower differential voltage and is less noise-immune than SAS signaling on the same cabling. A lot of people are like, oh, it's the same cabling, and it's like, well, for one, SAS has a redundant path, but the redundant path is handled at a protocol level, so you can choose that path: the controller and the driver can agree, oh, something went wrong on this path, let's try this other path. None of those are options with SATA. And the other thing is that at an electrical level, the characteristics of the differential signal with SAS are better than SATA in general and will push a signal over a longer distance.

Yeah. But it's definitely an interesting case to investigate, just because it was one of these real-world failure scenarios, not a theoretical one, and we had enough time to be able to dig into it.

Yeah, and without a lot of headache. I would say probably 90% of it was available immediately: okay, let's take a deep breath, assess where we are, look at what's happened, okay, this is our plan of attack. Jake was helping with the recovery; I was doing it all remotely, so I was like, okay, swap this disk, move this cable over here, swap this disk, move this cable over there.
And honestly, I probably spent more time just waiting on him to have time to do that. Then, okay, this is good, and then finally it was like, wait, this chassis is also bad, and it's like, what have you done? So then we moved everything over to the new chassis, and that's where it's been for the last few months, so things have gone a lot better there. But immediately, once you've taken the faulty hardware out of the equation, the machinery of ZFS, basically out of the box, got 90% recovery of a petabyte. That's pretty good.

Yeah, and you know that it's the correct data because of all of the integrity stuff. In any other scenario you wouldn't know whether what you got back was what you had actually written.

Exactly. Even just thinking about basic hardware RAID mirrors, the kind your BIOS can do, or a real controller or whatever: it tries to write the data the same to both disks, and if at any point, for some reason, one of the disks isn't the same as the other, it wouldn't know which one of those two blocks is right. Right, and it wouldn't know they were different most of the time, either. It can do its patrol-scrub thing, where it will just rewrite the second disk to match the first disk to eliminate the problem, but when it's doing that it doesn't know if the first disk actually had the right data or the wrong data. Whereas ZFS, because of the checksum, knows which one was wrong and fixes the other one.

I've done videos on that, demonstrating it with different controllers, and then I get these long missives from these greybeards, I guess, who work in this space, and it's like, no, that's not how it is at all, I have this fancy whatever, and then they name the model, and it's like, it knows when there's an error and it can check it. And it's like, no. Well, if it's doing that, it has an extra record somewhere: it's using 520-byte sectors instead of 512-byte sectors, or it's doing an extra write somewhere, or something in there. And they're like, no, it's not. And it's like, here's the place in the manual where it says you have to use 520-byte disks with this solution in order to get this feature. And then I never hear back.

Yeah, in order to fit that checksum. And that checksum is a lot smaller than the one in ZFS.

Yes. [Laughter]

Right, ZFS is using a 256-bit checksum; that's not going to fit in that extra little sector. And then people are going to pop up and say, oh, eight bytes is fine, and at that point you're just arguing math, you're arguing that two plus two equals three, and you've lost, so give up. It's like, 64-bit checksums, we stopped using those how long ago, because of collisions? Certainly there would have been a lot of collisions here. I mean, that's kind of what's happened: okay, I've got a couple of candidate copies of this, but nothing will actually... Even at the wire level, there were enough checksum collisions that data with errors was getting through, is what I'm saying, and we can demonstrate that that was happening, looking at the post-mortem of this. But constructing that argument happens about 100 hours into a recovery like this, because then you've got everything laid out and you can see: well, the only way this would get through that layer is if it passed the checksums of the physical layer. SATA has checksums at the physical layer, but it's just like ECC: it can fix one-bit errors and detect two-bit errors, and then after that it throws up its hands.
Yeah. Well, after a two-bit error you're not guaranteed to even detect it, which is demonstrably what happened here, because we were otherwise getting errors and it's like, oh, let's retry this. But with consumer hardware, a lot of the time when you get that two-bit error, it'll just silently retry. A lot of systems are configured, even at the PCIe level on consumer systems, so that PCIe error reporting, it's not turned off, but it doesn't report to the system, because people were freaking out and saying, why is this?

Well, yeah. In particular, with Windows 95 and Windows 98, if the hard drive ever returned an error, it was like, your drive's toast, and it would scare people. So hard drive manufacturers made the drive go, well, we'll try five times internally before we ever return an error up to the controller, and then maybe the driver and the OS will try five times again, so now we have 25 tries before we finally get an error. That's how we ended up with these drives with time-limited error recovery, where it's like, please don't just go away for 90 seconds retrying, just give me the error. There are still a lot of drives that do that, consumer drives.

Yeah. And talking about the other bit, where ZFS was kicking drives out because of these errors that turned out to be the backplane, not the disks: we just opened a pull request last week to make ZED, the ZFS event daemon on Linux, more programmable, so you can say you need to have this many errors in this many seconds before we take a drive out, and be able to control that with the new vdev properties mechanism.

Oh yeah, it'll be nice to use some userland scripts to tie that together with SMART. It's like, okay, let's do a quick SMART test on this drive that has errors, because that's what I did in a lot of cases for Linus's setup: okay, this drive is producing a lot of errors that seem like they're read errors, but let's actually take the pool offline, shut everything down, and then let this disk run a long offline SMART test. So we just leave the disk alone for a while, let it do its thing, and if it comes back clean it's like, I think this is a cabling or backplane issue with this particular drive. There's no guarantee of that, but if the drive didn't think anything was wrong, then maybe move it to the other chassis, and look, it works fine. But if the drive is like, yeah, I've got corrupt sectors all over the place, then even if the backplane is the problem here, that drive is also toast.

Yeah, and that's what we had. Fun times. But, step by step, I guess, what would your advice be to somebody in this situation? Probably hire your company to deal with it.

Yeah, well, I think the first one is: if you can, shut it off and stop poking at it, and if it's practical, work off a copy of all the data, like, image all the drives and work off that. Not always possible, because if you're talking about hundreds of terabytes or more that gets kind of impractical, but if it's a smaller pool, that's an option. And make sure that you don't take any of your future recovery options off the table by having modified it too much. Work on the copies of the drives; ZFS does not care. It's like, oh, my worldwide number is different? ZFS does not care.
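The "let the suspect drive run a long SMART self-test" step described above could be scripted roughly like this with smartmontools; the device path is hypothetical.

```python
#!/usr/bin/env python3
"""Run an extended SMART self-test on a suspect drive, then review the results."""
import subprocess

DISK = "/dev/sdx"   # hypothetical suspect drive

# Kick off the long (extended) offline self-test; the drive runs it on its own,
# so leave it alone until the test completes.
subprocess.run(["smartctl", "-t", "long", DISK], check=True)

# Later: check the self-test log and error counters. A clean result points the
# finger at cabling or the backplane rather than the drive itself.
subprocess.run(["smartctl", "-a", DISK], check=True)
```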
No, ZFS has its own identifier written in the label on the disk, in four places, and if you clone the drive, that'll be the same, so it won't even know. So, if you can, stop touching it; if you are going to touch it, work off a copy if you can; and then, yeah, call for help, because we might be able to sort you out really quickly without a lot of headache. But if you keep poking at things, the longer it's been since the problem, the less likely we are to be able to rewind to a point before the problem.

I think when we started, it was like, yeah, this has been scrubbing for a week and it doesn't look like it's ever going to finish, and it's just like, oh no. The pool's already read-only; what have you done?

Well, if it's read-only, the scrub wouldn't make anything worse, unless the load just happened to make a hard drive mechanically fail.

But it went read-only as a result of the scrub.

That's not an ideal situation, no.

Let's see, what else is interesting if you find yourself in this situation? zfs send, actually, and this is also kind of a side branch: zfs send as a mechanism for doing backups. Even if you have an absurdly huge data set like a petabyte, and even if you're adding gigabytes of video per day, zfs send is still the least insane way to deal with that, because it's your journal. It just knows what changed from yesterday to today; there's no cost to look that up, you can just start sending the data.

Yeah, every block has a birth time, which transaction group this block was created or modified in, so when you say "give me everything that changed since this snapshot", it can just start sending it out as fast as your disks or network can go, right away. Whereas rsync has to look at every file and be like, when were you last modified, okay, let me read you and see what your checksum is, and I'm also doing that on the other side to see if it's changed, and it's like, wow, that's going to take forever. In a previous recovery, they were trying to use rsync to do the copy, so that it would skip over the files that had read errors, but it was going so slowly it was going to take them weeks and weeks to get the data copied off to another machine. When we switched to zfs send with the extra settings we added, we made more tunables for zfs send, so that when it encounters corruption it can just return a known string of "this was a ZFS bad block". That feature's been around for a while, but if it encountered a specific object it couldn't read, it would stop: any corrupt metadata would make it break down. We added a switch that says no, keep going no matter what, and they were able to get most of their data copied over to the other machine in a couple of days instead of a couple of weeks, and that was mostly just down to the fact that they only had a 10-gigabit network.

I would also say that zfs send affords really interesting recovery possibilities, in that even if you had an ancient copy of the data set and the object, and you hadn't backed up in weeks, in this scenario where we just need a few 128K blocks, as a matter of chances there's a good chance that one of those blocks exists on the old system, and we could pull it off the old system and put it on the new system. It's like, oh, this data hadn't really changed in a while; and oh, well, actually a bunch of files in this folder did change, okay, that would all be updated, but we could still pull the files, and pull whatever blocks we needed, from the old data set, and we would know where they are, because they're stored the same way on the remote data set as on the local one.
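A minimal sketch of the incremental zfs send flow being contrasted with rsync above: because every block carries the transaction group it was born in, ZFS only streams what changed between the two snapshots. Snapshot, pool, and host names are hypothetical.

```python
#!/usr/bin/env python3
"""Incremental zfs send to a remote backup host (hypothetical names)."""
import subprocess

SRC_OLD = "tank/video@monday"    # last snapshot already on the backup
SRC_NEW = "tank/video@tuesday"   # newest snapshot to transfer

# Send only the blocks born after @monday, piped to the backup host over ssh.
send = subprocess.Popen(["zfs", "send", "-i", SRC_OLD, SRC_NEW],
                        stdout=subprocess.PIPE)
subprocess.run(["ssh", "backup-host", "zfs", "receive", "backuppool/video"],
               stdin=send.stdout, check=False)
send.stdout.close()
send.wait()
```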
Yeah, and actually a feature for that got committed to ZFS two or three weeks ago. It's called something like a repair send: if on your backup copy you end up with data corruption of a specific sector, or it could be your master copy, like one of the two copies goes bad, it can construct the right send stream to send just the data that you're missing, from the backup back to the primary, or from the primary to fix the backup. You can say, this snapshot, this block is broken, just send me that bit, and you can actually repair broken data from your backup without having to do a full restore or try to do this manually, especially if it's not in the live version of the data.

That's amazing.

Yeah, so repair send got committed to OpenZFS master, I think, two or three weeks ago.

I'm thinking about repair-sending those metadata special devices as an extra way to hedge your bets.

That sounds really exciting. Well, just remember that the metadata special devices are not an extra copy of the metadata, they are the only copy, so redundancy out the wazoo on those metadata devices.

We'll send those off-site in real time. I don't really worry about the data, but it's like, let's send the metadata devices off-site in real time, just in case. It's like, oh, we don't have those blocks? Well, that's okay, we'll get those back eventually.

With Linus's recovery basically out of the way, let's look to the future. What's on the horizon for ZFS that you're excited about? What's coming down the pike?

I think the biggest one I'm excited about is the block reference tree, or BRT, which is basically per-file cloning, and also allows you to, say, move a file from one data set to another without having to copy all the data.

That's nice.

So it's basically a kind of opt-in, better version of dedup. The way it works is that you can specifically clone the blocks of an object and write it to a different data set, or the same data set, in a way where you only have to write the metadata, and it keeps it in a much more efficient tree than the dedup one. It'll basically let you easily clone a whole file, so if you have, like, a VMDK image or something, you can just clone the whole thing really quickly without having to clone the whole file system like you do now, where currently I'd suggest people make a separate data set for each VM so you can easily clone it. Being able to do that on a per-file basis is really interesting, but also just being able to copy a file to a different data set without having to actually copy all the data when it's big, or specifically if you're trying to restore a copy of a file from a snapshot: currently you end up making a whole new copy of it, rather than actually referencing the blocks you're keeping around for the snapshot anyway. The block reference tree allows you to do that, and it would hook up to the cp reflink thing in Linux, or a similar thing in FreeBSD, where you'd be able to just easily clone a file, and it's much faster because all you're doing is writing out the indirect blocks, the metadata for the file, and pointing to the same blocks on disk. It goes in this block reference tree, which keeps track of, hey, if you delete one of the copies of the file, we know we can't free that space because it's still in use by this other copy.

Nice, that does sound like an amazing feature that I would use the heck out of, a lot.
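A sketch of how the repair-send idea might be used, hedged heavily: the feature had only just landed in OpenZFS master at the time of this conversation, and the "corrective receive" flag spelling below is my assumption, so check zfs-receive(8) on your release. All names are hypothetical.

```python
#!/usr/bin/env python3
"""Heal a damaged snapshot in place from a healthy copy on the backup host.

Hedged sketch: assumes a corrective-receive flag (-c) on zfs receive, which
may be spelled differently or absent depending on the OpenZFS release.
"""
import subprocess

SNAP = "backuppool/video@monday"   # healthy copy on the backup side
TARGET = "tank/video@monday"       # damaged copy to be healed in place

send = subprocess.Popen(["ssh", "backup-host", "zfs", "send", SNAP],
                        stdout=subprocess.PIPE)
subprocess.run(["zfs", "receive", "-c", TARGET], stdin=send.stdout, check=False)
send.stdout.close()
send.wait()
```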
Yeah, and that was developed by my friend Pawel Jakub Dawidek, who did the original port of ZFS to FreeBSD back, I don't know, it's got to be almost 15 years ago now.

A lot of enterprises are going to benefit from that feature.

Yeah, and it's one of the only things that Btrfs has that ZFS doesn't, so...

Other than lots of bugs.

Yeah, there you go. I wasn't going to say it, but there we go.

And then also NVMe. I'm always all about ZFS on NVMe, and I want it to go really fast, but in the past when I've used it, it makes an insane number of memory copies. But you have a talk coming up, and I will link to that in the description when it goes live. This video will probably come out after you've given the talk, but we're recording it before, so you can plug your talk a little bit, and there'll be a link for it in the description.

Yeah. So we've not quite gotten to solving the memory copies, but because of the way ZFS was built, and the fact that it was designed in the year 2001, flash wasn't really that big of a thing yet, and when it was, the thinking was: we're going to have these tiny flash disks that we can use a little bit of to accelerate the pool, like a SLOG or L2ARC, not "I'm going to build my entire pool out of this flash", which would have cost way too much money at the time. So ZFS does a lot of things to try to optimize for the limitations of hard drives, where your average seek time is going to be like four milliseconds or more, and four milliseconds is low for a hard drive, basically, whereas for an NVMe drive your read and write latencies can be less than 100 microseconds. So, like I said, 4,000 microseconds is the best a hard drive can do, and the NVMes can maybe do it in 50.

Or 16, yeah.

Yeah, we're getting close to the day where we're going to have single-digit microseconds for these things. And the other big thing is the interface. With SATA and so on, you can basically queue up a bunch of commands, but really the hard drive is doing one at a time, right? It can only move the head to one place, do some work, and then move to another place. By queuing commands you can help it decide which order to do them in, and some stuff like that, but that's the limitation, and SAS is the same thing. With NVMe, you have a bunch of different queues, each of which can contain a bunch of different commands, so even on your typical drive you have like 63 queues, plus one for admin commands, so they never have to wait, and then a depth of between 16 and 64 for how many things you can put in each one. To get the performance out of an NVMe, ZFS needs to actually fill those queues and give the NVMe lots of work to do, but because it was designed for hard drives, ZFS is like, I don't want any of the queues to get too long, because if something important comes up we don't want it to end up at the back of a really long line. With an NVMe, that's really not going to happen, so you have to teach it these things. In my talk, I cover some of those changes and what we did to ZFS, with some patches, to get it to go faster. For the customer we did the work for, they had an NVMe-over-fabrics pool, and we got it from going two and a bit gigabytes a second to like seven gigabytes a second.

Nice.

Which made a big difference. The big thing was getting the latency of their writes down from like 25 milliseconds to like 13.
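In the same spirit as the queue-depth discussion above, these are existing OpenZFS per-vdev I/O scheduler module parameters on Linux; the values are illustrative guesses, not the tuning from Allan's talk, so benchmark before copying anything.

```python
#!/usr/bin/env python3
"""Raise per-vdev I/O queue limits, which default to hard-drive-friendly values."""
from pathlib import Path

PARAMS = {
    # per-vdev active I/O limits for each I/O class
    "zfs_vdev_async_read_max_active": 16,
    "zfs_vdev_async_write_max_active": 32,
    "zfs_vdev_sync_read_max_active": 32,
    "zfs_vdev_sync_write_max_active": 32,
}

for name, value in PARAMS.items():
    # Runtime change only; persist via /etc/modprobe.d if the results hold up.
    Path("/sys/module/zfs/parameters", name).write_text(str(value))
```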
Oh yeah, that's a big improvement. And there's still more work to do to make ZFS better at it, but ZFS just needs to learn how different an NVMe is from a hard drive.

Eventually it's all just going to be persistent memory.

Yeah, and that's a whole other thing. Another OpenZFS developer did his thesis on persistent memory and has some designs on how you'd make the SLOG work better on persistent memory than the current way it's set up to work on something like an SSD.

Nice. Well, thank you for joining me, I really appreciate it. It's been an interesting look, and not super technical; you can kind of understand what we did at a technical level to recover Linus's pool. But the lesson here is definitely that an ounce of prevention is worth a pound of cure: check your backups, monitor things, pay somebody to monitor your stuff, whatever the solution is. It was not a completely catastrophic situation, even given less-than-ideal inputs.

ZFS will try very hard to make sure that you don't lose your data, in spite of whatever you're doing to get in its way.

Yeah. The more experiences I have like this with ZFS, the more I'm convinced that there's really no other option if you care about your data, because I've worked on other things, I've worked on those fabulously expensive enterprise systems, and even those companies don't care that much. It's like, oh, you should have bought our other thing that backs up your stuff off-site, and because you didn't, this is your punishment. It's like, well, thanks, large company that we paid millions of dollars.

Definitely. If you need somebody to babysit the pool, Klara Systems is available. But don't pull me into something that's going to take hundreds of hours and melt my brain, because, oh boy. But you can do it with them.

Don't be like Linus and just ignore your pool. If you don't want to manage it and maintain it, you can hire Klara to come in and check on it all the time, and send you a report every month saying, this is how it's going, this is what you need to do, this is when you're going to run out of space at this rate. Somebody needs to watch the pool. It doesn't have to be you, but somebody needs to watch it, otherwise you end up in this situation.

You've got to keep an eye on things, yes. I would wholeheartedly agree with that. Klara Systems, Allan Jude, thank you.

Thank you, and we'll see you later.

I'll be in the forums. You can probably be reached for phone support, but you need a retainer for that. [Music]
Info
Channel: Level1Techs
Views: 55,086
Keywords: technology, science, design, ux, computers, hardware, software, programming, level1, l1, level one
Id: uYAezxwIxUw
Length: 52min 11sec (3131 seconds)
Published: Wed Sep 14 2022