Destroying a Storage Cluster PART 2: A Catastrophic Failure with Recovery Process

Video Statistics and Information

Reddit Comments

Is there a reason that you can't use the Gluster with only 1 up? Does it allow more storage than is in one of the storinators?

👍 2 · u/[deleted] · Feb 21 2017 · replies

booooooooop booooop booop

👍 3 · u/clumz · Feb 21 2017 · replies

A follow-up to Part 1, where we set up a replicated cluster with GlusterFS and ZFS using (4) Storinator XL60s and attempted to disrupt a file transfer. Watch Part 1

Many of you requested that we take this a step further and fully bring the cluster down with a catastrophic failure, then bring it all back up to show no corruption. Brett replicated the process with a smaller build to show off the resiliency of GlusterFS and ZFS.

👍 1 · u/cmcgean45 · Feb 21 2017 · replies
Captions
[Music] Hey guys, Brett Kelly here from 45Drives, and today we're doing a follow-up to one of our past videos where I tried to fail our Gluster cluster while keeping a file transfer up. In that video I had four XL60 Storinators, fully loaded with 8-terabyte drives, in a distributed-replica cluster, and we were just trying to disrupt the file transfer by pulling the network cord, pulling drives, and even shutting down a server, all to no avail. Although that was pretty cool, a lot of you wanted me to take it further: fully bring the whole thing down in a catastrophic failure, bring it all back up, and show that everything was where it was supposed to be, with no corruption or anything.

All right, so this time I've got three Q30s in a replicated cluster. I've got 12 disks in each, and I have nine terabytes of usable space. Now, this is a much smaller build than the last one I did, but it really illustrates the point, because I'm going to do the same thing: big or little, GlusterFS, or just clustering in general, is awesome.

So, just a little background on the setup I've got here. This is a two-way replicated volume that is distributed across my three servers, meaning there are two copies of every file distributed across the total volume. This setup gives me a level of high availability, as I can fail any one of the three servers and still access all the files, but also the benefit of being able to scale beyond the capacity of any single server. This is in contrast to a three-way replica I could have built, where I could lose two of the three servers, but where I would only ever have the capacity of a single server. So really what I'm trying to say is GlusterFS and our hardware are so flexible that you can build whatever you need to get the job done for you.

All right, so let's get into it. I've got a Windows client here and I've got my cluster, as mentioned. I'm going to start a file transfer like last time, and then I'm going to go to town on this thing. No graceful shutdowns this time; I'm just pulling power. So we'll copy over some fairly decent-sized big files; I know that's what Gluster does really well. I'm on a one-gig connection and I've maxed my line out. So the file's running, someone's using this thing: let's take her down. You see it hiccups just a bit when one node goes down, then it comes back up. All right, so one failure, not too bad.

Let's say this next guy, they didn't put it on a surge protector, so some drives got killed. (Beep, beep... this thing has to stop making that noise.) I have this in a 2x6 RAID-Z2, that is, two six-disk RAID-Z2 vdevs, and I've just pulled three drives from each, and it's still going, so you've got to love replication. And, you know, that's definitely not enough, so there goes the power on that guy too. Sadly, now that that one went down, with only one node left there's not enough to keep the transfer up: we've killed it. But that's not enough for me, so I'm just going to pull this one too. This is the IT man's worst nightmare; this is bad news. You can see "tank" is unavailable. But the good news is we can bring it all back up, safe and sound.

Now we're going to bring it back up, so just to give you a brief overview of what I'm going to do: the hardware in these two should be fine, they just lost power unexpectedly, so I'm going to bring these guys up and they're going to reshare the share. And this one, which experienced device failure, I'll have to replace the drives, rebuild the pool, and then re-add it back into the cluster. Typically, when one of your servers has a catastrophic failure like this with the drives, your stuff is inaccessible until you rebuild the whole pool; but because it's part of a bigger cluster, we can rebuild it with ease, take our time, fix the problem that happened here, and then just add it back in like nothing happened. So we're going to get started; I'll just turn these ones on.
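For reference, the recovery checks he runs in the next step map to roughly the following commands. This is a sketch, not a verbatim capture from the video: the ZFS pool and Gluster volume are both named "tank" as shown on screen, and the nodes are assumed to run ZFS on Linux and GlusterFS under systemd.

    # On each recovered node, check that the ZFS pool imported cleanly:
    zpool status tank

    # Check that the Gluster daemon came back up after the hard power loss:
    systemctl status glusterd

    # Confirm the volume's bricks are online and see which peers are back:
    gluster volume status tank
    gluster peer status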
All right, so our first two nodes are up. They went down fairly ungracefully, so I would imagine there's a good chance the services didn't start correctly, but you never know, right? So we'll go in, we'll check the RAID, we'll check the services. In a perfect world it might have already started the share for us, but we're going to find out. Once we get these two up and running, and they're serving the share again and everyone can get working, we will bring the third node back up, rebuild the RAID, and add it back in like nothing happened.

All right, so let's get started. I've remoted into the first two nodes here, and we'll check the RAID first. They should be all right, because I didn't touch any of their disks, but you never know. So, everything is online: that's good. The next one: online, everything's mounted. Alrighty, so let's check the service, the Gluster service, glusterd. And that's running. Very nice, both of them. Okay, so that means my volume should be all right; let's make sure with "gluster vol status". The CTDB volume: that's up and running. Tank: up and running. So let's see if the share started. There it is, okay, that's good news. And so two nodes are up, that's okay, and the third is disconnected, just as I wanted. So let's make sure I can ping it. Good news. That's pretty slick: pulled the power, and just on reboot, yep, it brought everything back.

Okay, so let's make sure we can write to it; reading is all well and good, but... there we go, we can make a new folder, that's good news. I'm just going to drag the files over, because why not, just to be extra sure. Watch it pull the whole file over... oh yeah, beautiful. Okay, so we're back up, and everyone can get back to work now; as far as they're concerned, the cluster is okay. However, should one of these two nodes go down right now, the cluster would go back down, so we've got to get this third one up and running as soon as possible.

All right, so these nodes are back up and running, the share is accessible, and everyone's back to work, so we've got to rebuild the RAID here and let it resilver. What I'm going to do is just get rid of my failed drives here. I've got equivalent replacements, so I can just pop these in wherever. All right, so I've replaced my drives; now I have to go back and tell the volume to... sorry, not the volume, the zpool... to switch out the old ones and replace them with the new ones. It'll start a resilver, and when the resilver is done we can add it back into the Gluster volume and we're all done.

All right, so I've plugged my disks in, and you can see the yellow entries are my new disks. So let's check the status of our zpool. Okay, so we've got four unavailable disks; we just need to swap them out. Let's just start with the first one. So it's just "zpool replace tank": the first disk to replace is 1-3, and I'm going to replace it with... hang on, let me get my disk map back up. Okay, so the first to replace is disk 1-3, so slot 1-3, and actually I put one of the replacements right back into slot 1-3, so that's perfect: I'll just replace it with itself. So that's slot 1-3 on "tank", that being the name of the pool, and it's working away here.
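The replace step he types here looks roughly like this. A sketch, assuming the pool is named "tank" and the disks use slot aliases like "1-3" (the kind of mapping set up through /etc/zfs/vdev_id.conf), as shown in the video:

    # Swap a failed disk for its replacement. Because the new disk went
    # into the same physical slot, the old and new device share an alias,
    # so the disk is effectively "replaced with itself":
    zpool replace tank 1-3

    # If a replacement landed in a different slot, name both devices:
    # zpool replace tank <failed-disk-alias> <new-disk-alias>

    # Repeat for the remaining failed disks, then watch the resilver
    # run and the pool climb back from DEGRADED to ONLINE:
    zpool status tank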
Okay, so 1-3 has now turned green, which means it wrote partitions to it; that's good. "zpool status", and now you can see 1-3 back in. It's still degraded, because it still has three more disks to add in, but good news: nothing's too broken. So that's the zpool replace; let's work through the next ones.

All right, so the RAID is rebuilt, and I've just re-added it back into the CTDB share so all our Windows clients can get on it. So yeah, all nodes are okay. We can ping Croatoan... if I spelled it right... there we go, pings are coming back. So all three nodes are good, and people are still on our share. Yeah, there it is. So: failed the whole pool, brought it back up, and the cluster is still there. Good news.

All right, so we had our cluster up and running, serving the Windows share out to our clients. We were moving files across, and we simulated every IT person's worst nightmare: ungraceful power shutdowns, plus drive failure in one pod. The service stayed up until we lost power on two of our three nodes, so it's quite resilient. But even better, we were able to plug it all back in: those two nodes restarted with no problem at all, so we got everything back up and running pretty quickly, and you can take a little more time to rebuild your failed array and rejoin it like nothing really went wrong. So we took what would really ruin someone's month and made it an afternoon affair, which is pretty awesome.

And really, what it goes to show is this: in my last video we did this with a huge cluster, at nearly a petabyte, and with this one we only had nine terabytes of usable space, but the point stands either way. Gluster, or clustering in general with open-source tools and our hardware, means you can build something crazy resilient, big or small.

All right guys, thanks for watching. If you've got any questions or comments, feel free to leave them in the comment section below. If you're interested in clustering, our Storinators are the most robust, reliable, and affordable servers out there. Give us a call or send us an email; we'd love to help. Thanks again.
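When the rebuilt node rejoins the volume, Gluster's self-heal copies over any files that were written while it was down. A sketch of how that rejoin can be verified, again assuming the volume is named "tank":

    # Confirm the rebuilt node shows up as a connected peer again:
    gluster peer status

    # List files still queued to be healed onto the rebuilt bricks;
    # an empty list means the replicas are back in sync:
    gluster volume heal tank info

    # Optionally trigger a full heal instead of waiting for the
    # self-heal daemon to pick the files up on its own:
    gluster volume heal tank full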
Info
Channel: 45Drives
Views: 23,693
Rating: 4.955801 out of 5
Keywords: storage cluster, glusterfs, gluster cluster, storage server, nas server, network attached storage, zfs, file transfer, open storage
Id: otu9qhoOpaQ
Length: 12min 57sec (777 seconds)
Published: Thu Feb 16 2017