We’ve NEVER done this before… - Mother Vault Part 1 - JBOD

Video Statistics and Information

Captions
Yeehaw, this is one bucking bronco of a server! Okay, that didn't work, but it doesn't matter, because we have a serious problem, and this, this is one serious solution. In spite of all the issues we encountered with our archival storage system, dubbed the Vault, we've actually been able to recover nearly all of the data spread across the 2.4 petabytes of raw storage that made it up. Except that, because of the recovery process and the sheer amount of capacity required to hold it all, that data is now scattered to the high winds across the five discrete servers that are required to hold it, some of which are getting close to a decade old. We had no desire to return to the admittedly imperfect setup we had before, which means it is time for... oh, oh, Mother Vault, so called because this is one giant mother. The final deployment is gonna be not one, not two, but three of these, each capable of housing nearly two petabytes of hard-drive-based storage. Is it time to bring back "holy"? Did it ever go away? No, it didn't, just like our sponsor, Crucial. Don't just work faster, work better. When you need to open, transfer, or download files faster, get some great RAM with less lag time and more efficiency with Crucial. Get your Crucial DDR5 RAM today at the link down below.

[Music]

Before we get to the new stuff, we're going to have to get you guys up to speed. Back in 2015, we built the first Petabyte Project, or Vault cluster. It was composed of two 45Drives 60-bay Storinator servers holding 120 10-terabyte hard drives, giving us a total capacity of 1.2 petabytes raw. These were combined into a single network share using a file system called GlusterFS. A few years later we built New Vault, or Petabyte Project 2, which used 75 16-terabyte drives across two more 45Drives Storinators. Again, they were joined together with Gluster. However, since those two clusters were built several years apart, we ended up setting them up as two discrete clusters. That meant we had two separate network storage shares, which you can kind of
think of as having two separate drives in your system. So if you want to find a file and you don't know which drive it's on, chances are you're going to have to search through them separately. This was fine at the time, because we could usually guess which server a project was on based on how old it was, and honestly, searching just two sources is not that difficult. However, if we went to scale that up again and build a third, eventually a fourth cluster, well, it would start to get pretty annoying, not to mention inefficient for the staff who have to use the freaking thing. The other issue is that managing even a single system takes a good amount of time, and every time we scale it up in this manner it's another set of components: CPU, motherboard, RAM, power supplies...

Ah, oh my god, this is really heavy. That's the heaviest. It's 130 kilograms. I think that's when it's full. Holy crap. You ready? Oh my god. I told you you had the heavy side. Yeah, but what the hell.

Anyway, more hardware means more potential issues, more potential service. So with that in mind, we've been looking for an alternative solution for some time. And while we could have continued to use GlusterFS and expand the existing clusters, we just don't really have anyone on staff who is familiar with it, and it's not that user friendly. I mean, heck, even 45Drives, the company who provided us with the original Petabyte Project, have actually switched primarily to a different cluster solution called Ceph. Now, Ceph is cool, really cool. We use it for Floatplane, for example, and it's a bit better for scaling up, but it shares a lot of the same problems of our old setup: we don't have a ton of people on staff who are experienced with it, and it would have the same or similar maintenance requirements. So instead we've opted to switch to a configuration that many home labs and data centers adore, one I personally have actually never used: the JBOD.

You can fit so much hard drive in here. This magnificent beast is the Supermicro 947HE1C-R2K05JBOD.
It's a 90-bay chassis that, as the name JBOD implies, is quite literally just a bunch of disks. There's actually no computer in there, which raises the question: what is here in the back, where the compute module or multiple compute modules would normally be? This is it. This is what passes for the computer in this chassis. And there is some computing going on here. We've got some kind of ASIC that is... I mean, is this an FPGA even? I'm not sure. It would be for SAS, though. Yeah, it could be. Are they doing enough? Oh, oh gosh. Let's see. You don't want to break it. No, I don't want to break it, I just want to see it. There's a button cell battery in there. This chassis does have IPMI, so that's what the battery would be for: keeping track of time. So there's going to be some compute for that, probably on the bottom board, but the rest of it should just be SAS stuff.

There's that IPMI management port, and then the only other I/O on this machine is these six Mini-SAS external ports. What are these, 88? So each of these is four lanes of 12 gigabit SAS, so it's 48 gig per port, and you can use up to three at one time for connection to the head server, which is, what is that, 144 gigabit? Okay, casual. Perfect. Then, oh, you can see the ASPEED chip for the IPMI right there, sitting in there. There'll be some non-volatile memory on there too somewhere. It might actually be this; that looks like a NAND package.

These are some chungus cooling fans. Look how thick they are. There's two in there, right? 80 millimeter fans. I actually don't think so. It's dual blade, but no, I think these are individual fans, Jake. Oh no, it's got to be two, it's got to be two. There's two discrete connectors. Either way, I guarantee you they absolutely rip. And then aside from, I guess these are probably the SAS and that would be your power, that's pretty much it. Very, very simple, and that's with intention. It's designed to be reliable, right?

Our other module here is, well, it doesn't have a whole lot going on, to be perfectly honest with you. What on
earth does this do? You've got a small controller board that has a couple SATA ports, a couple NVMe ports, and more power, and whatever the crap this is. This is probably PCIe. I think when you buy the computer version of this, these two NVMes would come over here somewhere if you just had a single compute node, but yeah, I'm not really sure.

Okay, you could have a JBOD that's something called dual path. Now, this is a single path, which means there's one path for the data to go from the drives to whatever computer it's plugged into. Because remember, there's no computer in here; it's got to plug into a different computer. Yeah. But in a dual path there's two connections, and you'll actually have redundant SAS expanders, which is the part that splits into multiple drives. You'll have redundant connections, so there'll be six more back here, and you'll have another IPMI. And the reason for that is you can have a high-availability controller, so you could have controller 1 and controller 2, and if controller 1 had problems, it would just fail over to controller 2 and you'd have better availability. The problem is that in order to take advantage of that we need to use SAS drives, and that's a hassle. Well, not only is it a hassle, but we already have 3.6 petabytes of SATA drives. That's a hassle. That's like a $150,000 hassle. Not doing it.

Now let's take a look at the SAS expanders that make this magic happen. Naturally they're hot swappable, because servers. And in a nutshell, each of these takes in four of those SAS 12 gigabit per second lanes and then is able to split that bandwidth to up to 30 flippin' drives. So we only actually need three of these to populate the entire 90-bay capacity of this JBOD. By the way, Jake was talking about how you could have a dual path. Well, you'd need three more expanders then, but we don't need them. Jake, do you have any idea what these connectors are called? I have no idea. Shout out them, because these things are flipping cool. Look at the way these are joined back
onto the board. That is dense with two S's. That's, oh my god, the pins on the bottom, right? Holy crap.

The next thing I need, Jake, is for you to get me the head. It's right there. Right here. You can think of your JBOD kind of like an external hard drive, or in this case many, many external hard drives, which means that, as Jake alluded to before, we need to have a computer somewhere. Well, that is where the head, or the controller server, comes into play. This particular machine is a placeholder. Supermicro is actually sending us over a dual AMD EPYC Milan server that's a little more tailored for this kind of application, so instead of being full of NVMe drive bays in the front, it's got its PCIe lanes allocated to allow for lots of HBAs and network connectivity in the back. Because hard drives may be slow, but when you hook up enough of them you can be pushing some serious freaking data, and this is going to be connected to up to 270 drives. Not 255? I think that's what it works out to. 90 times three is 270. And we don't have 98.
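As a rough check on the numbers quoted so far (four 12 Gb/s SAS lanes per external port, up to three ports to the head server, 30 drives per expander, 90 bays per JBOD, three JBODs), here is a minimal Python sketch. The constant and function names are made up for illustration, and it uses raw line rates only, ignoring encoding and protocol overhead.

```python
# Back-of-the-envelope figures quoted in the video.
# All names here are illustrative, not from any vendor tool.
LANES_PER_PORT = 4         # each external Mini-SAS port carries four lanes
GBIT_PER_LANE = 12         # 12 Gb/s SAS, raw line rate
PORTS_TO_HEAD = 3          # up to three ports can go to the head server
BAYS_PER_JBOD = 90
DRIVES_PER_EXPANDER = 30   # each expander fans out to up to 30 drives
JBOD_COUNT = 3             # planned final deployment


def port_gbit():
    """Raw bandwidth of one external port, in gigabits per second."""
    return LANES_PER_PORT * GBIT_PER_LANE


def uplink_gbit(ports=PORTS_TO_HEAD):
    """Raw bandwidth of all ports connected to the head server."""
    return ports * port_gbit()


def expanders_per_jbod():
    """Expanders needed to reach every bay (ceiling division)."""
    return -(-BAYS_PER_JBOD // DRIVES_PER_EXPANDER)


def total_bays():
    """Drive bays across the whole planned deployment."""
    return BAYS_PER_JBOD * JBOD_COUNT


print("per port  :", port_gbit(), "Gb/s")
print("uplink    :", uplink_gbit(), "Gb/s")
print("expanders :", expanders_per_jbod())
print("total bays:", total_bays())
```

A dual-path chassis would double the expander count, which matches the "three more expanders" remark.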
Oh, but yeah, I mean, you could say you could add more drives. It's just JBOD. So, cool, right?

In the meantime, for the purposes of our demo, we've loaded this machine up with three of these Broadcom HBAs. These things actually contain very little logic compared to RAID cards, which had to perform parity calculations on the data running through them. These do almost nothing. They just take your bandwidth from your PCIe slot, in this case a PCIe x8 Gen 4 slot, run it through a SAS controller, and then that breaks out into 4, 8, 12, or 16 SAS connections.

Do you wanna see my wonderful cardboard? Yeah. This math actually makes a lot of sense. If you have 90 drives and, we'll say optimistically, they're running at 250 megabytes a second, that's 22.5 gigabytes a second. It's pretty fast. Conveniently, three sets of 48 gigabit SAS cables is around 18 gigabytes a second, and even more conveniently, PCIe Gen 4 x8 is roughly 16 gigabytes a second in either direction. So we're going to be giving up some of the theoretical maximum performance of our drives, but given that this is only under perfect conditions, with brand new high-performance drives, reading and writing sequential data with nothing on them, this will be lots. And that's times three, because we have three JBODs. We're never gonna get anywhere close to this. Maybe one gigabyte a second.

I'd like to point out something else that's convenient. What? Check this out. See that ratchet? Back pressure... wow. I can start a screw with it. And in spite of the silver shaft, look at that, I used it and my dick didn't fall off. I actually think the silver shaft looks better. Wow. All right, he's actually been one of the bigger skeptics internally about this whole project. I'm just a little bit critical. But that ratchet, I want it to be good. Okay, you can sign up for a notification when it comes in stock: lttstore.com. Try to sound less stressed. In the meantime we also
have shirts and stuff. Cash flow's a little tight.

Now, you might look at this machine and think that's ludicrous overkill, surely the one that Supermicro is going to send is a little more pedestrian. But that's not actually the case. Even though we're not running super-high-speed NVMe drives, when you're hooking up hundreds of hard drives with the potential to expand to hundreds more, you need CPU and RAM galore, more than you'd think. In fact, we actually have two 32-core EPYC 75F3 processors, those are the frequency-oriented 280 watt CPUs, and we're going to have around a terabyte of RAM, with the potential to expand that down the road, because we will be using ZFS, and ZFS greatly benefits from RAM for read caching. We'll also have NVMe drives and stuff, but that's TBD. You'll have to get subscribed for the next video, when we actually deploy the whole... Are we doing tiering? No. Oh. Well, maybe one day. It's possible. Let you know a secret: we might build a highly available wanik and set this up as a tiered archival layer, entirely transparent to the user. So just one drive? Yeah. Letter, one drive.

Oh my god, these are just drive sleds. I thought these were boxes of hard drives, Jake. No, those are the trays. I've been deceived. Do you want to see what the trays are like? We were talking about how we wanted to discuss... yeah, we also need the high-performance CPU for ZFS compression. I was like, oh well, we'll just dump a bunch of drives in here and then we'll talk about, okay, we've got this much capacity, but we're going to use compression. And then Jake's like, oh, we've only got 13 drives to put in it for the demo. Come on, Jake. It's almost like all our hard drives are deployed holding data temporarily from the old Vault. Yeah, almost. Okay, well, I guess it's a 13-drive demo. If you want Ed to yell at you, we could take apart Delta 5.
That's got New Vault on it, I think, and some of Old Vault. When Ed yells at me... Wendell's still working on Delta 3, so that's not available. Temp Vault, we could do Temp Vault, but then Ed would also yell at you. Probably it's not fit. 13 drives seems fine.

All right, we found another solution to some test hard drives. This here Storinator was full of 60 spinnerinos, and now they're in here. It has made the balance of this system super sketchy. Like, oh, it gets worse. Look, look at that. [Laughter] Oh god, oh, I can't reach. Okay, there we go. But there you go, there's 60 hard drives. They're kind of in a weird orientation because there's kind of a poop mix of drives in here. There's some tens, there's some 12s, there's some 16s, I think there's even one 20, so I just wanted to kind of separate those. But this is our test. Now, these 60 drives do already have an array on them, and I think it's about 70% full. This is some of our existing archival data. But they're going to be set up as four vdevs that are RAID-Z2, 15 drives wide. Not great for performance. I don't think we're gonna see anything crazy in terms of numbers out of this machine, but it is kind of indicative of what we're gonna have it set up as. We're probably gonna switch to 10-drive RAID-Z2s just to make it a little bit faster, but this should still be pretty dang fast.

Now, in order to get a connection between our server and our JBOD, we have these cool Mini-SAS HD external cables from Infinite Cables. I think Supermicro also sent us some. By default this JBOD's configured as a single zone. Now, you can do kind of cool stuff with zoning in a JBOD. For instance, you could split this up into two zones and have two controller servers being serviced by one JBOD, so you'd have 45 drives on either. You can also do three zones, so that'd be 30, 30, 30, to three controller servers. But they can't access the other zone's data; you'd only get the zone that you're actually physically attached to. So you'd have A, C, or E on the back of this server. Now,
in our case we're going to be using three zones, but to one controller server. And the reason for this is it allows us, instead of getting one cable, which is 48 gigabit bandwidth, we now get two, and then three, for 144 gigabit, which again is probably overkill. But what do we do that's not overkill?

Hi. You might need these. Like, for realsies. No, don't, just wait. This is just the power stuff. That's the power supplies. That's like full tilt, though. [Applause] Whoa, this is ludicrously loud. So I put it in the server room with the door shut; you can hear it in the bedroom. But they're actually at a hundred percent right now. They should never run at that, never. And with the 60 drives, I already tried it, like running a load, and they barely even spin up. Okay, that's good. Yeah, I got a little worried. If the air conditioning ever fails in the server room, we'll know, because we'll hear this everywhere. There we go. Oh, give it a sec. Okay, coming back to planet Earth. That's a lot more reasonable. I mean, I guess that's what fans that thick will do for you.

Well, I mean, when you have 90 hard drives you're talking like at least a thousand watts. Yeah, that's a lot of power. I mean, what are these power supplies? They're 2,000 watt on 208 volt. 2,000 watt power supplies. So to have actual power supply redundancy you need to be running 208 volt. Wow. Otherwise you'll probably use enough power that it needs to split across the two. So we might actually have to rewire our server room to be 208 volt, because for some reason it wasn't wired that way in the first place. I don't know, I don't remember why. Brian, there's probably a reason. Yeah, there's a reason, and I bet you there's a way to fix it. Oh my god, yeah, you can fix anything with money. Mm-hmm. You have lots of that right now, right? No. Oh well, that's fine, we'll figure it out. You know what, no, we should just put these in Lab 2, because that one won't be too late, right? Time to put more insulation on the walls. Jesus. With nails. Yeah, I'll show you that
quick config here. This is where you can change the zone, so there's single zone, two zone, or three zone. Again, like I explained before, you could use that for multiple controllers or to increase throughput. We're doing the throughput route, of course. That's pretty much all you can do in here. There's no remote control; like, there's no screen. It doesn't have a display output, right? Yeah. Power. It's just a big, dumb bunch of disks. System critical? I don't know why. Probably because you pulled the power supply. Probably. Plug it back in. I'm sorry about that, buddy. Here you go. Yeah, there's your medicine, right in the butt. Suppository medicine power cable.

Now tell me something. If we decided to go high availability in the future, nothing would actually prevent us from adding a second interface unit back there? Yes. A second head server? Yes. No problem, right? We need SAS drives. Sorry, not high availability at the... oh, 'cause dual path. Oh crap. Okay, yeah.

The thing about this approach, though, is we've already proven we don't really depend on this. Yeah, it's gotta be like eight months at this point that we haven't had it. It's funny how many people were listening to us in the video saying, yeah, it's really optional, retaining this data, we lost it because we don't really care that much, and we're like wise, like, no, we haven't had it for almost a year at this point, and it's been totally fine. It is nice to have, but here's nice to have. If this controller server were to poop the bed or have some problem, well, everything here should still be... we can just unplug this, we'll have some other server temporarily, so like a cold spare. Put the HP... yeah, a basic spare server. I mean, you're talking maybe an hour of downtime, realistically.

I was looking at this and I thought maybe this was... I saw "four vdevs," I thought that was the four drives. I just saw it in the corner of my eye and I thought you had put the 30 terabyte NVMes in here. We might, though. Wendell's been talking about this special
metadata device for a while, and basically he was saying he has like half a petabyte with 24 terabytes of special metadata device in ZFS, and it fills. So I'm like, we might as well just put four 30-terabytes in there, and that'll give us 60 terabytes. What does special metadata do? So, you know, like file metadata and the directory structure? It's like searching really fast. It makes searching way faster. Oh, that's cool. So it stores metadata, like where files are. He's talking about it right here. It's literally Wendell's. This is the first thing that comes up. Hi, Wendell.

Yeah, yeah, but here's the thing: we can have a level 2 ARC, we can have a log device, and we can have special data. Yeah, so he's saying 172 terabytes of space is five terabytes of metadata. Okay, but here's the thing, we don't have to have all of its metadata in there; like, it'll pull it when it needs it, right? Okay. Yeah. "By default this includes all the metadata, the indirect blocks of user data, and any deduplication tables" stored on that device. So maybe deduplication; if you are running deduplication, this is probably really important to improving performance. Yeah, but either way we'll try it.

As you can see, it's about 70% full, not ideal for performance, and on top of that it's an existing pool that's set up in four vdevs, right? Yeah. So they're 15 wide. Yeah. So really I'm expecting like maybe three to five gigabytes a second. Okay. Sequential. Like, I would not be surprised if that's where we end up. Because when we move to the final deployment we're gonna do 10 wide, which will be better for performance. We'll do 10-wide RAID-Z2s, which is basically the same as 15-wide RAID-Z3s. Yeah. And then we'll obviously have a lot more drives. If you've got 10 drives of which you could lose two of them before you have any data loss, that's kind of the same as having 15 drives of which you could lose three, in terms of the ratio of the whole pool, if you look at it from a wide perspective. But it's probably slightly better to have
more vdevs, if you run the probability analysis anyway. So this... no, no, we gotta explain. The dataset is set up with caching only metadata, which is what we want. The RAM cache is turned off. Let's give it a second. There we go, we're getting like... let's just give it, you know, just warm up there, bud. Come on. Okay, that's about two, two and two gigabytes. John, you can do better than that. I didn't even look at it. Come on. Three, three. Yeah, it's not three sustained, that's for sure. 3.6. You can't just read the high numbers. I'm reading just the high numbers. That was 3.7 there for a second. Yeah, it really doesn't work like that. That's 579. Oh, there's four thousand percent. There's point three. Probably has something to do with the fact that the drives are pretty full already. Remember, we have no NVMe caching, we have no RAM caching, like this is raw dog giant vdev, just disk.

Now tell me something, hold on a second. Wait, here's something I don't fully understand. This is really cool for maintenance, the fact that you can have the system running and you don't have to pull all your cables out at the back in order to see. Yeah, the cable's staying in the same spot, super nice. But how is all of this still connected? These drives are blinking, these are doing things. Oh, it's just like a ribbon. Yeah, giant ribbon cable in the back. You see it back there? Oh, that's super cool. Man, that's amazing for maintenance. So you can just go, okay, yeah, I've got a bad drive in bay whatever, pull this out. You don't have to slide any servers out. See you later. Because this is internal, right? You just pull it out and you grab that one. We're gonna have to label them. Oh yeah. Or I might just do a spreadsheet, honestly. Yeah, that would be fine. If they were SAS drives you'd have, like, SAS enclosure service or something like that, SES, whatever. Anyways, you can go and be like, what's in that slot, or by drive you can say what slot is that drive in, and it'll tell you. But since they're SATA drives, they'll just show as not
connected. Let's put this back in, this is making me anxious. Let's do that. I'm like, I just... oh boy. Okay, yeah, so we're doing 16 jobs at an I/O depth of 16, block size one meg. It's gonna be probably basically the same. Oh, two thousand. Whew, blow my skirt up. More consistent, though. Yeah. Like, nope, there it goes. See you later. In the grand scheme of things, we're going to be writing to this thing at, like, basically peak one gigabyte a second, ever, and it will be better when there are more vdevs. Okay, then I think we're good. Good to tell you about our sponsor, Vessi.

Vessi proves that waterproof shoes don't need to be ugly or uncomfortable. Thanks to their Dyma-tex technology, you can keep those toes dry while remaining sleek, snug, and stylish. Need even more comfort and breathability? The Everyday Move line has added padding at the midsole and a looser knit to keep up with your busy, active lifestyle. Do you hate tying laces? Tired of chasing the rabbit around the tree and down the hole? Well, fear not, because Vessi's Everyday Move comes with handy pull tabs for easy slip-on action and are made 100% creature-free, so you can be walking around on fluffy clouds with a clear conscience all day long. And when Vessi says Everyday Move, they mean it. Hot, cold, wet, dry: stay comfortable in any weather. Just think of all the freedom you can get with your new shoes. So treat those feet of yours to Vessi Everyday Move shoes and save 25 with offer code LinusTechTips at vessi.com/linustechtips.

If you guys enjoyed this video, you might also enjoy the one where we explain what happened to the old Vault and how we lost nearly all of the data on it. What? We got it all back. Well, yeah, I know, but first we lost it. Well, like, sort of.
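The throughput and redundancy arithmetic from the demo can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are invented here, the figures are the round numbers quoted in the video (250 MB/s per drive, 48 Gb/s per SAS cable, roughly 16 GB/s for PCIe Gen 4 x8), and it does not model how traffic is actually spread across the three HBAs or any protocol overhead.

```python
# Illustrative only: round numbers as quoted in the video, hypothetical names.

def drives_gbyte_per_s(drive_count, mb_per_drive):
    """Aggregate sequential throughput of the drives, in GB/s."""
    return drive_count * mb_per_drive / 1000


def sas_gbyte_per_s(cables, gbit_per_cable=48):
    """Raw SAS cable bandwidth, converted from gigabits to gigabytes."""
    return cables * gbit_per_cable / 8


def bottleneck(stages):
    """Return (name, GB/s) of the slowest stage in the chain."""
    name = min(stages, key=stages.get)
    return name, stages[name]


def parity_fraction(width, parity):
    """Share of each vdev's drives that can fail without data loss."""
    return parity / width


def vdev_count(drive_count, width):
    """Number of whole vdevs you can build from a set of drives."""
    return drive_count // width


stages = {
    "90 drives @ 250 MB/s": drives_gbyte_per_s(90, 250),
    "3 x 48 Gb/s SAS": sas_gbyte_per_s(3),
    "PCIe Gen 4 x8 (approx.)": 16.0,  # rough figure quoted in the video
}
print(stages)
print("slowest stage:", bottleneck(stages))

# 10-wide RAID-Z2 and 15-wide RAID-Z3 tolerate the same fraction of failures,
# but the 10-wide layout yields more vdevs from the same 60 test drives.
print(parity_fraction(10, 2), parity_fraction(15, 3))
print(vdev_count(60, 10), "vs", vdev_count(60, 15), "vdevs")
```

The equal parity fractions are the "same ratio" point made in the video; the vdev count is where the two layouts differ.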
Info
Channel: Linus Tech Tips
Views: 2,001,947
Keywords: the mother vault, holy shit server, holy server, the vault server, the biggest server ever, supermicro, jbod
Id: n_izpaZ0u5o
Length: 26min 0sec (1560 seconds)
Published: Sun Jul 03 2022