ZFS vs. RAID - vdevs and more!

Captions
Good morning, everybody, and welcome back to Next Door NetAdmin. I have mentioned ZFS a couple of times on this channel, but always in passing; I mentioned it when I was setting up a new OPNsense server and went into a little bit of detail. Today I thought I would go into more detail and let you know how it works, why I'm interested in it, and why I think it's beneficial. Sound good? Cool.

ZFS is most naturally compared to a traditional RAID system. In a traditional RAID system you have multiple disks all working together, and there are different topologies, different setups, for how you want those disks to work together. RAID 0 is just a stripe across multiple disks: if you lose any of those disks, everything is gone, and all you benefit from is the increase in speed from spreading each write across multiple disks. RAID 1 is a mirror: every write is duplicated onto two disks. You don't get any increase in speed, at least not on writes (you can get a read speed increase if your controller handles it properly), but you do get redundancy. You also have two duplicate copies, so you only get to use half of your total disk space. There are other topologies: RAID 3 used to be common, not anymore; RAID 4 used to be out there, not anymore. RAID 5 used to be pretty common. It is a stripe across multiple disks, but with parity, and this means you can lose one disk (up to one disk, not more) and the additional parity that has been calculated and written to disk will fill in the blank and allow the machine to reconstruct the missing data. If you lose more than one disk, everything is gone. RAID 6 is similar, except the parity is calculated in two dimensions. Dimensions? Don't worry about the details; you can think of it as two different parity calculations, such that in RAID 6 you can lose up to two disks and use the different parity calculations to reconstitute the missing data. But again, not more: lose more than two disks and everything is gone.

RAID works on whole disks. It either knows the disk is there or the disk is not there; it has no way to ask whether the data being returned by an individual disk is good or bad. RAID is also file system agnostic: it doesn't care whether you're running NTFS or FAT32 or ext4 or anything else. To a RAID controller, all of these disks are simply block devices, and in return it presents a single logical block device. It says, "here is a disk drive, do with it as you wish," and the RAID controller handles the details of managing the disks. This is a very layered approach (not a network layer; we're not going into the OSI model here): you have disks, you have RAID, then you have the file system, and you keep going from there.

When you have data split across multiple disks, whether with RAID 0, RAID 5, RAID 6, or whatever, one of the settings you need to choose is a stripe width. This controls the amount of data that is sent to each disk. If you have a 128-kilobyte stripe width, it doesn't matter that you only have a 4-kilobyte file to write: it's a 128-kilobyte stripe width, so you have to read 128 kilobytes of data back in from your disks, recalculate the parity with the changes from the 4-kilobyte file, and then write 128 kilobytes back out. This is known as RAID write amplification, because every write has to be done at your stripe width.
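To put rough numbers on that read-modify-write penalty, here is a small Python sketch of my own (not from the video or from a real controller: four data disks plus one XOR parity disk, and a full-stripe update as described above; real controllers can sometimes do a partial-stripe update, but the amplification idea is the same):

```python
# Simplified read-modify-write for one RAID-5-style stripe (illustration only).
STRIPE_WIDTH = 128 * 1024    # full stripe width, chosen once at array creation time
FILE_WRITE   = 4 * 1024      # the application only changed 4 KiB

def xor_parity(chunks):
    """Parity is the byte-wise XOR of the data chunks; XOR-ing the parity with
    the surviving chunks rebuilds any single lost chunk."""
    out = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, b in enumerate(chunk):
            out[i] ^= b
    return bytes(out)

# Four data disks hold 32 KiB each of the 128 KiB stripe, plus one parity disk.
data = [bytearray(STRIPE_WIDTH // 4) for _ in range(4)]
bytes_read = STRIPE_WIDTH                    # step 1: read the whole stripe back in
data[0][:FILE_WRITE] = b"x" * FILE_WRITE     # step 2: apply the 4 KiB change in memory
parity = xor_parity(data)                    # step 3: recompute parity for the stripe
bytes_written = STRIPE_WIDTH + len(parity)   # step 4: write the stripe and parity back

print(f"4 KiB requested -> {bytes_read} B read, {bytes_written} B written "
      f"({bytes_written // FILE_WRITE}x write amplification)")
```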
You can choose a smaller stripe width, say 32 or 64 kilobytes, and that's better if you're working with lots of small files, or you can go bigger, like 256 or 512 kilobytes; I've even seen some these days with a full megabyte stripe width. Bigger is better for large files because fewer separate reads and writes need to be done, but if you have small files, the write amplification is going to hit you pretty hard. The crucial point is that this is a decision that can only be made once, when you're setting up the array. You can't change it afterwards; you'd have to delete the array and recreate it, because once it's set, all of your data is written at that stripe width. Period. That's just how it is.

So, ZFS. Let's talk about that. ZFS has a lot of powerful features, but it breaks the layered assumption of disks, RAID, file system. It does this because the file system has information that can optimize the way the RAID is handled, and the RAID can do a better job if it has more information about the actual disks it is controlling.

Before I get too far into the details, there's one other thing I'm going to mention. It's pretty common among server admins and network admins to say: if you're going to be working with RAID, you want a hardware RAID controller, not software RAID. Why is this? Software RAID relies on a specific driver; if you don't have the driver, you can't talk to the array. A hardware RAID controller will typically present just an abstract block device ("here is a disk drive, do with it as you wish") and then handle the details itself. Hardware RAID also has its own onboard chip, essentially, which can handle splitting the data, reading the data, and doing the parity calculations, depending on how good that chip is, and it doesn't have to be micromanaged by the operating system. ZFS, being a file system, is not a hardware controller; it's software. Does that make it worse? There are ways to mitigate this, and ZFS uses them to prevent itself from becoming an undue burden on the system and slowing down performance.

Okay, let's start by talking about the actual physical setup of ZFS. In ZFS, your main central storage array, as you might call it, is not called an array; it's called a zpool. It is a pool of storage, and from there you can carve out sections of it to do various things (we'll get to those in a little bit). Your zpool is composed of one or more vdevs, virtual devices. Each one of these vdevs can be a different topology; it's up to you. You can have a single disk as a vdev if you want. You can have a mirror of disks as a vdev. You can have three disks as a three-way mirror, or n-way mirrors in general: how many copies of the data do you want? If you want five disks all with the same data, you can do a five-way mirror. Your choice. You can do a RAID-Z1, which is similar to RAID 5 but not actually RAID 5 (we'll cover the difference in a moment); RAID-Z1 can likewise take one drive failure and the vdev will keep running, the vdev will not have failed. With RAID-Z2 you can lose up to two disks and the vdev will continue functioning. RAID-Z3 does its parity calculations in three different dimensions, so it can take up to three drive failures and keep running.
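For a rough feel of how the vdev types trade capacity for redundancy, here is a small Python sketch of my own (the vdev_summary helper and the 4 TB disk size are made up for illustration, and real pools lose a bit more space to metadata and padding):

```python
# Rough usable capacity and fault tolerance per vdev type (illustration only).
def vdev_summary(kind: str, disks: int, disk_tb: float) -> tuple[float, int]:
    """Return (approx usable TB, max disk failures survived) for one vdev."""
    if kind == "mirror":                 # n-way mirror: one disk of space, n-1 failures
        return disk_tb, disks - 1
    if kind.startswith("raidz"):         # raidz1/2/3: parity disks = 1, 2, or 3
        parity = int(kind[-1])
        return (disks - parity) * disk_tb, parity
    if kind == "single":                 # a lone disk: no redundancy at all
        return disk_tb, 0
    raise ValueError(f"unknown vdev type: {kind}")

for kind, disks in [("mirror", 2), ("mirror", 3), ("raidz1", 4),
                    ("raidz2", 6), ("raidz3", 8), ("single", 1)]:
    usable, failures = vdev_summary(kind, disks, disk_tb=4.0)
    print(f"{kind:7} x{disks}: ~{usable:.0f} TB usable, survives {failures} disk failure(s)")
```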
This is all very well and good, but like I said, each vdev can be a different architecture if you want, within the same zpool. That means you could have a mirror, a RAID-Z1, and a RAID-Z3 all as part of the same pool, each with a different topology. Where is this useful? I don't think it is, but it's a thing that can happen, and that is why you need to be aware of it. All data in the zpool is striped across all the vdevs; that's how it works. So if you add a new vdev, then as new data comes into the pool it will be striped across all of those vdevs, and if you happen to have a vdev that is a RAID-Z1, for example, then that vdev's share of the stripe is in turn striped across the disks inside it.

However, if you make the mistake of accidentally adding a single-disk vdev to your pool, what have you just done? Historically you could not remove a vdev from the pool once you had added it (newer OpenZFS releases can sometimes evacuate a single-disk or mirror vdev with device removal, but never a RAID-Z vdev, so don't count on an undo). If you've added a single disk to your pool, you've essentially dragged the performance of the pool down toward that of a single disk, because that one disk has to take part in all the reads and writes, and if that one disk fails, boom, you've lost your pool. There is no redundancy at the pool level; the redundancy is provided by the vdevs. If your pool is one mirror, you can lose a disk out of it and the pool survives. If you have two mirrors, you can lose a disk from each one; your vdevs are still functional, and so your pool continues. But if both disks of one of those mirrors die, your pool is broken: you've lost one of your virtual devices, and since data is striped across the vdevs, losing a vdev is the same as having two single devices in RAID 0 and losing one of them. Your data is lost. So if you have multiple vdevs but one of them is a single disk, then if it goes, everything goes. And it's remarkably easy in some cases for new admins, or admins who are distracted, to accidentally add a single-disk vdev, and now you're kind of stuck with it. There is, I think, one way you can fix this somewhat, which is by converting the single disk into a mirror by attaching a second disk to it. You can also break a mirror by setting one of its devices offline, but that's not something you want to do, and it's really not optimal.

Say you want a topology where you have one vdev as a RAID-Z2 and another vdev as a RAID-Z2, and then you add a vdev which is just two disks in a mirror. That mirror is not going to have the same read or write performance as the other vdevs where you've got, say, four disks each in RAID-Z2: when you're reading and writing you can pull from four disks at once there, whereas the mirror is stuck serving the same amount of data from two disks, because again the data is striped, which means that if the capacity of the vdevs is the same, then the same amount of data is going to go to each. ZFS is also intelligent about this. Say you have three vdevs, because that's the example we're going with, one of which is 50% full, and you add a new vdev, which is obviously 0% full because it's brand new. ZFS will put proportionally more of the incoming data on the new vdev because it has more free space available. It tries to balance the workload according to how much storage is available, but not necessarily according to performance; it's by capacity.
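To illustrate that capacity-based balancing, here is a toy Python allocator of my own (the allocate function and the vdev names are hypothetical, and real ZFS weighs more than just free space, but the proportional idea is the same):

```python
# Toy allocator: spread new data across vdevs in proportion to free space (illustration only).
def allocate(write_bytes: int, vdevs: dict[str, dict[str, int]]) -> dict[str, int]:
    free = {name: v["size"] - v["used"] for name, v in vdevs.items()}
    total_free = sum(free.values())
    return {name: write_bytes * f // total_free for name, f in free.items()}

pool = {
    "raidz2-0": {"size": 1000, "used": 500},  # 50% full
    "raidz2-1": {"size": 1000, "used": 500},  # 50% full
    "raidz2-2": {"size": 1000, "used": 0},    # the brand-new, empty vdev
}
print(allocate(200, pool))
# -> {'raidz2-0': 50, 'raidz2-1': 50, 'raidz2-2': 100}: the empty vdev takes more of the write.
```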
So it's best if you, the administrator, figure out how you want this topology to work and are really confident about it. Be really careful when you're adding new devices, because there is no undo button for adding a new vdev. Do it wrong and you might have broken your entire setup. That's fun, isn't it, for certain values of the word "fun."

So that's the vdev level, and it's important to note that you can pretty much replicate a traditional RAID setup this way. If you have multiple mirrored pairs and you're striping across them, a stripe over mirrors, that's a RAID 10 in typical speech. If you have multiple RAID-Z1s, you're now running a RAID 50, or a RAID 60 if you're using RAID-Z2s. There is no traditional RAID equivalent for RAID-Z3, so it just is what it is. Not a lot of people I know have actually run RAID-Z3, either; it's an option, same as using a single disk is an option, which doesn't mean that a lot of people want to use it.

What separates RAID-Z1 and RAID-Z2 from the corresponding RAID 5 and RAID 6 is another feature ZFS has: why fix yourself to an arbitrary 128-kilobyte stripe size? Because ZFS is also the file system, it knows how big the write is. Only writing a 4-kilobyte file? Use a smaller stripe for that. Writing a 5-gigabyte file? A small stripe isn't the best way to go, so use a bigger one. ZFS uses a dynamic stripe width according to how much data it has to write in each transaction group. It's not necessarily per file, but depending on how many writes have been accumulated at any given point in time, it will set the stripe width based on that. It's trying to make sure it doesn't have to do more calculation and more reading or writing than it actually has to.

This is intelligent, and ZFS can do it because it has direct control of the disks, which is why it's very important that you never, never, never run ZFS on top of a RAID controller. The RAID controller abstracts all the disks away and just says, "I have one big logical drive, use it," so ZFS can't actually see the individual disks. With ZFS you want each individual disk individually addressable, with no RAID whatsoever. When you do this, suddenly ZFS can say: okay, I pulled data from the two drives in a mirrored pair, I got "XYZ" out of one and "XAZ" out of the other. Because ZFS saves a checksum of every block of data, it can look at the data that's returned and say: according to this checksum, "XYZ" is correct and "XAZ" is incorrect. So transparently, while reading that data back, it notices that one disk has an incorrect copy and silently goes and fixes it, on a block-by-block basis. This is very different from traditional RAID. With traditional RAID you don't have a per-block confirmation that the data is correct; either the disk is there or the disk is not there. A failing disk in your RAID can end up passing you corrupt data, and in particularly bad cases you can end up with the RAID controller saying, "I've got mismatched data and I don't know which copy is correct." That's a RAID puncture, and if you get one there's not a lot you can do; you have to erase the entire array and start over from scratch, which is a real drag. ZFS avoids this by actually checksumming the data, seeing which drive is good and which drive is bad, and fixing it on the spot, or marking that sector as a problem and remapping it or doing whatever else it has to.
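Here is a toy Python sketch of that self-healing mirror read (my own model, not ZFS code: the checksum and mirror_read helpers are made up, SHA-256 stands in for ZFS's block checksums such as fletcher4 or sha256, and in real ZFS the checksum lives in the parent block pointer rather than next to the data):

```python
import hashlib

# Toy model of a self-healing read from a two-way mirror (illustration only).
def checksum(block: bytes) -> bytes:
    return hashlib.sha256(block).digest()

def mirror_read(copies: list[bytearray], expected: bytes) -> bytes:
    """Return a copy whose checksum matches, repairing any copy that doesn't."""
    good = next(bytes(c) for c in copies if checksum(bytes(c)) == expected)
    for c in copies:
        if checksum(bytes(c)) != expected:
            c[:] = good              # silently rewrite the bad copy with known-good data
    return good

block = b"important data" * 100
stored_checksum = checksum(block)    # kept with the metadata, not with the block itself
disk_a = bytearray(block)
disk_b = bytearray(block)
disk_b[0] ^= 0xFF                    # silent corruption on one side of the mirror

data = mirror_read([disk_a, disk_b], stored_checksum)
assert data == block and bytes(disk_b) == block   # the read succeeded and the bad copy was healed
```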
You can also do a preventative scrub of your data, not necessarily during production hours, because a scrub is fairly intensive. During a scrub, ZFS reads all the data back from each drive, verifies that everything is correct, and proactively checks for errors; if it finds any, it rewrites the data, because thanks to the checksums it can tell what the correct data should be.

So ZFS has a lot of features like this. Because you have dynamic stripe width, you're reducing the cost of having to pull back lots of data, read it, update it, and write it back; you're getting rid of your write amplification. You've got checksumming of individual blocks of data. You've got a choice of topologies that will do pretty much whatever you want, though you have to be careful to set that up yourself and put some forethought into it.

Then there are the other things you can add. ZFS by default uses system RAM as a cache for data, the ARC. It's an adaptive cache, which means it will cache the data that is most frequently used, even if it's not the most recently used. You can add a second level of cache, the level-two ARC (adaptive replacement cache): your L2ARC can be an SSD or another fast device, and now you've got a second tier of read cache. You can also use an SSD as a separate log device, or SLOG. This holds the ZFS intent log, which is similar to a battery-backed cache on a RAID controller: typically that cache accumulates writes and has a battery, so if you have a power outage before it can write the data out to disk, the battery keeps the cache alive, and when the power comes back on, the first thing it does is write out anything that was still pending. ZFS does much the same thing, except instead of a battery-backed cache it has the ZFS intent log, a record of what it intends to write to disk. Inbound (synchronous) writes get logged to the ZIL, and when ZFS has gathered enough data to make a transaction group, about every five seconds, or whenever there's enough data to make the write worthwhile, it writes it out to the disks. There are other devices you can add too: there are metadata-only devices and special devices, which I honestly haven't used yet, so I can't tell you a whole lot about them.

This kind of setup can be frustrating for traditionalists who say it breaks the layered architecture: disks should be separate from RAID, which should be separate from the file system, so that everything is self-contained. I get that, I do. But in this case there are also a lot of benefits that come from your file system being able to say: these are small, separate files, they shouldn't have to go into a 128-kilobyte stripe, that doesn't make sense; or these files are huge, so a 128-kilobyte stripe doesn't make sense either, use a bigger one. Your file system can tell you that. Your disks can tell your RAID layer: yes, I'm the one returning bad data, fix me. And the system can go: okay, here's a fresh copy of the data, save that, thank you very much. There are a lot of useful and powerful features that come with ZFS.

I have run out of time for today. I was going to tell you about datasets and zvols and all the other fun stuff you can do with those, but for now we're going to leave it at that: just a discussion of the physical setup and some of the features ZFS has to increase performance.
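Looping back to the transaction groups mentioned a moment ago, here is a rough Python sketch of that write batching (my own illustration, not ZFS code: the TxgBatcher class and its thresholds are made up, the roughly five-second flush interval matches the classic default, and in real ZFS synchronous writes are additionally logged to the ZIL so they survive a crash before the group is flushed):

```python
import time

# Toy model of batching writes into transaction groups (illustration only).
TXG_INTERVAL = 5.0            # flush at least every ~5 seconds
TXG_MAX_BYTES = 64 * 1024**2  # ...or sooner, once enough dirty data has accumulated

class TxgBatcher:
    def __init__(self):
        self.pending: list[bytes] = []
        self.pending_bytes = 0
        self.last_flush = time.monotonic()

    def write(self, data: bytes) -> None:
        # Writes land in RAM first (and, if synchronous, are also logged to the ZIL / SLOG).
        self.pending.append(data)
        self.pending_bytes += len(data)
        if (self.pending_bytes >= TXG_MAX_BYTES
                or time.monotonic() - self.last_flush >= TXG_INTERVAL):
            self.flush()

    def flush(self) -> None:
        # One large write to the vdevs instead of many tiny ones.
        print(f"flushing txg: {len(self.pending)} writes, {self.pending_bytes} bytes")
        self.pending.clear()
        self.pending_bytes = 0
        self.last_flush = time.monotonic()

batcher = TxgBatcher()
batcher.write(b"x" * TXG_MAX_BYTES)   # big enough to trigger an immediate flush
```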
Like I said, that's it for now. Thank you very much for watching. I am your Next Door NetAdmin, and we'll see you next time.
Info
Channel: NextDoorNetAdmin
Views: 42
Id: LM-A-RLCZ9U
Length: 25min 35sec (1535 seconds)
Published: Mon Jun 17 2024