Hardware Raid is Dead and is a Bad Idea in 2022

Reddit Comments

He mentions that with BTRFS you can set a RAID policy on a per-folder basis, but I don't believe this is true. From what I understand there was talk at one point about making RAID policies settable per subvolume, but that never came to fruition. What you can do is set a different RAID policy for data and metadata, but that's not really the same thing.
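For reference, the data/metadata split described above is set with the profile flags at filesystem creation time (or later with a balance). A minimal sketch, wrapped in Python only for illustration; the device paths and mount point are placeholders:

    import subprocess

    devices = ["/dev/sdb", "/dev/sdc", "/dev/sdd"]   # placeholder member devices

    # Different policies for data vs. metadata at mkfs time:
    # data striped with one parity (raid5), metadata mirrored (raid1).
    subprocess.run(["mkfs.btrfs", "-d", "raid5", "-m", "raid1", *devices], check=True)

    # The profiles can be converted later with a balance, e.g. data to raid1:
    subprocess.run(["btrfs", "balance", "start", "-dconvert=raid1", "/mnt/pool"], check=True)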

The video has some great information in it regardless.

👍 21 · u/CopOnTheRun · Apr 07 2022

I am glad he did it on his main channel. I have been sending people the link to his 7-year-old video on his third channel, which is a pain to find in the "algorithm" even when I know what to search for.

https://youtu.be/yAuEgepZG_8

👍 38 · u/zrgardne · Apr 07 2022

Beyond RAID 1/10 for hypervisors, hardware RAID has been dead for quite a few years already.

👍 59 · u/cruzaderNO · Apr 07 2022

I'm a little split on this. On one hand, ZFS is great, and the ability to detect and recover from bitrot is hard to get unless you're paying lots of money for a SAN or specific hardware RAID cards. So I can't overstate how much I agree with him that the resiliency of file systems like ZFS is a large benefit.

On the other hand, it's not quite as dire as Wendell makes out (hardware RAID is definitely not dead!). HDDs and SSDs both have error-correcting codes as part of each sector (separate from the 520-byte sectors that RAID cards can store checksums in). So in most cases bitrot is going to be caught by the drive itself and appear as a read error to the hardware controller, allowing it to recover the data from parity. The case of the hardware RAID controller getting bad data from the disk is therefore quite rare, and hardware RAID will do a decent job of avoiding bitrot.

Also in my professional life I've never actually seen a problem that can be directly attributed to bitrot. Though this is likely just a function of how much data you deal with. Most large things that I've worked with are either SAN based, or use a replicated storage system with its own checksum and recovery. And to be honest I'd always prefer moving that check-summing and recovery as close to the application as possible.

👍 33 · u/teeweehoo · Apr 07 2022

0:11 - "Support for it has gone away at a hardware level a long time ago"
No idea what he's on about. Modern HW RAID cards can do SATA/SAS and even NVMe (called TriMode) and work fine on Ice Lake and Milan based servers.
They also now measure in the hundreds of thousands to millions of IOPS.
What does "gone away" even mean in this context?

1:45 - What he's talking about is NOT HW RAID. It's SW RAID with a HW Offload to an Nvidia GPU. Nowhere near the same tech as LSI/Broadcom or Microchip/SmartROC cards. (The 2 biggest vendors folks like HPE, Dell, Cisco and Lenovo use)

12:15 - Battery backed cache. Actually Batteries are still offered, along with a hybrid SuperCap/Battery module as well. But many controllers now include some NAND (SSD basically) on the Controller or Cache module and the Battery/SuperCap only needs to provide power long enough to dump the RAM cache to the NAND and do a CRC check, then the card powers down. At this point the server can remain un-powered for days or weeks. When the server powers back on, the NAND is checked and any data found is pulled back to cache, CRC checked again, and then flushed to the drives before the OS has even had a chance to boot.

20:00 - PLP - hahaha, no. It's got a DRAM based cache and the PLP is to protect the data in flight in the DRAM so it can be written to the NAND before the card loses power.
But Casper, how can you be sure? Wendell is SOOO much smarter than you.
https://www.samsung.com/us/business/computing/memory-storage/enterprise-solid-state-drives/983-dct-1-9tb-mz-1lb1t9ne/
"to provide enough time to transfer the cached data in the DRAM to the flash memory"
Gee, maybe because it's on the damn website for the SSD...

While I agree that old school HW RAID isn't a viable alternative for large Enterprise systems anymore, he glosses right over the fact that this is not what most people USE HW RAID for anymore.
Smaller deployments, Edge or simple RAID 1 Boot drives for example, are FAR and away the majority in the Enterprise.
Large data pools are either Storage Arrays, Software Defined Storage, or giant PB scale Distributed storage systems like Ceph, Qumulo or Scality Ring.
And those Storage Arrays like NetApp, Nimble, 3PAR and others often ARE doing the RAID and storage management in a CPU in SW, and some have a HW offload accelerator as well.

Videos like this are why I personally can't stand L1Techs and LTT.
They come off so smug and don't leave room for any viewpoint other than their own.

Find me a big company like Coke or Disney or Bank of America that is using ZFS.
I'd bet <5% of them touch it.
Yet folks like L1 and LTT think it's the be-all, end-all of data storage.

👍 14 · u/Casper042 · Apr 07 2022
Captions
There's no end of high-end hardware, and maybe high-end thinking, here at Level One, and I'll tell you right now: hardware RAID is basically dead for the high-end, high-speed, 64-cores-of-awesome-goodness crowd. Support for it went away at a hardware level a long time ago, and so the very definition of what RAID means has evolved and changed somewhat, probably not for the better. If you expect RAID to help prevent data corruption, you've missed the boat on what RAID has become. In the past, RAID meant that you had both physical device redundancy in case of failure and the ability to detect and correct data errors, even errors not reported by the drives themselves. Correcting errors and bitrot is important, make no mistake, but almost all modern hardware and software RAID solutions rely on the drives themselves to honestly report errors. That means many types of RAID arrays, especially hardware-assist or pure hardware RAID arrays, do not actually verify that the data they are giving you is the same data you originally gave them for storage, and that is madness, because it's an opportunity for bitrot to set in. Some things that should not have been forgotten were lost, and we're going to talk a little more about that.

So where this got started was that I got a chance to play with the SupremeRAID SR-1000. For me, the marketing of that thing evokes a sense of yesteryear, when software RAID was really bad, slow and unreliable, and hardware RAID cards were just magical: they went with mechanical hard drives, they abstracted away all the complexity, everything just worked, and they handled errors really well. The SR-1000 is a card that promises to offload parity calculations, and it does perform impressively. I like what it promises to do, but I really don't like the implementation. You've got to go fast, but if you're going to go fast you've got to understand the dangers you're getting yourself into, and bitrot is not one of the use cases this thing covers. The promise here is that if you have an array of multiple NVMe drives, you can give up the capacity of one or two of them for redundancy; that way any one of them can fail and things will keep right on working. Now, it does support RAID 0 and RAID 1; both of those offer speed, and RAID 1 will do mirroring, but there's no parity calculation involved in that, it's just mirroring, so it's not really offloading a lot in that case. What's interesting is the parity calculation around RAID 5 and RAID 6.
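As a reminder of what a card like this is actually offloading, RAID 5 parity is a per-stripe XOR across the data chunks: lose any one chunk and XOR-ing the survivors with the parity recovers it. A toy sketch in Python, purely illustrative and not how the SR-1000 is implemented:

    from functools import reduce

    def xor_blocks(blocks):
        """Byte-wise XOR of equal-length blocks."""
        return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

    # Four data chunks in one stripe of a five-drive RAID 5.
    data = [bytes([i]) * 8 for i in (1, 2, 3, 4)]
    parity = xor_blocks(data)              # this is what gets written to the fifth drive

    # Drive holding data[1] dies: rebuild its chunk from the survivors plus parity.
    rebuilt = xor_blocks([data[0], data[2], data[3], parity])
    assert rebuilt == data[1]

    # Note: this only recovers a chunk that is *known* to be missing or bad.
    # Nothing here identifies a chunk that is silently wrong.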
Well, how is it implemented? With a not-super-modified Nvidia GPU, the Nvidia T1000. Theoretically that does the RAID parity calculations rather than the CPU, but this is a PCI Express 3.0 x16 card, and that means the absolute maximum it can move is 16 gigabytes per second; that's the max bandwidth of a PCIe 3.0 x16 card. And yet in their press release they're talking about 100 gigabytes per second in reads. That seems very fast. If the PCIe bandwidth is 16 gigabytes per second but you can read at 100-plus gigabytes per second, how does it verify that the data it's returning to you is the data you gave it, that the data it's returning matches the checksum and parity that's theoretically stored somewhere? Well, there are no connections on the card, it's just a PCIe card, so it must happen over the PCIe bus, right? Nope. Turns out it doesn't happen at all. It doesn't verify crap. I checked by injecting a tiny bit of corruption, and it doesn't detect it. Their website says SupremeRAID shoulders all the I/O processing and RAID computation burden. The first part can't be true, because there's not physically enough I/O bandwidth to carry all of the array's data through the card, just the parity data; you'd max out at 16 gigabytes per second, and it's obviously not doing that if the array can be faster than the physical interface of the card. So the explanation is that it doesn't check the parity unless a drive is totally missing or an NVMe drive itself actually reports an error.

This may be a little disturbing, because a lot of IT pros still have the expectation that hardware RAID controllers, especially ones that bill themselves as enterprise class, can detect and correct errors on the fly, not just handle the case where a whole drive has gone offline or a drive is reporting an error. This thing, the SR-1000, cannot detect silent-corruption errors to save its life. Search the manual for "verify" or "patrol" or "integrity" or "consistency" or "scrub": it doesn't have a way to initiate a scrub from the user side of things. I'm told they're working on that, but there's nothing in the manual that indicates it can scan a volume for inconsistencies. The website also mocks the RAID 5 write hole of traditional cards: to maintain data integrity of the cache in the event of a power loss, a battery backup is required, and the card will suffer a huge performance drop if the battery is exhausted or the module is full, because it has to switch to write-through mode to preserve data integrity. Well, guess what, GRAID: you want to be a little careful throwing stones, because this is a device that also lives in a glass house. It looks to me like, in the case of the RAID 5 write hole, this card is just keeping performance high by casting your data into the void in the scenarios where the write hole might affect you, because I can easily introduce corruption, and in order to detect corruption it has to do a full rescan. We did it, team. But I'm picking on GRAID when really this also affects a lot of enterprise solutions, even Linux md. Linux md RAID actually has the partial parity log, which helps close the RAID 5 write hole and keep the array consistent, but even if these folks did that, there would be extra writes, which cost performance. So Linux md is a little safer in that regard, but Linux md also doesn't verify the parity unless a drive self-reports an error.
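Backing up to the bandwidth argument for a moment, it is worth spelling out the arithmetic. Roughly, and ignoring everything beyond line encoding:

    lane_gbps = 8 * (128 / 130)            # PCIe 3.0: 8 GT/s per lane, 128b/130b encoding
    slot_gb_per_s = 16 * lane_gbps / 8     # x16 slot, one direction: about 15.75 GB/s

    claimed_read_gb_per_s = 100            # figure quoted from the press release

    print(f"PCIe 3.0 x16 ceiling : ~{slot_gb_per_s:.2f} GB/s")
    print(f"claimed array reads  : {claimed_read_gb_per_s} GB/s")
    # If array reads exceed the slot ceiling, the data cannot be flowing through the
    # card; at most the parity math is, so the card cannot be verifying what you read.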
So the marketers would gaslight you into believing this was always the case, and that no RAID has ever been responsible for bitrot and silent corruption. But come with me, step back in time for just a moment, and we'll take a look at old-school RAID. Our story really starts in the 1980s, but we can fast forward a bit, to 2002. Enterprise RAID 5 in 2002 would absolutely detect and correct silent corruption, using the parity data to figure out what was wrong. So what made RAID 5 work better in 2002 than in 2022? Well, let's think it through. If you had five drives in a RAID 5, and you fetched four chunks of data and one chunk of parity from those drives, but the parity is inconsistent, who's lying? Which drive is lying to you? How do you know which one is inconsistent? Is it a block of data that's wrong, or is it the parity that's wrong? The answer in 2002 was that there was more information than just the parity. You see, the sector size of the hard drives was also larger: eight bytes larger.

Let's go on an aside for a second and talk about sector size. A sector is the smallest parcel of information a drive will work with; you can't just write one byte on a hard drive, it's a sector at a time. Going back more than 30 years, hard drives could operate in one of two modes: formatted with 512-byte sectors, or formatted with 520-byte sectors. Now, it is true that some hard drives might not support 520-byte sectors, but you'd better believe an enterprise-grade hard drive absolutely could. The RAID controller itself would store a checksum of the sector's 512 bytes of data in the extra eight bytes: 512 bytes of data, then a checksum in the last eight bytes, and the RAID controller was responsible for that. So when you created a RAID array, the hardware RAID controller would create the volume on the 520-byte drives but present it as a 512-byte volume to the operating system, because most operating systems don't know how to handle anything except 512-byte sectors. Now, 8 bytes is not enough for error correction, but in our same five-drive RAID 5 example, the controller has extra bytes in every sector on every drive to see which drive is lying. It doesn't rely on the drive to tell it that it's lying. The other reason it worked like this is performance: if you stored 512 bytes in one place and the extra checksum somewhere else, that's two mechanical hard drive seeks, and hard drive seeks are measured in milliseconds; that's just too much overhead. Making the sector a little larger, from 512 bytes to 520 bytes, had much less of a performance impact on the read operation. You could also use a really complicated parity scheme where the parity and data blocks are much larger than the sector size, usually to the tune of half a megabyte or a megabyte, but you get a lot of overhead with blocks that big; it's not great for performance, and it's computationally kind of expensive.

So why not 520-byte sectors today? Well, drive vendors charged a lot for the 520-byte sector option, especially toward the end of that era. On top of that, 512 bytes is also too small in 2022; almost all hard drives from the last five years use four-kilobyte sectors, that is 4096 bytes, eight times 512. Here is a SanDisk two-terabyte flash drive from 2017. This is enterprise grade, and guess what: 520-byte sectors on it. It's built for the enterprise.
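A toy model of the 520-byte-sector scheme described above, assuming a CRC32 in the controller's extra 8 bytes (real formats varied by vendor): the controller tags each 512-byte sector with its own checksum, so on a parity mismatch it can tell which member is lying without trusting the drives.

    import struct
    import zlib

    SECTOR = 512

    def write_520(data_512: bytes) -> bytes:
        """Controller-side: 512 bytes of data plus an 8-byte checksum tag."""
        tag = struct.pack(">Q", zlib.crc32(data_512))   # CRC32, zero-padded to 8 bytes
        return data_512 + tag

    def is_lying(sector_520: bytes) -> bool:
        """True if the stored data no longer matches the controller's checksum."""
        data, tag = sector_520[:SECTOR], sector_520[SECTOR:]
        return struct.unpack(">Q", tag)[0] != zlib.crc32(data)

    stripe = [write_520(bytes([d]) * SECTOR) for d in range(5)]     # one 5-drive stripe
    # Silently corrupt drive 3's data without touching its checksum tag...
    stripe[3] = bytes([0xFF]) * SECTOR + stripe[3][SECTOR:]
    # ...and the controller can still single out the liar, then rebuild it from parity.
    print([is_lying(s) for s in stripe])    # [False, False, False, True, False]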
And it's a dying breed. Anyway, back to 2022. Almost all hardware RAID cards today, whether you have 512-byte or 4K sectors, don't do any parity checking unless the drive itself reports an error. It is almost unicorn territory to find a hardware RAID controller that will use the parity information without one of the drives speaking up and saying, hey, I've got a problem. Well, enter RAID 6: two parity drives. With two parity drives you are storing parity information in two places; it's kind of like the separate checksum information in the large sector, and it's also computationally less expensive, so you can figure out who is lying, as long as only one drive is lying, of course. The downside with RAID 6 is that it takes another drive of redundancy: if I have five drives in RAID 5, I have four drives' worth of capacity; with RAID 6 I only have three drives' worth.

What about the GRAID SR-1000? Well, it turns out that with RAID 6 it doesn't check the parity either, unless a drive reports a problem. Of course, neither does Linux md RAID, or at least it didn't initially, and no one has updated the man page, so I think it still doesn't. How did I test all this? I introduced some minimal corruption. In the case of our SanDisk two-terabyte flash drives from 1273 AD, I updated the data but did not update the checksum. The LSI hardware RAID controller immediately flagged an inconsistency; it took a while to do the I/O, but it fetched the parity and reconstructed the data, because it knew that the block I messed with was bad, and I took care to mess with a data block: I injected gandalf.txt into one of the drives in the RAID 5. So it looks to me like the GRAID is only checking parity when the drive itself actually reports an error, and the drive doesn't report an error if it doesn't think there is one, so my injecting an error doesn't register the way it did with our SanDisk 520-byte-sector drives. What does that mean in the real world? It means there is nothing going on in this RAID controller to check and verify that the data you gave it is the data it's giving back to you. GRAID never complained or noticed the inconsistency, and it didn't matter whether I used RAID 5 or RAID 6.
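The test methodology above boils down to flipping bytes on one member underneath the array and then reading back through the array to see if anything notices. A hedged sketch of that idea against a throwaway image file; the path and offset are made up, and this should obviously never be pointed at a device holding real data:

    import os

    MEMBER = "/tmp/fake_member0.img"   # stand-in for one array member; made-up path
    OFFSET = 1024 * 1024               # somewhere inside file data, not array metadata

    def inject_corruption(path: str, offset: int, length: int = 16) -> None:
        """Flip bytes in place on one member, simulating silent bitrot."""
        with open(path, "r+b") as f:
            f.seek(offset)
            original = f.read(length)
            f.seek(offset)
            f.write(bytes(b ^ 0xFF for b in original))

    inject_corruption(MEMBER, OFFSET)
    # Now re-read the same region *through the array*:
    #  - a verifying stack (RAIDZ, Btrfs, a checksumming controller) flags or repairs it;
    #  - a trust-the-drive stack hands back the corrupted bytes without complaint,
    #    because the member never reported an error.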
Now, RAID 5, like I say, could maybe be forgiven for the reasons I explained, but RAID 6? Remember, I mentioned that the GRAID website seems to make fun of the whole RAID 5 battery thing. Well, with modern controllers it's not a battery, it's a supercapacitor, but okay, that's fine. What is happening on the hardware RAID controller in that server over there is that it's keeping a journal in memory of all the stuff it was doing, all the writes and everything else. If it loses power or crashes, the little battery or the supercapacitor preserves that journal of everything it was doing for the last few seconds. When it comes back on, it doesn't immediately clear the RAM; it checks the RAM, checks the last few things it was doing, and then checks all the drives to see how far they got. After an unclean shutdown, the RAID controller can look at what was being written, since the data is still preserved in RAM, and make sure everything is perfectly consistent. That closes the RAID 5 write hole.

Now, Linux md, the software RAID on Linux, doesn't store a write log like that by default. It can, on a non-volatile device, but nobody ever does that, so it has something else called the partial parity log: it writes out to the partial parity log what it's about to do, then it does it on all the drives, then it writes to the partial parity log that it finished. That slows things down, because it has to do all that extra writing, and even after an unclean shutdown it still fully rechecks everything. The GRAID card, meanwhile, will only do a full rescan or consistency check after a crash or power loss; it doesn't have anything like a partial parity log that I can find, and so its performance is better because it doesn't do any extra writing. The card with the battery or supercapacitor doesn't have to do a rescan at all; it does support consistency checks, to be sure, but it is able to keep itself consistent, so you save there. Sometimes RAID cards call that scan a patrol read or a consistency scan. On Linux md you can schedule a rescan so that it scans all the drives, makes sure everything is consistent with the parity, and issues any corrections. Oh, and Linux md does another really boneheaded thing here: if there's an error in the parity but no drive is reporting an error, it will rewrite the parity, even if it was the data that was incorrect. That seems boneheaded and wrong to me. And speaking of that, it's exactly what the GRAID card does too: in a RAID 6 configuration, if it encounters one piece of information that's wrong, it rewrites the parity and leaves the data alone, even if the data was the problem. What a mess.

It sounds like this is mostly directed at GRAID, but in reality it's just modern RAID, and this is why modern RAID is basically dead: it's a mess, and as a system administrator, especially on enterprise gear, you don't want to deal with it. You get the extra write overhead, and you want to maximize performance. You need a technology that combines the ideas of RAID with the particular complexities of a file system. The file system itself should be the thing that understands it's working with multiple physical drives, multiple NVMe drives, multiple whatever; it's probably a good idea to have the file system abstract away the storage, instead of a RAID card, and deal with it at the system level.
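The journal and partial-parity-log behaviour described above is easy to see in miniature: record intent on stable storage before touching a stripe, mark it closed afterward, and after a crash recheck only the stripes whose records were never closed. A conceptual Python sketch, not md's actual on-disk PPL format; the log path is a placeholder:

    import json
    import os

    LOG = "/tmp/ppl.log"    # hypothetical intent log kept on stable storage

    def log_record(stripe: int, state: str) -> None:
        """Append a record and force it to stable storage before proceeding."""
        with open(LOG, "a") as f:
            f.write(json.dumps({"stripe": stripe, "state": state}) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def write_stripe(stripe: int, do_write) -> None:
        log_record(stripe, "open")     # extra write #1: what we are about to touch
        do_write(stripe)               # the actual data and parity updates
        log_record(stripe, "closed")   # extra write #2: the stripe is consistent again

    def stripes_to_recheck() -> set:
        """After a crash, only stripes left 'open' need their parity recomputed."""
        pending = set()
        with open(LOG) as f:
            for line in f:
                rec = json.loads(line)
                if rec["state"] == "open":
                    pending.add(rec["stripe"])
                else:
                    pending.discard(rec["stripe"])
        return pending

The two forced log writes per stripe are exactly the overhead attributed to md's approach in the video, and skipping them is how a card can look faster on paper.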
The file system can journal writes, as file systems often do anyway, but that journaling can also be hardened against things like the RAID 5 write hole, because the file system can see that it's working with multiple different block devices. ZFS and RAIDZ do exactly that, and ZFS does it well; to a lesser extent, so does Btrfs. Both ZFS RAIDZ and Btrfs absolutely, positively do an integrity check on reads, and they will give you an I/O error if they're about to give you a file whose checksum differs from what it was when it was originally stored, which is great: it makes sure bitrot doesn't happen. I created the exact same testing scenario I used here, and RAIDZ caught it, passed with flying colors; it actually corrected the error, issued a message to the system log, and said, hey, here's your file, it's exactly what you gave me. Instead of even needing 520-byte sectors, the checksum and the location of the parity data are built right into the file system, which helps minimize the amount of extra I/O, unlike almost all RAID controllers. This setup also works for RAID 1, or mirroring; sometimes it's called ZFS mirroring, but it's mirroring, basically. Btrfs works pretty much the same way, although the RAID 5 and RAID 6 code in Btrfs is not quite as hardened.

Btrfs also has some other innovative ideas, which are sort of new and fun. If you have five drives and you want a file mirrored across all five drives, you can set a policy for that at the file level, and that's exactly what happens; if the checksum of one copy doesn't match, it's got four more mirrors. It's a good argument for system administrators making redundancy decisions at something finer than the entire-volume level. What I mean by that is, instead of taking a bunch of NVMe drives and deciding we're going to have one drive's worth of redundancy across the entire storage pool made up of all of them, you can just say: this folder should be mirrored across all five devices, this folder doesn't need any redundancy, this folder needs RAID 5 levels of redundancy, this folder needs RAID 6 levels of redundancy where it will tolerate one or two failures. You can even do RAID 10 and say, I need this to be as fast as possible. And you don't have to pre-allocate the entire storage device for a particular level of redundancy; you set it as a policy at the file or folder level, whatever makes sense. This is a new, modern, good idea, and even ZFS doesn't do it yet. I definitely think storage that lets you create a policy and have whatever level of replication and redundancy you want is probably the future. If we look at truly enterprise solutions, stuff like what NetApp has, for example, we see the same kind of policy-based redundancy, and by the way, NetApp does parity verification on read. Now, admittedly, parity verification on read is less common today than ever, even with enterprise-ish solutions from Dell and HP and others. I think GRAID would have been far more useful and interesting as a product that accelerates parity calculations in Linux, maybe even with ZFS; it's open source, and writing the modules to do that is entirely possible. Oh, and by the way, the SR-1000 doesn't have ECC memory on board, so the GPU parity calculation, which isn't even checked until you actually have a problem... gee whiz. There's no error correction on the card, so if it's introducing its own errors, you'd never catch it.
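The read path credited to ZFS and Btrfs above is conceptually simple: the checksum is stored with the metadata, every read is verified against it, and a copy that fails is rewritten from one that passes. A toy mirror in Python that assumes nothing about either file system's real on-disk layout:

    import hashlib

    class ToyMirror:
        """Two copies of every block, with the expected checksum kept in 'metadata'."""

        def __init__(self):
            self.copies = [{}, {}]     # two pretend devices
            self.checksums = {}        # block id -> expected digest

        def write(self, block_id: int, data: bytes) -> None:
            self.checksums[block_id] = hashlib.sha256(data).digest()
            for dev in self.copies:
                dev[block_id] = data

        def read(self, block_id: int) -> bytes:
            expected = self.checksums[block_id]
            for dev in self.copies:
                data = dev[block_id]
                if hashlib.sha256(data).digest() == expected:
                    # Self-heal: rewrite any sibling copy that no longer matches.
                    for other in self.copies:
                        if hashlib.sha256(other[block_id]).digest() != expected:
                            other[block_id] = data
                    return data
            raise IOError(f"block {block_id}: every copy fails its checksum")

    m = ToyMirror()
    m.write(0, b"gandalf.txt contents")
    m.copies[1][0] = b"bitrot happened here"        # silently corrupt one copy
    assert m.read(0) == b"gandalf.txt contents"     # verified read, bad copy healed
    assert m.copies[1][0] == b"gandalf.txt contents"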
I mean, it's a cool idea to use a GPU as a data accelerator, but PCIe 3.0, playing it a little fast and loose, and hardware RAID 5... I don't know. It's got one foot in the future and one foot in the past, because the reason there's no RAID in NVMe solutions is that it doesn't make sense anymore. I certainly don't trust an NVMe device to tell me if it's malfunctioning; what kind of crazy clown world is that? The best approach, really, is to handle the redundancy at the file system level, period, and do the integrity checking there. This isn't to say that an offload accelerator for some computation isn't a good idea, but ultimately this one doesn't provide data-integrity protection against bitrot, even in ideal scenarios, and that's not really a dig at GRAID; it's pretty much where RAID is now. RAID is just a checkbox for people who are looking for it but don't understand it. If you look into RAID 5 and RAID 6, you're going to be disappointed: even with Linux md RAID 6, if there's a silent inconsistency, it's going to clobber the data. So it's kind of marketing nonsense. Oh, and before I forget: RAID is not a backup. ZFS and Btrfs and approaches like that are going to be slower, because, guess what, they're doing more work; these other approaches are faster because they're doing less: they're not checking the data.

Now, NVMe. Check it out: this is a Samsung 983 NVMe drive in the M.2 format, so it's probably familiar to you. Look at the top of this card: it's mostly capacitors. Why the heck would they make an extra-tall, extra-long 110-millimeter M.2, when most of the ones you've probably seen are 80 millimeters? Why are there so many capacitors on top? This is power loss protection. This is the same thing as the supercapacitor in our RAID setup, because it turns out that to make flash memory go fast you've got to do a lot of parallel writes, so that whole RAID 5 write hole exists here, on this drive, too. There's a CPU here, and RAM; this thing is doing computation even though it's a self-contained device. There are a lot of crazy shenanigans going on in here, and if it loses power at the wrong moment, these power loss capacitors help ensure that, internally, it stays as consistent as it possibly can. If you build your NVMe array out of drives that don't have these, then chances are they're going to silently corrupt your data. So do you really think you can trust this to tell you whether the data is correct or not? My experience is no, you can't. But I do think it's interesting that enterprise-grade NVMe drives have lots and lots of power loss capacitors.

Does this matter for ZFS? Do you need power loss capacitors for ZFS? Well, it helps, but it's not critical, because ZFS has paranoid levels of paranoia about whether the data it's getting back from the block device is what it gave to the block device. It doesn't trust the block device as far as it can throw it, and that is a great philosophy. Now, ZFS might not be the fastest, and it might not have all the gee-whiz features you can get from truly enterprise solutions like the NetApp stuff, but you'll never get a file back from it that you didn't give to it in the first place, at least not without a lot of loud complaining from ZFS. And that, at the end of the day, is what is most important to me. I'm Wendell, this is Level One, and this has been a quick look at RAID in all its forms, some of its failures, and its evolution over time.
Info
Channel: Level1Techs
Views: 484,050
Keywords: technology, science, design, ux, computers, hardware, software, programming, level1, l1, level one
Id: l55GfAwa8RI
Length: 22min 18sec (1338 seconds)
Published: Tue Apr 05 2022