So if Hardware RAID is dead... then what?

Captions
Hardware RAID is dead. Okay, you might not have gotten the memo about that, so you should for sure check out my other video if you didn't see it. Today is more about: okay, if it isn't dead, it's certainly got to be on life support and saying its final goodbyes, but more about answering the question, what fills the void that's left behind? "I'm not dead yet. I don't want to go on the cart. I'm getting better. I think I'll go for a walk."

I was surprised how many of you wanted to convince me that hardware RAID wasn't dead, even when dealing with our Kioxia CM7. You realize these NVMe SSDs can do like 12 to 14 gigabytes per second individually? They're PCIe Gen 5. Let me walk you through my reasoning step by step, because clearly, I can tell you, it's not pining for the fjords. Not only is it dead: for the common arguments for why it shouldn't be dead, a single CM7 will take a lot of the wind out of your reasoning. "But what about the latency?" "It's all about the IOPS!" 2.7 million IOPS for a single drive, and this is NAND flash. It's nothing special. Okay, it is special, because Kioxia, but you know what I mean.

I think a lot of you are more worried about how to fill the void if it is dead than about whether it is dead, so let me take that worry out of the equation. There are actually several ways of filling that void that can be approached with what's available today, and we'll talk about that too. Sometimes a piece of software or an application is already designed to deal with an array of block devices directly, so you don't really need a RAID volume; that's less overhead than a RAID card would be anyway. So it's kind of a combination of hardware and software that could make more sense for some of these. I mean, Microsoft is working on their ReFS file system, which, like ZFS, wants to work directly with block devices and not go through a RAID controller. But before we get to that, let's talk about why it's tough to get block device performance matching even one or two of our Kioxia CM7s out of a hardware RAID card.

The answer, really, is how quickly the platform around hardware RAID is evolving. In a short time frame we've gone from a server CPU having 16 or 28 or 32 PCIe lanes to 128 PCIe lanes, and that's pretty much across the board, whether you're team red or team blue: 128 PCIe Gen 5 lanes. Now, we've got eight of these Kioxia CM7s, PCIe Gen 5, in the U.2 form factor, but they're available in E1.S, E3.S, and other form factors as you need them. Like I said, this Gen 5 SSD is capable of up to 14 gigabytes per second and 2.7 million IOPS; that's nearly saturating the bandwidth of four Gen 5 lanes. I mean, look at our CrystalDiskMark performance here; this is just a baseline to give you an idea. From the "hardware RAID is dead" video, a few of you could see it: yes, a hardware RAID controller will be a bottleneck for this top-line throughput number. But you say it's not about the throughput, it's about the IOPS. Well, even one of these drives is going to... okay, we'll take a look at that. There might be something to that. Not really, but we'll take a look. Just remember that one of these Kioxia drives moves as much information as a Gen 3 video card: yeah, 16 Gen 3 lanes, the same as four Gen 5 lanes.

That's not to say that random I/O, low-queue-depth performance isn't important. Actually, that is probably one of the most important things. And you are right that a RAID card with a large cache will improve those low-queue-depth I/O operations without letting some of these high-speed devices get bogged down with a lot of small requests.
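To put that lane napkin math in one place, here's a minimal sketch. The roughly 4 GB/s per Gen 5 lane and 2 GB/s per Gen 4 lane figures are rounded assumptions for illustration; the 14 GB/s and eight-drive figures are the ones quoted above.

```python
# Rough bandwidth napkin math for the lane argument above.
# Assumed figures: ~4 GB/s usable per PCIe Gen 5 lane, ~2 GB/s per Gen 4 lane.
GEN5_LANE_GBPS = 4.0
GEN4_LANE_GBPS = 2.0

drive_throughput_gbps = 14.0   # one CM7's top-line sequential number
drives = 8

total_drive_bw = drive_throughput_gbps * drives           # ~112 GB/s the drives can source
raid_card_uplink = 16 * GEN4_LANE_GBPS                    # ~32 GB/s for an x16 Gen 4 card
lanes_per_drive = drive_throughput_gbps / GEN5_LANE_GBPS  # ~3.5, nearly a full x4 Gen 5 link

print(f"Eight drives can source ~{total_drive_bw:.0f} GB/s")
print(f"An x16 Gen 4 RAID card can move ~{raid_card_uplink:.0f} GB/s to the host")
print(f"One CM7 uses ~{lanes_per_drive:.1f} Gen 5 lanes' worth of bandwidth")
```

Even with generous rounding, the card's uplink is the choke point long before the drives are.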
And if you think about it: you have a server hosting a bunch of virtual machines, whether they're virtual machines for workstations or a database server or whatever. Generally you're not running at 12 gigabytes per second, flat out, all the time; you just need a few hundred megabytes here and there of random I/O from tasks running in the background. So the reasoning, that a RAID card bottlenecking the top-line performance doesn't matter because it's really more about the I/O operations per second, is sound. But it all falls apart under scrutiny.

Really, when we're talking about IOPS we're talking about latency, and there is a relationship between the latency floor, the fastest time that you can get a request in and out, and the overall throughput. Latency effectively is lowered by the RAM on a RAID card when you can serve the data from the RAM cache on the card; the RAM on the card is going to be faster even than our CM7, and that makes sense. But most of the time you're going to get a cache miss, and that causes latency stacking: you get the latency of the cache miss plus the latency of the RAID card asking a CM7 for the block anyway. So generally it'll be worse. I mean, it's worse except when you get a cache hit, and most of the time you're not going to get a cache hit. In fact, NVMe is so fast that a lot of modern RAID controllers don't even cache reads; they just cache writes. And it does help with writes, but, well, one thing at a time.

It turns out that these NVMe devices are so fast they're driving changes in how the operating system handles block and I/O devices. Back in the day, a block device like this would use interrupts to notify the CPU: the CPU would say "here's a block of data, you need to do something," the block device would do something with that, and then it would send an interrupt to the CPU to say "hey, I'm done with that, give me the next thing." But these modern drives are so fast, and they complete those kinds of operations so quickly, that it's actually usually more efficient and faster to poll the drive rather than use interrupts. If you've been a systems administrator for a long time that sounds heretical, but this is the reality we find ourselves in: CPU cores are so fast, and the bus is so fast, and the drives are so fast, that the overhead of servicing that interrupt is considerably higher than just polling the drive. That's what the Linux kernel does by default now on most NVMe devices. These Gen 5 devices are so fast that putting the RAID card in the way hurts latency even before we get to thinking about the benchmarks.

Here's another way I could explain this. Whenever we're trying to characterize the performance of storage, we're concerned primarily with two things: the throughput, that top-line number in CrystalDiskMark, how fast the darn thing is, and latency. And we talk about latency a couple of different ways. Sometimes we say low queue depth, sometimes we say queue depth 1 performance, but really these are different ways of describing latency: how long does it take to get a single block? I'm going to ask the device for a single block of data; how fast can I get that single block back? Because real-world random I/O, a lot of what's happening on servers and workstations (less so on servers, more so on workstations), has to do with these low-queue-depth operations: you need to get a block from disk before you know which blocks you need next.
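Here's the latency-stacking arithmetic from a moment ago as a minimal sketch. The 80 microsecond drive read and the one-fifth-of-NVMe controller DRAM figure are illustrative assumptions, not CM7 or RAID-card measurements.

```python
# Rough expected-latency math for the latency-stacking point above.
# Both figures below are illustrative assumptions, not measurements.
nvme_read_us = 80.0                  # assumed QD1 random read latency straight from the drive
cache_hit_us = nvme_read_us * 0.2    # controller DRAM hit, taken as roughly 1/5 of NVMe latency

def expected_read_latency(hit_rate: float) -> float:
    # A miss pays the cache lookup AND the drive read: that's the stacking.
    miss_us = cache_hit_us + nvme_read_us
    return hit_rate * cache_hit_us + (1.0 - hit_rate) * miss_us

for hit_rate in (0.90, 0.50, 0.05):
    print(f"hit rate {hit_rate:.0%}: ~{expected_read_latency(hit_rate):.0f} us "
          f"vs {nvme_read_us:.0f} us going straight to the drive")
```

With a small cache in front of many terabytes of flash, realistic hit rates sit near the bottom of that range, which is why skipping read caching entirely tends to win.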
So, by way of an explainer, think of me as your block device; I'm going to fetch blocks for you. This is probably the easiest way I can help you build a mental model of this, if it's not clear. In our mental model, I can complete a request to read a block in superluminal time: it takes me zero time. You think of a block and I give you that block instantaneously. I do have to rest for a second after I give you the block, but to make this exercise simpler, the delivery itself is instantaneous. So you ask me for a block of data, I instantly hand it over, but before you ask me for the next block, I have to rest for one second. I'll also say that I can carry up to 100 blocks at a time. So if you know in advance which 100 blocks you're going to need, I can bring you all 100 blocks instantaneously and then rest. But if you don't know which 100 blocks you need ahead of time, then you're going to have to wait one second between each request.

So you come for your first block, and it's like, "well, I don't really know which blocks I'm going to need, but the blocks I'm going to need are listed in the data of block one, so go ahead and get me block one." I bring you block one instantaneously. There's block one. You look in it and you say, "oh, now I need block two." Well, I have to wait one second before I can fetch block two. I give you block two, and inside block two it says you're going to need block three. Okay, so I go get block three and hand it to you instantaneously, and then immediately you say, "oh, I need block four." So how long is it going to take me to give you 100 blocks? It's going to take about 100 seconds. But if you knew ahead of time exactly which blocks you needed, you could give me the full list of 100 and I could deliver them instantaneously.

This is the relationship between latency and throughput. If you can keep me fully loaded, if you can tell me which 100 things you need, then I can bring those to you instantaneously, 100 at a time. But if you can't keep me fully loaded, if you don't know ahead of time which blocks you're going to need, I'll only bring you one at a time, and that round-trip latency, making you wait one second between requests, is going to kill our throughput. It's potentially one one-hundredth as much.

Now suppose we're in queue depth 1 request mode, and the first block you asked me for was one, and the next one really was two, and the next one was three. I can look at that list of blocks, three or four seconds into this, and say, "I'm not doing anything anyway, I'm just going to go ahead and bring blocks two through 100." All of a sudden you come back to me and say, "hey, can you bring me block three?" I wasn't clairvoyant; I didn't know you needed block three. I just looked down and said, okay, you asked for one and two, I'm not doing anything anyway, I'm going to go ahead and bring blocks three through 100, because I can carry up to 100 at a time. And yeah, that worked out for us; that was instantaneous, dramatically faster. This kind of thing happens underneath, in individual NVMe devices, more than people realize. In the olden days that was something your RAID controller would do for you, and it still is to an extent: a RAID controller can figure out the access pattern and algorithmically guess and prefetch whatever blocks it thinks will be needed. But it's not a given, necessarily. There's a lot more of that kind of thing happening under the hood with modern storage devices than most people realize.
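Here's the courier analogy as a tiny simulation, just to show the arithmetic; the one-second rest and the 100-block capacity are the made-up numbers from the analogy, not properties of a real drive.

```python
# Toy model of the courier analogy: delivery is "instant", there is a one-second
# rest between trips, and each trip can carry up to 100 blocks.
CAPACITY = 100
REST_SECONDS = 1.0

def seconds_to_deliver(blocks_needed: int, blocks_known_per_request: int) -> float:
    """Time to deliver all blocks when only `blocks_known_per_request` can be
    batched into each trip (1 means pure pointer-chasing at queue depth 1)."""
    per_trip = min(blocks_known_per_request, CAPACITY)
    trips = -(-blocks_needed // per_trip)   # ceiling division
    return (trips - 1) * REST_SECONDS       # rests happen between trips

print(seconds_to_deliver(100, 1))    # 99.0 seconds, roughly the 100 seconds in the analogy
print(seconds_to_deliver(100, 100))  # 0.0 seconds: hand over the whole list, one trip
```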
And sometimes it's even happening at the operating system layer: the operating system looks at the pattern and says, "let's just go ahead and queue up all these requests." So the other thing to look at in benchmarks is how many queues you need to keep filled to get to our 2.7 million requests per second, certainly fewer than 32 parallel queues. And there is a mathematical relationship between queue depth 1 latency and the maximum number of I/O operations per second. Sometimes the max IOPS is lower than you'd expect from that, and that's usually a situation where the processor inside the storage device, the controller inside the NVMe drive, can't quite keep up with the queue depth 1 performance. Because in the real world it's not actually queue depth 1. I mean, even with a completely fair coin toss you're going to get six heads in a row sometimes; that happens even in purely random situations. In the real world, with server workloads and thousands of clients, it's not truly random; it looks random, but it really isn't, and there are little things you can do to predict the load.

So anyway, let's look at the performance of our hypothetical NVMe RAID controller, just doing napkin math here, and the napkin math is not going to work out for us. We've got four drives, just four. I want 40 gigabytes per second of throughput without taking on serious overhead, and I want my low-queue-depth performance overall to be basically the same, because it shouldn't get any worse, right? Well, 16 lanes of Gen 4 PCIe just isn't enough to keep up with four drives times four lanes of Gen 5 storage.

Some of you already conceded that on my "hardware RAID is dead" video. It's like, okay, but what about IOPS? The low latency and the caching can make up the difference, right? That'll create better overall real-world performance, in the way that a very low latency block device like Intel Optane has a better queue depth 1 experience, and so it'll feel faster than even the highest-end NAND flash based device. The latency floor of an Optane-type device, which is a different physical memory medium than NAND flash, is something like half what NAND flash is, so those low-queue-depth numbers are always going to be really good, even though the top-line number for raw throughput is not as good on Optane as it is on modern NAND flash devices.

So you might think caching is the solution. Well, your CM7 can do 10 gigabytes per second in its sleep, per drive; 32 GB of cache wouldn't be an unreasonable start on a hardware RAID controller. But a lot of the highest-end RAID controllers you can get, like from Broadcom, only have 8 GB of RAM, and that's it. And most of the time, algorithmically, that RAM is not used for read caching, because the reads are so fast anyway; it's there to speed up writes. The MegaRAID 9670 is one of the highest-end controllers you can get, and it does have its place, I'll admit; there are places where it can work. But you've only got 24 Gen 4 lanes to work with, and it's 16 Gen 4 lanes to the host. When I talk about that 8 GB as a write buffer, that's going to give us a second or two of low-latency, battery-protected memory, which can improve random write performance; that is true. A lot of the time, though, the cache machinery makes latency worse, so these controllers don't bother with read caching: putting something into the cache memory and then missing most of the time, unless the cache is absurdly huge, adds more latency than just asking the CM7 to begin with.
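And here is that queue-depth-versus-IOPS relationship as napkin math, a minimal sketch. The 80 microsecond queue depth 1 latency is an assumed, illustrative figure, not a CM7 spec.

```python
# Little's-law style napkin math: IOPS is roughly outstanding requests / per-request latency.
# The latency figure is an illustrative assumption, not a measured number.
def iops(outstanding_requests: int, latency_s: float) -> float:
    return outstanding_requests / latency_s

qd1_latency = 80e-6   # assume ~80 microseconds per QD1 random read

print(f"QD1:  {iops(1, qd1_latency):>12,.0f} IOPS")   # ~12,500
print(f"QD32: {iops(32, qd1_latency):>12,.0f} IOPS")  # ~400,000

# How many requests must stay in flight (spread across the drive's queues)
# to hit the 2.7 million IOPS headline number at that latency?
target_iops = 2_700_000
print(f"Outstanding requests needed: ~{target_iops * qd1_latency:.0f}")  # ~216
```

The headline IOPS number only shows up when something, whether the application, the OS, or prefetching, keeps that many requests in flight across the drive's queues.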
So more often than not, these RAID controllers simply will not cache reads, because the drives are already fast enough at reads and the cache on the card just isn't large enough. Wait, what? Yeah, it's clear from some of the comments that a fair number of you didn't realize that. When we're talking about enterprise solutions, truly enterprise solutions where you've got racks and racks and racks of stuff, they are, a lot of the time, using system memory as a read cache, not the cache on the RAID controller. The cache on the RAID controller is typically just used for transaction logging and write caching. I mean, the memory on a RAID controller is about 20 percent, one fifth, of the latency of NVMe. It can be really good, but it's not the wisest use of the cache on the card, because it's not that much faster, and the latency-stacking problem means that just having it there in the pipeline is not a great idea. For reads, you generally want to query the device directly, and that's why the design of these things has gone in that direction.

Now, theoretically, could you add more than one of these RAID cards to a system and have them work together? You've got 24 lanes per card, so say you add four of them to a system; that'll use 64 lanes and let you fully populate a 24-drive U.2 NVMe RAID chassis. Theoretically you can do PCIe-to-PCIe transfers without involving the CPU, but there's no RAID card I know of that does that. "I think I'll go for a walk." You're not fooling anyone. A hardware RAID device like this might make sense if you've got a backward operating system, or a weird edge case where you're looking to make ends meet on a RAID 1 or RAID 10 volume with like two or four drives. Okay, maybe. But I'm talking about these chassis where we've got 8, 12, 16, 24 four-lane NVMe drives at the front, and you want maximum performance and minimum latency. The NVMe performance here is just far outpacing any theoretical RAID controller that could handle it. It's not just that it's impractical because PCIe slots are only 16 lanes wide; it's that you won't win on latency versus a direct-attached-storage scenario.

Fortunately, the void this creates is already filled for a lot of common use cases. One of the most common use cases where the solution itself is adapting is databases: high-volume, low-latency databases. A lot of database software has gotten new engineering in the last couple of years to spread the database data out over physical volumes. It's basically RAID at an application level. And it's neat, because you can mix concepts like RAID 10 for hot data and RAID 5 for economy of space, even though most database systems are moving away from having any kind of system-level storage redundancy, instead opting for a clustered solution to get redundancy back, meaning you have multiple database servers, and if one little thing is wrong with a database server, the whole server falls out of the cluster. But you also have companies like Microsoft pushing ReFS and solving the problem, again, kind of at an application slash operating system level. There's also ZFS and btrfs. ZFS isn't quite there in terms of efficiency, but it's definitely there in terms of redundancy and failover, and it doesn't need anything special in the way of a RAID controller. And on Linux there's mdadm, Multiple Devices, Linux's software RAID, and that actually is a really competent solution.
It turns out the biggest problem with software RAID is that it's hard to guarantee writes are synchronized across the physical devices we like to abstract away. It's like, I've got a bunch of physical devices and I just want one big logical block pool of storage. Hardware RAID cards have a battery backup unit, or a large capacitor, to keep a little bit of data in memory so that when the system comes back on after a crash or power loss, the drives are all consistent with each other, because you can't be guaranteed that they all lost power at precisely the same moment. Linux MD cleverly approaches this problem without needing a battery backup unit to keep the system consistent, and it really works best when the RAID layer has some insight into the file system that's running on it. Linux MD has some clever approaches to dealing with this, but the main disconnect in closing the write hole is that, without a guaranteed consistent log of what's been written, the RAID system kind of needs to know what's going on at the file system level in order to ensure the file system itself is consistent after a crash.

Now, Intel, for their part, have contributed much of the engineering development on Linux MD, including some of the approaches they've patented for closing the RAID write hole without using a battery backup, and those approaches are part of Intel's Virtual RAID on CPU, or VROC, enhancements to Linux MD. For file systems like ZFS and ReFS, and even to an extent btrfs, the RAID 5 write hole isn't really a problem, because those file systems themselves have been engineered with multiple physical block devices in mind; the file system takes care of pooling a whole bunch of devices and solving the RAID 5 write hole itself.

So the only thing you might bring up, if you're still on team Not Dead Yet, is CPU overhead. Doesn't it take a lot more CPU overhead to do the RAID on the CPU as opposed to an add-in card? Well, yes, but also no. To be clear, we're talking about CPU overhead for parity calculations only; the CPU overhead for moving ginormous blocks of data around is almost nonexistent in modern operating systems. But for parity calculations there are some interesting approaches out there to fill the void, including companies like GRAID using a GPU to do RAID parity calculations. It might be cool, but it still relies heavily on both the CPU and the GPU to do the work; there's a hardware part of that RAID solution, but it's not a hardware RAID solution, if that makes sense. Linux MD also lets you reserve cores for parity calculations, or limit the number of cores doing them, which is super handy, and parity is basically not a performance bottleneck on modern systems because of things like AVX-512 and AMX acceleration. In a modern multicore system, if it really is a concern, you can choose a lower-overhead RAID geometry or just buy a CPU with more cores.

By way of a tangent: Linux MD, and all the software solutions I'm aware of, depend on the underlying devices to actually report read errors or read inconsistencies. That's of course not true of ZFS and btrfs and ReFS, but you can put that functionality back into Linux MD with dm-integrity. The NVMe specification also has a data integrity field; I talked a little bit about the older version of that, on mechanical spinning rust, in my "hardware RAID is dead" video.
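Since parity overhead came up just above, here's a minimal sketch of the XOR math behind RAID 4/5-style parity, purely to show how small the per-stripe computation is; real implementations do this over large stripes with AVX-512-class SIMD, not byte-at-a-time Python.

```python
# Minimal XOR parity demo, the core arithmetic behind RAID 4/5 parity.
# Real arrays do this per stripe with SIMD; this is just the idea.
from functools import reduce

def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-length blocks together byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

data_blocks = [b"AAAA", b"BBBB", b"CCCC"]   # three data drives' worth of one stripe
parity = xor_blocks(data_blocks)            # what would live on the parity drive

# Lose any one data block and it can be rebuilt from the survivors plus parity:
rebuilt = xor_blocks([data_blocks[0], data_blocks[2], parity])
assert rebuilt == data_blocks[1]
```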
The data integrity field is kind of making a comeback, though, to give us more knobs and tunables and more facilities for detecting errors on read that aren't detected by the underlying devices, in a relatively low-overhead kind of way, which is encouraging to see. If verifying information on read is important to you, you should be aware of those caveats, because there's not a lot out there that does verify-on-read except ZFS and ReFS and btrfs and things designed with these problems in mind. So again, there are different ways to fill the void; it's not just one thing that drops in as the hardware RAID replacement. You've got options.

Speaking of things that plug in, a third thing that's nice to have is PCIe hot plug, and support for this varies from platform to platform. Theoretically, if your NVMe plugs into a controller, then the host doesn't have to support PCIe hot plug, because you're plugging and unplugging a device from a controller. That's really nice; it solved a lot of problems in the olden days with normal RAID controllers. You know, 48 PCIe lanes to 24 drives, let alone 96. I was quite surprised to learn how this works with Intel VROC under the hood. If you haven't heard of it, VROC is Intel's Virtual RAID on CPU solution. It is software RAID, mostly, but I think it's probably more accurate to call it hybrid RAID, because there are hardware features to it; it's mostly software. It's available on Intel Xeon platforms, and it is a separately licensed feature, meaning you need a key to unlock it, whether it's on workstation Xeon or server Xeon. When you enable VROC, it reconfigures the PCIe root bridges in the processor to change how the NVMe slots connect; it changes the bus numbering and some other root port parameters to better support hot plug and a huge number of removable NVMe devices. That's pretty awesome if you use port multipliers or anything like that, so Intel has done some engineering here.

VROC also enables out-of-band management: you can manage your Linux MD array from the remote IPMI system on your motherboard, and add or remove drives, take them offline, get logs. Yeah, VROC will let you do that. There's actually a separate mailbox device that gets set up when you create the Linux MD array the VROC way, and that enables a communication layer between the VROC driver and the out-of-band management, which is really pretty awesome if you dig into it. A lot of the Linux MD programmers are actually Intel programmers, so Intel is responsible for a lot of the development on Linux MD, it seems, even though these features do require a license unlock, of course. Most of it is present in plain Linux MD; you don't need a license unlock to use it, it's software RAID, and you're going to get the software RAID performance. But if you do have VROC you can get a small performance bump, and a VROC volume works the same on Windows as it does on Linux, which means you can dual boot and mount the same volume. And there is no faster option with these Kioxia CM7s, with native Gen 5 support at Gen 5 throughput and latency, on Windows, than Intel's VROC solution. Setting up these arrays with eight of these CM7s really couldn't be easier, and even under Windows it is breathtakingly fast. So this is CrystalDiskMark, but I really ramped up the number of queues and threads, because, well, this array can do 40 gigabytes per second with just four CM7 drives. This is VROC on Windows. Wait a minute, is he saying that VROC is our mythological future for RAID with high-speed NVMe devices? Yeah, Intel engineers figured it out a long time ago.
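For a flavor of what setting up an array "the VROC way" looks like on Linux, here is a hypothetical sketch of the mdadm IMSM container workflow, wrapped in Python so it stays self-contained. The device paths, array names, and RAID level are placeholders; your platform's VROC documentation is the real reference, not this sketch.

```python
# Hypothetical sketch: a VROC-style (IMSM metadata) array built with Linux MD.
# Device paths and array names are placeholders; flags follow mdadm's documented
# IMSM container workflow, but verify against your platform's VROC docs first.
import subprocess

nvme_devices = ["/dev/nvme0n1", "/dev/nvme1n1", "/dev/nvme2n1", "/dev/nvme3n1"]

# Step 1: create an IMSM container that groups the member drives.
subprocess.run(
    ["mdadm", "--create", "/dev/md/imsm0", "--metadata=imsm",
     f"--raid-devices={len(nvme_devices)}", *nvme_devices],
    check=True,
)

# Step 2: carve a RAID volume out of the container (RAID 5 purely as an example).
subprocess.run(
    ["mdadm", "--create", "/dev/md/vol0", "--level=5",
     f"--raid-devices={len(nvme_devices)}", "/dev/md/imsm0"],
    check=True,
)
```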
Regardless of how Intel is marketing it, VROC is the right combination of hardware and software to really dot the i's and cross the t's for how these arrays should work, closing the RAID 5 write hole, and not necessarily with battery-backed RAM or static RAM. I mean, they did build themselves a way to do it with Optane, but that's another conversation; you don't need that. It works really well, you get out-of-band management, you get all the enterprise features that you need, and Intel has also left themselves a way to do really inexpensive context switches for data transfers. So even if you're moving things in and out of your AVX registers or your AMX registers, the overhead of switching contexts from a user process to a kernel process for those I/Os is potentially very low. Right now it's Linux MD, but on other operating systems like Windows it's a first-class experience; it's even a better experience than ReFS right now for everything except data integrity. So yeah, VROC kind of is what it's going to have to be to solve a lot of these engineering problems. Shocking, I know, but Intel engineers sort of know what they're doing.

I went down a rabbit hole here looking at VROC performance, then native Linux MD performance, and then also LVM performance, because LVM, the Logical Volume Manager under Linux, also uses some of the Linux MD code under the hood. Linux MD is lower overhead than LVM, and faster, especially on the VROC side. At the end of the day, digging into Linux MD, the options it gives you for closing the RAID 5 write hole and some of the other options it has under the hood basically address a lot of the concerns I have about hardware NVMe RAID. Not only that, but if you use something like dm-integrity to also get read verification when reading back from your array, that sort of closes all of the concerns I have about hardware RAID for really high-speed devices in 2023. And you, the end user: if you're a system administrator and you say, "I'm going to trust the devices to do the read verification, I don't want to give up that performance," that's cool. Sign off on this form here that says you're aware of that, run with your solution, and you're good to go, because if there's ever any kind of data corruption or bit rot in the future, we'll know; there'll be a paper trail, and there'll be fun stuff to do about it. Otherwise you can run dm-integrity, or use the data integrity field on your NVMe device, or move in that general direction, or just verify your backups, or whatever. It's all good.

And just to make sure we're clear, I'm not talking about VROC on the desktop. You can do VROC with SATA devices, which has never made sense to me; that is not a good user experience, especially when we're talking about boot volumes. Maybe it would make sense for a data volume, but there are just not enough PCIe lanes and not enough connectivity on desktop LGA 1700 processors for the kind of thing I'm talking about to apply. You want to run two Gen 5 SSDs on a desktop platform? There isn't even Gen 5 connectivity for storage; it's only the one x16 slot, and getting a motherboard that breaks that out, well, we're not talking about that. VROC did apply to SATA devices way back in the day, and it still can; maybe that makes sense for bulk storage, but I'm really not talking about that. I'm really talking about future PCIe storage.
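Circling back to the verify-on-read point from a moment ago, here's a concept-only sketch to make the idea concrete. The class, the CRC32 checksum, and the in-memory store are stand-ins for illustration; dm-integrity, ZFS, and ReFS implement the same idea at the block or file system layer with very different on-disk formats.

```python
# Concept-only sketch of verify-on-read: store a checksum alongside each block on
# write and re-check it on read, so silent corruption the device never reports
# still gets detected.
import zlib

class VerifyingStore:
    def __init__(self) -> None:
        self._blocks: dict[int, bytes] = {}
        self._checksums: dict[int, int] = {}

    def write_block(self, lba: int, data: bytes) -> None:
        self._blocks[lba] = data
        self._checksums[lba] = zlib.crc32(data)

    def read_block(self, lba: int) -> bytes:
        data = self._blocks[lba]
        if zlib.crc32(data) != self._checksums[lba]:
            # In a real array this is where you'd rebuild from parity or a mirror.
            raise IOError(f"checksum mismatch on block {lba} (bit rot?)")
        return data

store = VerifyingStore()
store.write_block(0, b"hello")
assert store.read_block(0) == b"hello"
```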
And that kind of connectivity really only applies to platforms where you've got at least 40 PCIe Gen 5 lanes, like Xeon workstation or other workstation platforms. If you've been following this project on the Level1 forums, you'll have seen that I've had a lot of fun poking at this. And because I had a lot of questions, Intel invited me to Innovation 2023, where I got to meet a lot of the folks who work on this and get a lot of my questions answered. So big thanks to Intel for making this video possible, and big thanks to Kioxia for sending me the hardware, answering questions, and giving me the stuff to experiment with.

So yeah, Intel VROC. It turns out that taking all the PCIe lanes into the CPU, for all the storage you could ever need, is already more or less what we need to fill the NVMe RAID void, assuming your software doesn't have a better option built in. It's the best of both worlds in terms of latency and throughput, and it's available basically for free on Linux, unless you want out-of-band management, in which case you do have to get the key; on Windows you also have to get the key. But if you've got 100-plus PCIe Gen 5 lanes and you want those Gen 5 lanes to go brrr, then this is probably the best option for you. I'm Wendell, this is Level One, and this has been a look at: okay, if not hardware RAID, then what? I'm signing out; you can find me in the Level1 forums.
Info
Channel: Level1Techs
Views: 133,974
Keywords: technology, science, design, ux, computers, hardware, software, programming, level1, l1, level one
Id: Q_JOtEBFHDs
Length: 28min 59sec (1739 seconds)
Published: Thu Feb 01 2024