What Is ZFS? A Brief Primer

Captions
We get a pretty good question on the forum today, in reply to some of the videos that I've done recently, which is a very basic: what is ZFS, exactly? As in, "I'm a computer enthusiast and I want to know about ZFS, the file system for storing files. I know people talk about it — it's almost meme status at this point." So I'm going to try, in very basic terms, to explain what it does, why it's awesome, how it compares to btrfs and other file systems that are available, and why it's different. There are things that make it similar to existing solutions, but there are also things that make it different.

I'm going to try not to be super philosophical in this video, although it helps me retain things when I have some insight into why things are the way they are, and ZFS is one of those things. It's almost like alien technology — like this random thing was discovered and all of a sudden we have this magical technology. The reality is that Sun Microsystems are the original developers of ZFS — or ZedFS, because some people don't say "zee," they say "zed" — ZFS, the Z file system. Sun, in kind of their death throes, open-sourced it, and then Oracle, the database company, gobbled up Sun Microsystems and their intellectual property and sort of put the re-animator juice in Sun for a little while, in terms of hardware and that kind of thing. But you've got to understand that Sun Microsystems was the super-secretive, Ferrari-type computer company, way ahead of their game. They saw things before anybody else saw them; they saw things before they were really commercially viable. They were operating at basically an IBM or a General Electric or a Westinghouse level of research and development, but they didn't really have the means to commercialize and capitalize on that level of R&D — and with computers this has pretty much always, universally, been true. This sort of came out in my interview with Greg Kroah-Hartman: today's computers barely work. You think a computer is reliable, and a lot of magic has been done to make a computer appear to be reliable, but under the hood it is a fantastic amalgamation of unicorn blood and pixie dust keeping that thing running. It's really completely insane. That was true with Sun Microsystems too — they were doing such incredible, incredible work. You don't necessarily always trust the computer to do what you think, especially if the computer has been on for a very long time. "Just restart your computer and everything will be fine?" No — we want a computer that's just going to run forever, and be very good at it.

So ZFS is all about storing files and having impeccable data integrity, and doing it with basically commodity hardware, because we don't want to engineer custom hardware — or even semi-custom hardware — to deal with this. Not to get sidetracked or waylaid for a second, but let's talk at a very microscopic level: on physical, mechanical, rotating disks, even the 512-byte sector size is not set in stone. There are solutions, like NetApp's — famously, we did a 172-terabyte storage server video where I repurposed NetApp shelves, and by default the drives for those shelves come configured to store 520-byte sectors, not 512-byte sectors. The reason is that extra checksum information is stored in the difference between 512 bytes and 520 bytes: the extra 8 bytes store a checksum, so the computer knows something has gone wrong if the mathematical relationship between the 512 bytes of information you actually want to store and a computation based on those 512 bytes doesn't match.

Spinning rust — I mean, how efficient do you think a record player is for seeking and playing back music? You've got your 300 pounds of vinyl collection that has your whole music library on it, and it's like, "I want to build a mixtape out of this, without having tape." How are you going to mix things together? Well, you could get more than one turntable, I guess — and you've got to unload the record, plop the next one down, get the needle, find the track. That's basically a hard drive. A mechanical hard drive is a spinning platter of rust, writing impossibly small ones and zeros to that platter, with a physical read/write head — a magnetic ceramic head — flying on top of it, going to those bits. It is a fantastical Rube Goldberg machine of neodymium, copper, and data storage, and it is a miracle those even work. So Sun Microsystems faced that sort of mechanical challenge: the drives lie. The drives lie a lot. They'll fail, and they'll not even know that they've failed. Some companies combated that with the whole 520-byte-sector thing. You write a piece of data, then you run that data through a computation — it doesn't matter if it's image data or document data or text data — and it gives you a number, and depending on the algorithm it can be fantastically good at detecting that something has changed, that something is wrong. That checksumming process is sort of the building block of redundancy — of knowing that something is wrong — and it's one of the cornerstones of the functionality in ZFS. (It does a lot of stuff; ZFS is hugely complicated.)

The other thing to keep in mind about mechanical hard drives: when ZFS was designed, flash memory wasn't really a thing, and there was not really anything between memory like RAM, which is very fast, and this Rube Goldberg spinning-rust contraption, which is quite slow by computational standards — and that was the world Sun Microsystems designed ZFS in.
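That 520-byte-sector idea — 512 bytes of data plus 8 bytes of checksum — is easy to sketch. This is just a conceptual toy, not how any drive or ZFS actually lays out its checksums; the function names and the choice of a truncated SHA-256 are my own illustration:

```python
import hashlib

SECTOR = 512          # bytes of user data per sector
CHECKSUM_LEN = 8      # the extra 8 bytes (520 - 512)

def write_sector(data: bytes) -> bytes:
    """Store 512 bytes of data plus an 8-byte checksum (520 bytes total)."""
    assert len(data) == SECTOR
    checksum = hashlib.sha256(data).digest()[:CHECKSUM_LEN]
    return data + checksum

def read_sector(stored: bytes) -> bytes:
    """Return the data only if it still matches its checksum."""
    data, checksum = stored[:SECTOR], stored[SECTOR:]
    if hashlib.sha256(data).digest()[:CHECKSUM_LEN] != checksum:
        raise IOError("checksum mismatch: the drive is lying")
    return data

sector = write_sector(b"A" * 512)
assert read_sector(sector) == b"A" * 512   # clean read passes

corrupted = bytearray(sector)
corrupted[100] ^= 0x01                     # silently flip one bit
try:
    read_sector(bytes(corrupted))
except IOError as e:
    print(e)                               # the mismatch is detected
```

The point is the asymmetry: the drive did nothing to signal the flipped bit, but the stored checksum catches it anyway.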
We're talking about the 90s and early 2000s. Since then, computers — processors, RAM, storage other than mechanical hard drives — have gotten exponentially faster: five billion cycles per second, tens or hundreds or thousands of gigabytes per second for the different levels of memory that hang off the processor. Things have gotten insanely fast. A good old-fashioned mechanical desktop hard drive from ten years ago — say a hundred gigabytes — you could read information off that thing at about a hundred megabytes per second. A nice new modern 18-terabyte mechanical hard drive, you can read at about 200 megabytes per second. So we've gone from a hundred gigabytes to 18 terabytes, and we've increased our transfer speed from 100 megabytes per second to about 200. Wait a minute — the areal density of information on this rust bucket has gone up from 100 gigabytes to 18 terabytes, and you mean to tell me we've only doubled the speed? Yeah, pretty much. NVMe drives — solid-state flash drives — are in the thousands of megabytes per second; several thousands. So 200 megabytes per second, and that's the best-case scenario, is quite slow. And remember that read/write head, the needle: you want to read two songs at once? Ain't happening — unless the drive wastes all of its time bouncing back and forth between two different things.

So while the computer is waiting on the hard drive to do stuff, there's a lot of computational horsepower sitting available. That's what Sun took advantage of with ZFS: these mechanical hard drives, even if we've got a billion of them, are very slow, so we actually have a lot of opportunity to do computations while we're looking for information.

The other really big advantage of ZFS that sort of comes out of that is that ZFS has underlying knowledge of how the disks in your storage pool are arranged. It's really designed with more than one mechanical hard drive in mind, and most file systems are not. With most file systems, you have a device — like a flash drive — and you format the flash drive to use it; I feel like that's a process everybody's familiar with. The deal with ZFS, though, is that it knows a single device on its own is probably unreliable, so it's designed to operate with multiple physical pieces of hardware, in a way that exploits the fact that one of those devices may be lying at the individual-piece-of-data level — at the sector level. It does do that by storing extra information, but it tries to do it in as efficient a way as possible, and I'll come back to that in a second. It sort of wants the information to be striped across multiple devices, because you get into inefficiencies if you do it any other way. See, if I have a piece of hardware that's designed to store redundancy and checksum information along with the actual information, there's not really a performance penalty. But go back to our 512-byte-sector mechanical disks: suppose I want to store an extra 8-byte checksum. The smallest unit a mechanical drive can deal with is 512 bytes, so to store an extra 8 bytes of checksum I've actually got to do two 512-byte writes — 512 bytes for the data, and then that other 8 bytes has to go somewhere else. That's wildly inefficient, and that's true for pretty much everything. Commercially, the people that make these mechanical storage devices went: hmm, we can charge a lot more for that 520-byte capability if we sell it to enterprises and server people. So we've got this one piece of hardware that the home user, who's fine with 512 bytes, can buy for this much — but somebody buying it in the commercial space is going to pay twice as much, or three times, or five times as much. That's the IBM model. Hey, NetApp — we're back to NetApp; that's how NetApp works. The NetApp storage appliances said: we'll do 520. We'll do all the engineering necessary to make these things deal with 520-byte sectors, so we've still got our 512 bytes — it's backward compatible if anything needs direct access — but we can also store our checksum information.

Sun sort of saw the writing on the wall, the way Google later would. Google's computer infrastructure was literally made out of garbage — they just had motherboards lying on cardboard. Philosophically, you weren't supposed to do things like that; you were supposed to buy really expensive, five-nines, super-reliable hardware. Google said: we're going to literally run our stuff on garbage, and we're going to have software such that if a machine bursts into flames and catches fire, well, eject it from the cluster and let's just continue computing, man — doesn't really matter. That was heretical thinking, and ZFS kind of treats mechanical storage the same way: you've got a mechanical hard drive that's going to store information and may lie to you. Say you've got a pool of four mechanical hard drives. ZFS has mechanisms where it will try to stripe the writes — if you've got a lot of information to write, it'll try to keep all four disks busy — but in such a way that it's not wasting a lot of I/O storing the extra checksum information, while still taking advantage of the fact that you can divvy up that workload and get a theoretical performance bump of up to four times. The problems that you run into with ZFS I will come back to, because it's not always a perfect abstraction. When you're just doing a regular flash disk and you want to format it, the file
system really doesn't know how to deal with multiple physical devices — that doesn't work well. You need a file system that's really complicated, where the software for the file system is really complicated, so that it can deal with multiple physical devices. Now, a traditional RAID controller — you may have seen one; it's just a physical PCIe card that your drives plug into — is really just a computer on a card. It's a computer within a computer; it's turtles all the way down. That computer on a card is doing some extra calculations, and storing some extra information — maybe, theoretically, possibly — on the drives you write to, in hopes that things will fail in a predictable way. The problem, as I noted in some very, very ancient videos, is that most RAID controllers really don't do a lot to store extra information. Take RAID 1, for example: it's a perfect mirror. In a RAID 1 configuration the RAID controller is storing a perfect mirror across two drives, and the assumption from system administrators is that if they remove the RAID controller from the equation and just plug in one of the drives from the mirror, it will continue to work normally. So the RAID controller really can't manipulate the file system or store any extra checksum information beyond what the file system itself supports — and newsflash, most file systems don't support storing much extra redundant checksum information, or anything else to know that the information the drive is returning is correct. So you could have a hardware RAID controller with a RAID 1 mirror — two physical drives, two physical copies of the information — and let's say you've got a document you're working on, your big presentation. You save it, you come back the next day, you load it, and it's corrupt. You could still actually have a copy of your presentation that is not corrupt on the other hard drive in the mirror. The problem is, there's not a hardware RAID controller on the market that will tell you which one of those files is correct. It actually gets into data forensics: you send both drives and the RAID card off to a data recovery place, they image both drives looking for differences, and then a human operator literally picks different combinations of things, trying to find a file that's in a valid format, or something like that. And the reason, again, is that the format the RAID card uses is not proprietary — it doesn't store extra information. If it did, the performance would tank, because of the whole 512-byte-sectors thing.

It gets a little more complicated when you talk about RAID 5, but it's not until RAID 6 that hardware RAID controllers actually have a leg to stand on in terms of figuring out who's lying. The reason is that with RAID 6 you don't just have redundancy information — you have two sets of redundancy information, on two different drives. So instead of my presentation file living intact in a mirror on two physical devices, in a minimum RAID 6 configuration — that's four drives — my file is broken up into two chunks, with two bits of redundancy information, spread across all four drives. If one of the drives is lying, it's possible to use the information on the other three drives to figure out which drive is lying, and it's possible for the RAID controller to return the corrected file to me — by assembling those four chunks, figuring out which chunk is bad, tossing that chunk, maybe throwing an error, and sending the information on. Four drives, incidentally, is also the minimum drive configuration for RAID 10, and RAID 10 is great because it gives you both high performance and high IOPS.
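The "who's lying" distinction can be sketched with simple XOR parity. This is a toy, not real RAID controller firmware: single parity (RAID-5-style) can rebuild a drive whose failure you already know about, but when a drive silently returns wrong data, it can only tell you the stripe is inconsistent, not which drive to blame — that's what the second redundancy set in RAID 6 (or a checksum, in ZFS's case) buys you:

```python
from functools import reduce

def xor(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length chunks."""
    return bytes(x ^ y for x, y in zip(a, b))

# Three data "drives" plus one parity "drive" (single-parity, RAID-5-style).
data = [b"\x11" * 4, b"\x22" * 4, b"\x33" * 4]
parity = reduce(xor, data)

# If we KNOW drive 1 died, XOR-ing all the survivors recovers its contents.
survivors = [data[0], data[2], parity]
recovered = reduce(xor, survivors)
assert recovered == b"\x22" * 4

# But if a drive silently returns wrong data, single parity only says
# *that* the stripe is inconsistent, not *which* drive is the liar:
data[1] = b"\xff" * 4                  # silent corruption, no error reported
assert reduce(xor, data) != parity     # inconsistency detected, culprit unknown
```

With a second, independent redundancy calculation you get two equations instead of one, which is enough to solve for the single bad chunk.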
The problem with RAID 10, as it's implemented on every hardware RAID controller that I've ever tested, is that when one drive gets corrupt or out of sync with the others, it doesn't know which drive is lying — which drive has bad information and which drive has good information. With RAID 6 it's possible to compute that, and of course if you have a six- or eight- or ten-drive RAID 6, it's possible to compute that too. A lot of RAID controllers implement a function called patrol read, and all patrol read does is scan for inconsistencies. In the case of my RAID 1 inconsistency — where one drive has a correct, working presentation file and the other drive does not, for whatever hardware-failure reason — it is possible, depending on which drive the controller goes with, that the inconsistency will be "corrected" in a way that corrupts my presentation. If the drive itself throws an error, sometimes that's a clue for the RAID controller: if the drive tells the controller, hey, I'm not feeling so good, I'm going to return data but I'm also generating some errors, then some RAID controller firmware will say, okay, this drive is throwing errors, let's maybe not use its data, let's use the data from the other drive in the RAID 1 mirror. At the slightest error it may eject that drive — but then again it may not. And then again, the drive may not report to the RAID controller that it is failing at all.

Philosophically, ZFS is a completely different situation. ZFS does not trust anything to operate correctly, and it especially does not trust hard drives to operate correctly — it does not trust hard drives to self-diagnose. A hard drive can be a silent carrier; it can be infected but asymptomatic, as it were. So ZFS is constantly challenging the drives, and the information on the drives, mathematically, with that checksum. ZFS is unique because it really is a file system, but it also manages the volume — I'm introducing some new terminology here: the volume is the thing your data is stored in, which spans a whole bunch of disks — so the file system is aware of how the disks are physically laid out, how many disks there are, that kind of thing. Think about the other scenario I was describing, with a RAID array: you've got multiple mechanical hard drives presented to the host computer as one large device that can store information, so when you format it as NTFS or ext2 or your Linux or macOS whatever, the file system doesn't know there are multiple physical drives involved. With ZFS, it does know. ZFS handles redundancy itself, because you can tell it how redundant you want it to be: there's RAID-Z1, where one drive can suffer failure; RAID-Z2 and RAID-Z3, where two or three drives can suffer failure; and ZFS also has straight mirroring. The difference is that ZFS does actually store extra checksum information with its mirroring, so it will know when a mirror is lying. The penalty for that, of course, is I/O: you have extra I/O to deal with in order to enjoy that level of redundancy and that level of reporting. Also, not a lot of operating systems are guaranteed to support ZFS, although at this point in history we have pretty decent support for ZFS on Windows. It was going to be a macOS thing — ZFS was going to be the next-gen macOS file system, from what I understand — but then Oracle did some saber-rattling and Apple sort of canceled including it. Linux has ZFS on Linux, which is quite good, and FreeBSD is sort of the first-class citizen for ZFS.
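To make the RAID-Z and mirror terminology concrete, here's what creating those pool layouts looks like with the actual `zpool` CLI. The pool name `tank` and the `/dev/sdX` device names are hypothetical placeholders; substitute your own (these commands need real disks and root, so treat this as a reference fragment):

```shell
# RAID-Z1 across four disks: any ONE drive in the vdev can fail.
zpool create tank raidz1 /dev/sda /dev/sdb /dev/sdc /dev/sdd

# RAID-Z2 would survive two failures (needs at least one more disk):
#   zpool create tank raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde

# Or a simple two-way mirror — still fully checksummed,
# unlike a hardware RAID 1:
#   zpool create tank mirror /dev/sda /dev/sdb

# Show the vdev layout and the health of every disk in the pool.
zpool status tank
```

Note that the redundancy level is a property of the pool layout you choose at creation time, which is part of why you can't casually grow a RAID-Z vdev one disk at a time.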
ZFS has probably a billion dollars of software development and engineering time in it — probably more. It is an incredible file system, but with that comes a lot of overhead. Because ZFS has all this useful stuff built in — the checksumming, the awareness of your physical disks, letting you manage volumes — it also has this concept called datasets. You can create a dataset on the file system, and it's kind of like a volume on LVM — like a sub-volume — but you can actually change some of the parameters of how it works, like compression, per dataset. There are a lot of features in ZFS, things like deduplication — although that makes it even slower and adds more overhead, and generally you shouldn't use it unless you have a lot of memory and you're building a dedicated storage appliance — and you can turn things like deduplication on and off not at the global level, but at the dataset level. A dataset is just a slice of your storage, but it's not as hard a boundary as a partition. You've probably heard that you can take a disk and partition it up into thirds, and you've sort of hard-set it: this partition is a third, this partition is a third, this partition is a third. With ZFS — and with most other modern logical volume managers, like LVM on Linux, or LVM2 — it works differently. You can just say, "I need three buckets," and depending on how much stuff is in each bucket, the underlying storage mechanism will take care of putting it wherever it needs to go and keeping everything straight. So if, by the time you're actually filling up your storage medium, one of the buckets is twice as big as the other two, that's okay — you don't have to decide that ahead of time with most modern logical volume managers, or with ZFS. So that's all the cool stuff about ZFS, and that's why it's kind of different: it's a file system and volume manager and device manager sort of all rolled into one. But it also brings some downsides.
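Before the downsides, here's what that dataset "buckets" idea looks like in practice. Pool and dataset names (`tank`, `media`, `backups`) are made up for the example; the commands and property names are the real `zfs` CLI:

```shell
# Datasets are cheap; make one per use case and tune each independently.
zfs create tank/media
zfs create tank/backups

zfs set compression=lz4 tank/media    # transparent compression, per dataset
zfs set dedup=on tank/backups         # dedup per dataset, not pool-wide
zfs set quota=500G tank/backups       # a soft bucket limit, changeable any time

# All datasets draw from the pool's shared free space — no hard partitions.
zfs list
```

Unlike repartitioning a disk, a quota can be raised or removed later with another `zfs set`, which is exactly the "you don't have to decide ahead of time" property.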
For ZFS to do its fancy recovery thing, you can't really add a single disk at a time to a storage pool, and ZFS really works best when you have more than a single disk. If you have a single disk with ZFS, it can generally only tell you that the data is corrupt somehow, and you might not be able to get it back — there may not be extra redundancy information on the disk to recover your information. Depending on your setup, I won't say universally that you can't recover it, but it's a bad situation: you've got a single disk formatted with ZFS, and ZFS is telling you, hey, I'm getting data corruption. You may be wondering, how much of a problem is silent data corruption really? It's not an insanely huge problem, but it definitely can occur, and you've probably experienced it without even realizing it. If you have a large photo or movie collection, or an MP3 collection you've had for years — decades, maybe — and one day you notice that one of your songs cuts off, or you're looking at a really old picture and half the picture is fine while the other half is messed-up noise and weirdness: that's bit rot. You got file corruption in your storage media. Some device said, yeah, here's your file, here's what you gave me — and it's like, why are there coffee stains and just terribleness all over this? "I don't know, you gave it to me like that." No, hard drive, I didn't. You don't necessarily notice until you go to load the thing again, look at it, and realize that's not what it should be. With ZFS you get an early warning that something bad has happened and you need to deal with it. That's called ZFS scrubbing, and by default it's a scheduled process that happens weekly, and generally it goes quite fast — it's just however long it takes to scan your data. Now, if we're talking about 18-terabyte hard drives that only do 200 megabytes per second, you do the math: it takes a while to get through 18 terabytes at 200 megabytes per second. So depending on how often your data changes, and some other parameters, scrubbing once a month might be more appropriate than once a week.

Sometimes you see ZFS compared with other file systems. There are options like btrfs — "butter FS" — and btrfs philosophically wants to address some of the same things as ZFS. But bottom line, ZFS is much more mature than btrfs. That's going to get downvotes, and people are going to complain, but it's the truth. btrfs is nice because it has less overhead than ZFS, and in some scenarios it can be much faster, but generally, if you care about your data integrity, ZFS has been better in my experience. btrfs has historically had some fairly famous bugs around RAID 5 and RAID 6 that were not super well tested — a lot more development effort should have gone into unit testing and integration testing, into writing automated tests that run through gigabytes and terabytes of actual integration testing and failure simulation, to be sure the code worked correctly. ZFS has most if not all of those test fixtures in its development pipeline. btrfs might be nice someday, but it's not quite there.

There are some rough edges with ZFS, and with ZFS on Linux. For that extensive checksumming to go fast, you need access to processor features that the Linux kernel does not necessarily provide the cleanest access to for modules and things like that — there was a little bit of a kerfuffle about that recently. So performance may or may not be there with very high-speed devices, like NVMe devices, especially if you have an NVMe array; ZFS hasn't caught up yet. Remember, ZFS was developed in the 90s and 2000s, when mechanical storage was glacial by comparison.
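That "you do the math" scrub-time remark from a moment ago is easy to check with back-of-the-envelope arithmetic (best-case sequential throughput, ignoring seeks, which only makes it worse):

```python
capacity_tb = 18         # modern mechanical drive
throughput_mb_s = 200    # best-case sequential read speed

capacity_mb = capacity_tb * 1_000_000      # 18 TB expressed in MB
seconds = capacity_mb / throughput_mb_s    # time for one full read pass
print(f"{seconds / 3600:.0f} hours")       # -> 25 hours
```

So a single full pass over one 18 TB drive is roughly a day of sustained reading — which is why, on big pools, a monthly scrub schedule can make more sense than the weekly default. (`zpool scrub tank` is the command that kicks one off manually.)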
I mean, mechanical storage is still just terrible — super, super slow compared to how fast everything else in the computer has gotten. ZFS is catching up on fast devices, but it hasn't quite caught up, and it doesn't have anybody spending an insane amount of R&D money on the version that's given away for free. Oracle might have some really amazing solutions here, but the open source stuff, not so much. So it can be faster to run, say, Linux LVM plus some other pieces versus just ZFS on NVMe. You can totally run ZFS on NVMe — RAID-Z1 on NVMe, or mirroring; ZFS mirroring on NVMe is very fast, with much less overhead than RAID-Z. Still not as fast as other file systems, and definitely not as fast as ext4, which doesn't have as many features. I've done some other videos on this, like setting up RAID on Ubuntu and on Arch Linux, setting it up with mdadm — which is the old-school way of doing it — and also with Linux's LVM2. Red Hat has put a lot of work into LVM, the logical volume manager for Linux, and it now supports RAID properly: you can take a bunch of disks, assign them as the storage medium in LVM, and then tell LVM, I want this volume to have this level of redundancy, kind of like RAID, and LVM will handle it for you. You can get RAID-6-type functionality with a very low performance overhead for the checksumming it does, because it does the writes in such a way that you don't end up with extra writes to the device you're storing information on — the I/O operation for the parity ends up going to a specific device, so there's not as much of a performance penalty overall. But all that RAID stuff in LVM is really the subject of a different video, and I've done those; you can check them out if you want even more information.

Hopefully this gives you a really kind of high-level understanding of what ZFS is and what it does. In a nutshell: ZFS combines file system management, volume management, and device management — but also data management, in the sense that no matter what you give it, when it gives you something back, it will make sure, mathematically, that what it's giving you is what you gave it. Otherwise it will fail. It will say: I don't know, you gave me a thing, this is what it should be, I don't have it, it's broken, the device is here, the pool is degraded, restore from a backup — I can't give you what you gave me, and I've made sure of that mathematically. The chance of it failing in such a way as to return bad data and not know that it's bad is a near mathematical impossibility. Computers barely work; if they didn't work at all, then maybe, but because they barely work, and because of the structure that's set up here, if something is going wrong mechanically, ZFS will let you know. It's not unique in that capability, but it is so hardened, and it has had so much engineering time and so much developer love put into it over the years — not just from the open source community but as a commercial product — that I feel confident trusting it with any data I want to keep forever. It's not a substitute for backups or anything like that, although it does have a lot of features there: if you have your stuff stored in a ZFS pool and you have a remote ZFS system somewhere else, there are tons of helpful things built into the file system to send only the changed parts from the local ZFS system to the remote ZFS system. Sun really did think of everything.
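That "send only the changed parts" feature is snapshot-based replication, and it looks like this with the real `zfs` commands. Dataset, snapshot, and host names (`tank/media`, `@monday`, `backuphost`, `backup/media`) are hypothetical:

```shell
# Take point-in-time snapshots of a dataset.
zfs snapshot tank/media@monday
# ... a day of changes later ...
zfs snapshot tank/media@tuesday

# Full initial replication of a snapshot to a remote pool over SSH:
zfs send tank/media@monday | ssh backuphost zfs recv backup/media

# After that, send only the blocks that changed between the two snapshots:
zfs send -i tank/media@monday tank/media@tuesday | ssh backuphost zfs recv backup/media
```

Because ZFS already knows exactly which blocks changed between two snapshots, the incremental `zfs send -i` doesn't have to walk and compare the whole file tree the way rsync would.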
Imagine Google-level IQs and engineers working on this stuff 10 or 20 years before Google — that was Sun. We didn't have the internet; we didn't have bozos on YouTube waxing poetic about how amazing it all is. That's why Sun enjoys the cult, hacker, super-amazing legendary status that it does — and ZFS is not the only technology Sun came up with that's like this, where it's just some sort of crazy alien technology, literally right out of Star Trek, because it's so insane. This has been 30 minutes of an introduction to ZFS, but if you find yourself wanting more, or you use ZFS for your day job, definitely check out the books by Jude and Lucas, ZFS and Advanced ZFS. There's a lot of hard-won battle knowledge in there, and they are well worth the price — even if you're just going to use ZFS in your home lab, it's worth having these books on your shelf. This has been a Level One explanation; I guess it probably should go on the Linux channel, so let's do that. Thanks again to our patrons and our Floatplane subscribers — they make this kind of content possible. And also, maybe you can get a haircut in a week or two, as things start opening back up here.
Info
Channel: Level1Linux
Views: 104,918
Keywords: technology, science, design, ux, computers, linux, software, programming, level1, l1, level one, l1Linux, Level1Linux
Id: lsFDp-W1Ks0
Length: 31min 49sec (1909 seconds)
Published: Fri Jun 12 2020