"OpenZFS and Linux" - Nikolai Lusan (LCA 2020)

Captions
First up is Nikolai, and he's talking about OpenZFS.

Thank you. Just before I get started, can I have a show of hands of who's actually used ZFS? Good, that makes a lot of what I wanted to talk about easier, and I can skip some of the beginner stuff. So: OpenZFS and Linux. Who am I? I'm Nikolai Lusan; that's my email address, and you may see me around IRC under that username or a variant thereof, generally on Freenode or OFTC. Sorry, things have gone a little bit awry because the remote I wanted to use isn't here.

OpenZFS now has native encryption. Since version 0.8, OpenZFS supports native encryption as well as improved TRIM support for SSD devices and other NAND storage. So it's a good time to get in, because OpenZFS can now replace most of the tools you would previously have used to build a comparable storage stack.

I have to mention the licensing. ZFS was originally developed by Sun and eventually released under their Common Development and Distribution License, and the CDDL and the GPL are considered incompatible. Lawyers have been consulted, and most distributions will not ship with ZFS support built in. There may be customised ISOs out there for your distribution, or you may have to roll your own; I personally keep a live boot CD around with all the tools I like to use, and use that for all my rescues and installs.

Filesystems are fun, aren't they? No, they're not, really. Why not? Well, data loss: filesystem problems have been known to cause a lot of data loss. Notably, about 12 years ago there was a b-tree bug in XFS that caused a lot of people to lose a lot of data. There's also bit rot, which is an actually studied phenomenon caused by hard drives themselves; you can go and research that yourself, it's very interesting reading. And performance: storage is often the biggest bottleneck in delivering services. Not network, not CPU, not RAM; often it's storage, and you just cannot get stuff through the pipe fast enough.

So why use ZFS? Well, it's cool. I can't deny that I was a little bit dubious at first, but I came around. It's stable and established: it was developed well by Sun and by the people who carried it through illumos into the OpenZFS project and ZFS on Linux. It's robust: it performs really well, it provides really good redundancy, and the scrub tools keep your data very safe; I've seen a quote around somewhere that said if you're not using ZFS, you're losing data. It's got really good performance out of the box: just installing it, building a pool and running it, you'll get really good performance, and even better if you tune it for your particular system and your particular application. It scales well: ZFS, the zettabyte filesystem, is designed to store massive amounts of data and do it effectively. It allows better use of disk space: a lot of the features that have come from the days of Sun, and have been improved on since, let you store more on your disks than you previously could, at minimal cost I might add. And it has more features than any other filesystem: the most comparable filesystem we had under Linux was btrfs, which Red Hat has deprecated; development work on it seems to be slowing, and even it did not have a comparable feature set to ZFS. The ZFS on Linux project is now the main contributor to the OpenZFS project.
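As an aside that isn't in the talk itself, here is a minimal sketch of the native encryption mentioned above, assuming OpenZFS 0.8 or later; the pool name "tank" and dataset name "secure" are hypothetical:

    # Create a natively encrypted dataset, prompting for a passphrase
    zfs create -o encryption=on -o keyformat=passphrase tank/secure

    # After a reboot, load the key before mounting the dataset
    zfs load-key tank/secure
    zfs mount tank/secure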
It was also designed with systems administrators in mind, so it's very easy for systems administrators to manage storage through ZFS: the tools themselves are very easy to use and give you a whole lot of information. It changes your approach to data storage: whereas previously (and I've got some oversimplified slides that I'll show you in a minute) it was very hard to grow storage, shrink storage or move stuff around, this is a completely different approach to anything we've had on Linux. When I first started using it, it was the most revolutionary thing I'd seen since LVM came on the scene. It works well on bare metal and in virtual environments; I run it on both, and the good thing about running it on hosted VMs is that you don't have to worry about the backend, that's someone else's problem. And there's the built-in ability to share over anything: NFS, Samba, iSCSI; if you can run a daemon for it on Linux, you can configure it to be shared out from the ZFS toolset.

So, the traditional filesystem way. Again, this is overly simplified, and I don't have my laser pointer on me, but basically, on this diagram (oh look, there's my mascot) everything in red is a block layer device, from the raw disks up through your RAID, through your encryption layer, to your logical volumes, and then you set your filesystems on top of that; even your LVM volumes are block devices. So there's a lot of work being done there, and there are a lot of different tools involved in managing it: mdadm to manage RAID, LUKS to do your encryption, LVM, and all of those wildly wonderful tools we've all used.

So how is ZFS different? It's copy-on-write: no data is destroyed in place; instead, a new copy is created when you write, so the old data is still there. That actually makes it faster, because you're not seeking back to the same inode on a disk or RAID array or whatever your storage happens to be. It abstracts the storage from the disks, which I love. It has internal measures (and I'll go into this a little later, because it's key to some of the tuning) that replace a lot of the machinery Linux has traditionally used for filesystems, including caching. It uses pools of virtual devices rather than individual devices tied together through a bunch of different tools like I showed before, and the data is stored in datasets, which are similar to logical volumes but far more configurable and far more flexible. Again, this diagram is oversimplified, but everything in red is a block device and everything in purple is a filesystem: from the minute you create a ZFS storage pool, you have a mountable filesystem, something you can write data to.

We'll get more into vdevs a bit later; configuring your vdevs properly can increase your performance dramatically, and I have some horror stories I've heard from people in the community actually using these things. So, vdevs: virtual devices. They can have different geometries: each vdev can be a different size and a different geometry, and you use multiple vdevs to create your storage pool. The vdevs are pooled together to create your storage space, and writes are striped across your vdevs, which increases the throughput of your storage system. Losing a vdev does mean losing data, so when you're building a system you want to build in redundancy, otherwise you're losing data.
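To make the striping-across-vdevs point concrete (this sketch is not from the talk; the pool name and device names are hypothetical), a pool built from two mirror vdevs has its writes striped across both mirrors, much like RAID 10:

    # Create a pool of two mirror vdevs; writes are striped across the mirrors
    zpool create tank \
        mirror /dev/disk/by-id/ata-DISK1 /dev/disk/by-id/ata-DISK2 \
        mirror /dev/disk/by-id/ata-DISK3 /dev/disk/by-id/ata-DISK4

    # The pool is immediately a mountable filesystem at /tank
    zpool status tank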
Pools are made up of one or more vdevs, and writes are spread over them. They are mountable filesystems in their own right, as I just said, and there are many pool-level attributes that are inherited by datasets: when you create a pool, you can set and forget a bunch of features that will automatically be inherited down, things like record size, the ACL method, all of those sorts of things. Pools can be moved from one machine to another: you can literally take all of the disks from a pool, stick them in another machine, and import the pool on a running machine. So in a disaster recovery situation, where the hardware, the CPUs, have burnt down or whatever, but you can still extract the disks, you can still get your data back. When creating pools, you need to remember that some of the settings are immutable: they cannot be changed once set, and to change them you have to create another pool. In some cases it's the same with datasets.

Datasets: this is where you actually put your data and do most of your work. They're created from the ZFS pools, and they have a tunable set of attributes; there are a lot of tunables in ZFS, and I've got a slide for it later, but there are something like 230 kernel module settings and, depending on the feature set you're using, over a hundred tunables for each dataset. Some of the attributes cannot be changed after creation. Datasets are mountable in arbitrary locations: every time you create a dataset you can set a mount point, and it doesn't matter where in the pool, or in what pool, it sits. You can have /var on one pool and /var/spool on another pool; you can mount any dataset anywhere you want. I recommend, when you're creating your datasets, that you actually build them in a hierarchy, even if some of those datasets are empty ones set to canmount=off.

Zvols: these are ZFS volumes, so they're block devices. This is a way you can create a block device as part of the ZFS system that you can share out. (Sorry, I keep going back a slide; my Bluetooth remote isn't getting enough connection.) They have multiple uses, including swap: if you're running ZFS on root, you're probably using a zvol as your swap device, if you have swap at all. Among those uses is sharing out via iSCSI, for building SANs and for hosting virtual machines, and you can do all sorts of other things with them: they're similar to loopback filesystems, or you can put loopback filesystems on them. They can have an arbitrary block size, but they're not as performant as raw datasets; later on I'll talk about tuning ZFS for running virtual machines, and there's a little bit of controversy in the community as to how you should do it, but the internals of ZFS mean that creating a zvol puts extra steps in the kernel chain to actually write data out. They can be exposed to the operating system in different ways: you can expose just the partitions on the volume, or you can expose them as if they were a raw disk.

ARC, L2ARC and SLOG. The ARC is the adaptive replacement cache, and it's very similar to the filesystem caching you would see when using ext4 or XFS or whatever your filesystem of choice previously was. It takes up a lot of RAM: by default the ARC will take up to half of your RAM unless you configure the kernel module properly.
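A minimal sketch, not from the talk, of capping the ARC with the kernel module setting just mentioned; the 8 GiB value is only an example:

    # /etc/modprobe.d/zfs.conf
    # Cap the ARC at 8 GiB (value in bytes) instead of the default ~50% of RAM
    options zfs zfs_arc_max=8589934592

The same parameter can also be changed at runtime through /sys/module/zfs/parameters/zfs_arc_max.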
Because of this, it's not recommended to run ZFS on a machine with less than, some people say 2, some people say 4, gigabytes of memory.

The L2ARC is the level 2 ARC: it holds pages that have been evicted from the main ARC in RAM and written to a disk, and normally that disk would be faster than your main storage. So if you're using rotational drives as your main form of storage, which we all know are slow but, let's face it, they're cheap, you can have your L2ARC on an SSD or a NAND device like an Optane or an M.2 drive.

The SLOG, the separate intent log, is what ZFS uses to speed up writes, to make things look like they're happening faster. It's kind of like a journal, and it can save you from data loss between boots if you have a crash. Basically, in the write process, when ZFS goes to write a page out, if a SLOG device exists it will write to the SLOG first; the idea is that that device is faster than your backend storage, so you get a faster response. The SLOG just stores the information and is periodically flushed out. If you're moving around a lot of data, and I've moved a lot of data back and forth between pools, you can see the size of that intent log grow dramatically, so you need to take that into account as well.

The ZFS tools. zpool is what you use for creating and maintaining your storage pools: you use it to add disks, to remove disks, and to set those top-level features. Then there's the heavy lifting, which is the zfs program itself: that's used to manage your datasets, to set attributes and see what they are, to see how much data is being stored in each dataset, snapshots; all sorts of information is mainly accessed via that. There's also zed, the ZFS event daemon, which is a very useful tool, especially if you're doing a lot of snapshotting, because at the end of an event it can run a script: you can set the daemon up to actually run something, so if you do a snapshot you can then trigger moving that snapshot around, or maybe deleting an old snapshot; there's a lot you can do with that. And there's zdb, which you can use to get even more information out of ZFS about the state of your storage system; it's very useful if you're debugging problems.

Creating vdevs: always use a disk or partition name, and while I say disk or partition, it's better to feed it the entire disk, because ZFS will sort out the partitioning of that disk based on the sector size. Use names that will remain constant. If you're doing very large installs, where you're using SANs that have multiple storage bays, you can actually map a particular bay in a storage cabinet to a particular vdev; there's a bit more involved in setting that up, but it's doable. And remember that not all the disks need to be the same size: when making a vdev, the vdev will be the size of the smallest disk, so in a mirror it will be the size of the smallest disk in the mirror, and in RAID it will be the combined size minus whatever parity. Not all the vdevs in your pool need to be the same type, either; you can mix and match to your pleasure.
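Looping back to the L2ARC and SLOG described above, here is a minimal sketch, not from the talk, of attaching them to a pool with zpool; the pool and device names are hypothetical:

    # Add a fast SSD as an L2ARC (cache) device
    zpool add tank cache /dev/disk/by-id/nvme-FASTDISK1

    # Add a small, power-loss-protected device as a SLOG (log) device
    zpool add tank log /dev/disk/by-id/nvme-FASTDISK2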
There are three vdev types: a single disk or partition; a mirror, with no limit on the number of devices, so you can have multiple levels of redundancy in a mirror, triple, quadruple, however you want; and raidz. raidz itself is the equivalent of RAID 5, raidz2 has double parity, and raidz3 has triple parity.

This is just an example of the Linux disk names. I've obviously blurred out the serial numbers of my devices, but you can see that, using the /dev/disk/by-id path, you can identify each individual disk and feed that name to ZFS. The only other thing there that is immutable is the WWNs; sorry, they are listed there, at the end of the by-id entries, as well. So you can use the ATA path names, or whatever SCSI path it happens to be, or the WWNs: often they're the only things that don't change, so when you move those disks, or things line up differently, you're still safe.

Snapshots: they provide a glimpse of a dataset taken at a point in time, and they can be used as a form of backup; I cannot stress how important that actually is. Snapshots are mountable and readable, so you can recover files from them and do what you want with them. They take up space, but only the delta from the most recent snapshot, and over time, as you delete snapshots, that all gets aggregated into the newest snapshot. They're not automatically deleted, so they need to be managed, and there are existing tools: zsnapd is the one I would recommend; there's also zfs-auto-snapshot, which is a bash script, and a variant of it that does send and receive as well. Snapshotting can be enabled or disabled per dataset.

Using snapshots as backup: ZFS allows you to do a send and receive of any dataset or pool, possibly recursively, between two pools. You send from one and receive on the other, and you can do it on the same machine or over a network; the pools don't even need to be on the same machine, like I said, and it doesn't even need to go to a dedicated pool. I have my VMs backing up to a separate pool with backup sections, so it's /server/backup/servername, and I dump the entire pool into that. The most common transfer is via SSH (setting up SSH securely between your own network of machines is your problem; I've got it handled on mine). mbuffer is another common tool, but the problem I've found with it is that it leaves the zfs receive hanging; technically you could do the same thing with netcat or any tool that lets you transport data over a network. There's a sketch of the SSH approach below.

Tunables: like I said, just a default grep of the module information shows over 230 tunables for the kernel module alone, and over 75 for each dataset.

Compression and deduplication: these help you get some more performance and some more space, although deduplication doesn't necessarily work how you think it does. There's native filesystem-level compression, and these are the four options you have. I'll start with lzjb: that was the first compression method designed for ZFS, designed specifically for filesystems, to be fast to compress and fast to decompress. lz4 is its replacement, and it's the default when you turn compression on. You can also use gzip at all levels between 1 and 9; I compress my logs at gzip-9 and then don't bother with compression in logrotate. And zle is a very fast compression method that only compresses zeros. Deduplication is RAM intensive, and block deduplication in ZFS only works when you're copying data around within a ZFS pool itself. I only enable it where I'm doing development and building, because with those workloads you're often copying files around and moving a lot of things within a particular tree structure, and that's where it sees the most use.
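A minimal sketch, not from the talk, of the snapshot-and-send backup flow just described; the pool, dataset, snapshot and host names are all hypothetical:

    # Take a recursive snapshot, then replicate it to another machine over SSH
    zfs snapshot -r tank/vms@nightly
    zfs send -R tank/vms@nightly | ssh backuphost zfs receive -F backup/server1

    # Later snapshots can be sent incrementally against the previous one
    zfs snapshot -r tank/vms@nightly2
    zfs send -R -i @nightly tank/vms@nightly2 | ssh backuphost zfs receive backup/server1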
It's no use if you're storing video files and think you're going to get a little extra space by turning on dedup: you're just wasting your RAM, because it will not detect duplicate blocks within those files. But both can help you squeeze more storage out of a smaller amount of disk, particularly compression on email, logs, SQL and source code; there are a lot of things it's useful for.

Optimisation for everything: you want to tune the ARC to fit your needs. If you're just running a storage server, you might want to increase the amount of RAM it uses; if you're using ZFS on a database server or a web server, you're going to want to leave some room for the application to run and to do its own caching. You can tune the metaslab performance so that data is spread more evenly over your devices; the more vdevs you have, the more speed you're going to get, it's like RAID in that way. And you can tune the ARC and L2ARC performance, and the TRIM limits for SSD storage.

An easy tuning for most people is to create your pools using ashift=12, which puts them on a 4K block: most modern disks use a 4K block, a lot of SSDs are now coming out with 8K blocks, and older drives use 512-byte blocks; there's a list of devices in the source code that actually lie about having 512-byte blocks when they're really 4K. Enable lz4 compression, it's worth it: it'll use minimal CPU and save you some space. The default record size of 128K is a good idea. And disable things that you don't need: atime is a big killer for disks. Setting logbias to latency (the other option is throughput) actually makes your writes faster; setting it to throughput means it will verify that the data has been written out to disk before it continues. Setting sync to standard will use the standard POSIX way of doing things, or it can be disabled.

For MySQL or MariaDB, and this is for InnoDB only, you set the record size to 16K, because that's how InnoDB reads; primarycache to metadata only; and logbias to throughput, because the database does its own synchronisation. It's a similar setup for Postgres. Sorry, I'm running out of time here. On running VMs, and I'll just bring all of this up for a moment: the use of zvols is not recommended, and you should make sure that all of your record sizes and everything line up with the filesystems you're keeping on those VMs. I recommend using qcow2 files on a dedicated dataset; in my experience you actually get better throughput that way than you do using zvols. It's a controversy, it's an argument, and it would appear that I'm out of time. The full set of slides will be available with notes attached, and that's just some resources that are available.

[Applause]
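Recapping the tuning advice above in command form; this sketch is not from the talk itself, and the pool and dataset names are hypothetical:

    # Force 4K sectors at creation time (ashift is immutable once the pool exists)
    zpool create -o ashift=12 tank /dev/disk/by-id/ata-DISK1

    # General-purpose settings suggested in the talk
    zfs set compression=lz4 tank
    zfs set atime=off tank

    # A dataset tuned for MySQL/MariaDB InnoDB data
    zfs create -o recordsize=16k \
               -o primarycache=metadata \
               -o logbias=throughput \
               tank/mysql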
Info
Channel: linux.conf.au
Views: 6,078
Rating: 4.8591547 out of 5
Keywords: lca, lca2020, linux.conf.au, linux, foss, opensource, NikolaiLusan
Id: 7tFL0NFBwUc
Length: 30min 8sec (1808 seconds)
Published: Wed Jan 15 2020