Explaining ZFS LOG and L2ARC Cache: Do You Need One and How Do They Work?

Video Statistics and Information

Reddit Comments

Thanks for sharing this.

👍︎︎ 10 👤︎︎ u/lawrencesystems 📅︎︎ Mar 17 2022 🗫︎ replies

Hi, this is good but there are a few points it might help to add:

Sync writes are written out to ZIL blocks, but any outstanding async writes within the same sync domain are also written out when that sync write comes through. So, if you have async writes to a file or zvol that have not yet been committed, and then a sync write comes through on the same file or zvol, all those async writes will be made durable. This is central to providing a consistency guarantee.

It’s also one of the leading ways that people get ZVOLs “wrong”. A zvol is a single sync domain, so to let async data accrue in memory, separable sources of cache flushes (like filesystem journals) should go to a separate zvol; otherwise performance will suffer badly.

One of the biggest consequences of using a SLOG is that all sync writes can go via “direct sync” - they are written literally to ZIL blocks. Without a SLOG, large writes go through the “indirect sync” path, which causes RMW and compression and checksumming to happen inline with the sync write request. Inline RMW can destroy sync write performance and amplify IO. This effect is often greater than just being able to move the ZIL writes to another device.

In addition, blocks written by indirect sync consume an extra metadata block which is fragmented by the data block. Reading them later can double read IOPs.

While a TxG commit happens every 5s by default, that doesn’t mean you can just use that as a yardstick. The transaction group has to then pass through both the quiescing and synchronization phases, which can take additional time. In addition, small ZIL writes can take double the space, since each one comes with a metadata block “header”. It’s much safer to assume you need 1/2 of the ARC size, as this is twice the max dirty data that is normally held in RAM.

SLOG devices should be on their own namespace if on nvme, and should be overprovisioned.
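As a rough illustration of the "separate sync domain" point above (a sketch only; the pool and zvol names are made up for the example):

    # Data zvol for the guest; async writes can accrue in RAM between TxG commits
    zfs create -V 50G tank/vm1-data

    # Separate small zvol for the guest's filesystem journal, the main source of
    # cache flushes, so its flushes don't force the data zvol's async writes to disk
    zfs create -V 4G tank/vm1-journal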

👍︎︎ 5 👤︎︎ u/taratarabobara 📅︎︎ Mar 19 2022 🗫︎ replies
Captions
Tom here from Lawrence Systems, and I have a new ZFS shirt, so it seems like the appropriate shirt to wear for this topic: ZFS. Now, I did a video called "ZFS Is a COW," but obviously there's more to the story. The copy-on-write semantics are really important to understand, but the next question might be: what about those specialized vdevs, such as log or cache, and don't those also play a big role in how data gets to, or egresses from, a particular pool? You are correct, but there's a little more to the story. It's actually a very complicated topic, one you'll find plenty of discussion about on many different forums, so I wanted to put some numbers together and show you how this works, not just talk from an academic standpoint. I'm going to have everything time-indexed, because there's going to have to be a little bit of talking to explain how these things work, and I'll try to be as concise as possible. Then we'll get to the demo, where we show how you add or remove these vdevs on the fly inside of TrueNAS and what happens when you do this while a VM is doing reads and writes back and forth, looking at some of the numbers and how things adjust. Ultimately it's functional, to teach you how it can be done inside of TrueNAS, and academic, so we can understand what's going on behind the scenes. I'll also leave links to all the articles I talk about in a forum post I have that dives into a lot of different ZFS topics. I am a big ZFS fan, obviously. For those of you who don't get the joke on the shirt: I have been called a ZFS cult member because I keep preaching about the wonderfulness of ZFS, and it is a wonderful file system, so that's the joke about the shirts. Yes, there's a link to those in case you're curious; if not, you can skip right over that. Nonetheless, before we get into the details of this video: if you'd like to learn more about me or my company, head over to lawrencesystems.com. If you'd like to hire us for a project such as storage consulting, there's a hire-us button right at the top. If you'd like to support this channel in other ways, there are affiliate links down below to get your deals and discounts on products and services we talk about on this channel.

Let's start by talking about how write caching works in ZFS. There are two types of writes: asynchronous and synchronous. An asynchronous write provides immediate confirmation when ZFS receives the write request, even though the write is still pending and has not yet been committed to disk. This frees up the application that is waiting for the confirmation and can provide a performance boost for applications with non-critical writes. That's because ZFS is essentially lying to the application: it said the write happened even though it hasn't yet. A synchronous write guarantees integrity by not confirming the write until the data has actually been written to disk. This type of write is used by consistency-critical applications and protocols such as databases, virtual machines, and NFS; we'll do VM and NFS demos later in this video. All pending writes are stored in the ZFS write cache, which lives in RAM. RAM is volatile, meaning its contents are lost when the system reboots or loses power. To maintain data integrity for synchronous writes, each pool has its own ZFS intent log, or ZIL, residing on a small area of the storage disks. Pending synchronous writes are stored in RAM and also logged to the ZIL simultaneously, but only synchronous writes go to the ZIL.
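As a quick aside, that sync behaviour is controlled per dataset (or zvol) by the ZFS sync property; a minimal sketch, with the dataset name lab-pool/vms assumed purely for illustration:

    # Show how this dataset currently handles synchronous write requests
    zfs get sync lab-pool/vms

    # standard = honor sync requests from applications (the default)
    # always   = treat every write as synchronous
    # disabled = acknowledge sync writes immediately (fast, but the "lie" described above)
    zfs set sync=standard lab-pool/vms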
ZFS is a transactional file system: it writes to storage using transaction groups, which are created every five seconds. These groups are atomic, meaning the entire contents of a transaction group must be committed to disk before it is considered a complete write. That is the copy-on-write mechanism I discussed in the "ZFS Is a COW" video linked down below. So what happens if a failure such as a power loss or kernel panic occurs? Everything in RAM, including all pending transactions and asynchronous write requests, is gone. If the failure interrupted a transaction group in the middle of a write, that transaction group is incomplete, and the data on disk is now out of date by up to five seconds, which can be a pretty big deal on a busy server. But keep in mind that the pending synchronous writes are still in the ZIL. On startup, ZFS will read the ZIL and replay any of those pending transactions in order to commit those writes to disk. It's actually a pretty solid system: pending synchronous writes still get written without any loss, and no more than five seconds' worth of asynchronous writes would have been lost. Five seconds, as I said, can be a big deal, but just so we're clear, only asynchronous writes are lost, because the synchronous transactions get replayed. This comes at a cost in performance, though, so let's talk about performance and where that ZIL data is stored. For applications using synchronous writes, having the ZIL reside on the storage disks can result in poor performance, because the ZIL writes and reads must compete with all the other disk activity. This can especially be a problem on a system with a lot of small random writes. The solution, if your workload requires synchronous writes, is to move the ZIL to a dedicated log vdev, and there are a few things to keep in mind if you want to do this. The log vdev requires at least one dedicated physical device that is used only for the ZIL. You should mirror these devices for safety and redundancy in case one of them fails. The devices should be really fast, but they do not have to be very big: the ZIL only needs enough capacity to hold the synchronous writes that have not yet been flushed from RAM to disk, and that flush happens every five seconds. So how much data is that? Let's use a 10 gigabit connection to your TrueNAS server as an example. At 10 gigabits per second, the maximum possible throughput, ignoring overhead and assuming one direction in a perfect situation, is 1.25 gigabytes per second, and because the flush happens every 5 seconds, the most you would see written to your log device is 5 times 1.25, totaling 6.25 gigabytes. That's it. You're hearing that correctly: roughly 6.25 GB is the absolute most you would write to that log device, so if you bought a one terabyte drive for it, you're probably not going to use most of it. This, as I said, is why so many log devices are so much smaller. Now let's talk about the ZFS read caches, of which there are two. The first is the ARC, which stands for Adaptive Replacement Cache. It is a fairly complex caching algorithm that tracks both cached blocks and blocks recently evicted from cache to figure out what to cache. The ARC is all in RAM, which leads people to think ZFS is memory hungry, but it's not; ZFS just does not let any unused memory go to waste. It should also be noted that the ARC is a native part of how ZFS works, and the more memory it has to work with, the better it performs at caching.
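Later in the video this is done through the TrueNAS web UI, but for reference, the command-line equivalent of adding a mirrored log vdev is roughly the following (the pool name and device names are placeholders, not taken from the video):

    # Rough sizing: 10 Gbit/s is about 1.25 GB/s, and about 5 seconds of dirty data
    # is about 6.25 GB, so small devices are fine.
    # Attach two small, fast devices as a mirrored SLOG so the ZIL stops competing
    # with the data disks.
    zpool add lab-pool log mirror /dev/nvme0n1 /dev/nvme1n1

    # The new devices appear under a "logs" section in the pool layout
    zpool status lab-pool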
Now the L2ARC, or cache vdev, is a storage device you attach to your pool. It is a much simpler system compared to the ARC, allowing for efficient write operations at the expense of hit ratios; it's a fairly simple caching system. Of note, the L2ARC will rarely have a cache hit ratio as high as the ARC. That is expected and not a bug: the L2ARC is just a simpler, cheaper way of caching more data than the ARC can fit. The most-demanded blocks are always going to be available in the ARC, and the L2ARC is just a catch-all for some of the marginal blocks, hence the lower hit ratios. Now the real question: when should you use an L2ARC for read cache? For most users the answer is simple: you shouldn't. That's it. The L2ARC needs system RAM to index it, which means the L2ARC comes slightly at the expense of memory for the ARC. Since the ARC is an order of magnitude or so faster than the L2ARC and uses a much better caching algorithm, you need a rather large, repetitive set of requested data that exceeds the ARC for the L2ARC to become worth having. So if your goal is to have a fast cache for frequently accessed blocks, buy more memory for that system; it's the best investment you can make, and it's why high-performance ZFS systems always have so much memory in them. It's also worth noting that if you have a write-intensive workload, or don't frequently request the same data, an L2ARC isn't very useful either. The goal is always to have as much memory as possible; that's where you're going to get the best read performance. It's really that simple. But if you do have a workload that exceeds your RAM, or you have maxed out the RAM capacity, then maybe it's time to consider an L2ARC device. Of note, you do not have to put these in the system in pairs, because they only cache data that already exists on the storage vdevs. If one were lost in a failure, it would be inconvenient but not catastrophic: if the data is not in the cache, ZFS just pulls it back from the drives like normal.

Now for the fun part: playing with the demo lab we have set up, running TrueNAS 12.0-U8. This essentially works the same in TrueNAS Core as it does in SCALE; our demo system here happens to be a Core system. We have 15.9 GB of RAM, and this is really not a high-performance system. I want to bring that up because we're not going to be running a whole lot of benchmarks; we're just going to be doing functional things and showing how to get them done. Because I have a VM running on this, attached over NFS, ZFS has decided to cache things with the ARC, as we talked about. It's not that it's RAM hungry; it'll use as much RAM as is available. With 15.9, call it 16, gigs of RAM available, the cache is currently able to use 11.7. And yes, in case you're wondering, this resizes dynamically based on the services running: if I add more services or install something like a jail on here, it automatically scales back, only using what it perceives as free memory not needed by anything else. So that's dynamic and nothing you really have to worry about. Now, the system itself is attached via NFS to XCP-ng. I have a demo VM right here, up, live, and running, attached as I said over NFS, and we have sync disabled. Let me cover that setting real quick: we go to Storage, then Pools, click the little buttons there, edit the options, and we can see sync is disabled.
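If you prefer a shell to the TrueNAS reporting screens, those ARC numbers can be pulled directly; a sketch, noting that the tool may be named arc_summary or arc_summary.py depending on the release, and that the sysctl path applies to FreeBSD-based TrueNAS Core:

    # Summarized ARC size, target size, and hit/miss ratios
    arc_summary | head -n 40

    # Raw counters on TrueNAS Core / FreeBSD
    sysctl kstat.zfs.misc.arcstats.size \
           kstat.zfs.misc.arcstats.hits \
           kstat.zfs.misc.arcstats.misses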
Now we're going to run a simple test. We're SSH'd into both systems; the top one is the Ubuntu VM, and down below, I'll actually show the command because people always wonder: we're just running zpool iostat -v with the name of the pool, which is the lab pool, and a 1 on the end, meaning refresh every one second. It gives us a really simple readout that just keeps re-reading and telling us what the stats are; we'll make it a little bigger so you can watch it populate. Then we're going to run some fio. Specifically, I have this set up to run fio with a write test, so we're just going to write about a gig of data here and see what happens. Here we see it writing, and you can see the data down here as it refreshes. All right, we're bringing quite a bit of data to the drives. By the way, this pool is just RAIDZ1 with three SATA SSDs in it, not a particularly fast system, but good enough. Looking at utilization and queueing, we're writing at about 93.1 MB a second, so right there are our writes, and if we go back over to the XCP-ng side and look at how it perceives this, there are the disk writes we just did, and we can see the throughput as it goes through. Not great, but not bad either for what it's writing.

Now, on the fly, you can change this, so we're going to set sync to always. By the way, we do not currently have a SLOG device installed, but let's go ahead and run that same test again after we make this change. So we're going from 93.1 MB a second; we just rerun the same command and see how fast we can write with sync set to always. It looks like we're writing at about 13 MB a second instead of 70-plus. This is that forced commit. When you have sync disabled, ZFS is essentially lying to the XCP-ng system, saying "oh no, we have that write absolutely committed." Now, because we're saying "don't tell XCP-ng, or whatever is writing to this particular NFS mount, to send more data until we have actually synced it, because we're not ready, we haven't synced it yet," this results in extremely low performance, even with a lot of drives. This is where you run into the need for a SLOG, or to turn sync off. Currently it's at 21 MB, 17 MB a second, and it's going to take a little while for this to finish because it's writing so slowly.

Let's go back over here to TrueNAS, and now we're going to talk about adding a log device. If we look at the disks, there are those three SATA drives we're using for the pool, called the lab pool, and here is the NVMe drive we put in; we happened to have a small one laying around and it's snapped into the system. So we go back over to the pool, and by the way, we can do this while it's reading and writing, we don't have to shut anything down. We go here, we want to add a vdev, choose the type, which is log, find the NVMe drive, select it, and that's it. You're going to get a warning that a stripe log vdev may result in data loss if it fails combined with a power loss, since you don't have more than one device, so we have to type the word "force". They really want you to know that if you do not commit these in pairs, there is the potential that you'll lose this device along with what was in RAM, and if that happens during a catastrophic failure you can have a problem. So they're reminding you that these should be set up in pairs. We go ahead and add the vdev; "added disks are erased and the pool is extended", it's just letting us know that's what is going to happen to the block device we're adding.
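For reference, the test loop used through this part of the demo boils down to two commands; the fio job itself isn't shown on screen, so the parameters here are assumptions rather than the exact ones from the video:

    # On the TrueNAS box: live per-vdev stats for the pool, refreshed every second
    zpool iostat -v lab-pool 1

    # Inside the Ubuntu VM: sequential write of about 1 GB, with a final fsync
    fio --name=writetest --rw=write --bs=1M --size=1G \
        --ioengine=libaio --direct=1 --end_fsync=1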
Actually, let's check real quick: the write test did finish, and you can see it wrote rather slowly. We'll double-check, then show this screen, hit Add Vdevs, jump over here, and just like that, we've added it. It now shows up right here as the log device. Here it is, it's got 13 gigs, but back to what I said earlier: it doesn't have to be that big, which is fine and makes this a little more ideal. Now, there are small writes and transactions the system is going to do with this VM anyway, so there's a little bit of data you'll see popping in here, but let's go ahead and run that write test again and see if we can beat our 17 MB a second. We run fio and see if this little device makes a difference. Hey, look, we've got some data going over here. We're not quite up to the previous performance, and that comes down to the system: we were hitting in the 70s before, but we're up to close to 60. So you can see the immediate effect of adding this: it definitely improved our write performance, not to where it was with sync disabled, but still reasonably good. So yes, absolutely, we improved, and it has now committed and flushed out some of that transactional data; it sits with about 362 MB left even though the file we are testing is one gig. As I said, it's not exactly a one-to-one ratio between the size of the file you create and what ends up in the intent log, and as it expires out and commits, we're back to not really having anything left.

All right, go back over here and let's remove it; you can add these or remove them freely. We go to Status by clicking on the pool, click on the device, and we can just drop it. Confirm you want to drop this, absolutely, and it's gone. Well, almost; I got a little ahead of myself there. That's it, now we've removed it, and if we go back over here you can see it doesn't show up anymore.

So switch back here, go to Pools, and it's the same process again to add it as a cache drive. But before we add it as a cache drive, let's go ahead and run fio again. I don't care about a write test anymore, so we'll comment that part out; this is just a random read test. fio is just a really simple utility, that's why I'm using it; it's not in-depth, and this is not precise benchmarking, just something to give you a general idea of what happens. So if we run this and look down here, you're going to see much heavier activity, and we'll see where we get with this. It's going to take a few seconds to complete, but we're doing this at about 63,000 IOPS, and we should get a summary at the end telling us how fast this is with the random writes, sorry, random reads, because this is a random read test. If we go over here, you'll actually see something kind of interesting: even though we're seeing these slower speeds at the disks, it's able to get a substantially higher throughput. The reason for this is really simple, and we'll switch back and forth to show it. I even put a larger file in here, but it's also part of the way the Adaptive Replacement Cache works. If we go back in here and run fio, if you can read through where it's wrapping right here, we actually sized this to be a 10 gig file. But even though it's a bunch of random reads from a 10 gig file, remember, at the beginning I mentioned there are 16 gigs of RAM and 11.7 gigs of it are all for this cache.
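Likewise, the random-read job isn't shown in full, so the parameters below are assumptions; something along these lines exercises 4k random reads against a 10 GB file inside the guest:

    # Random 4k reads from a 10 GB test file; with enough RAM on the storage side,
    # most of these end up served straight from the ARC rather than the disks
    fio --name=randread --rw=randread --bs=4k --size=10G \
        --ioengine=libaio --direct=1 --iodepth=32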
This is why you don't see a ton of reads and writes in the disk activity: you're seeing a massive amount of this reading being served straight from the ARC. That is ideal. It doesn't necessarily make it easy for me to benchmark, because it looks like an unrealistic number, and you might say, "Tom, you're not measuring your drive speed." Well, do we have to? Don't we want to see the efficiency of what's actually going on in a VM, and isn't a running VM frequently asking for a lot of the same data? So the more memory you stuff in here, the better, because that's where the hot data lives. The ARC is saying: yes, those are the objects I put in there, this is the hot, in-demand data, you keep requesting it, so let me keep giving it to you. Let's go over to Reporting, and if we switch over to ZFS, this is how you end up with ARC hit ratios that are really, really good. Here we go, hits versus misses: 512,000 hits versus 5,000 misses, and those are quite good numbers, exactly what you're hoping for.

Now let's go ahead, on the fly, and, in case you're noticing and wondering "hey, what are these over here where it says L2?", that's from a demo I was doing before I hit record. Let's go back over to the storage pools, and now that I have recording on for this demo, we're going to add a cache vdev: same process, add it over here. These do not need to be in pairs or redundant, and the reason is really simple: it's just data read from your existing storage pool and pulled into the cache, so it's not a big deal if it's lost, just annoying, and it will repopulate; if the system loses it, it can go "oh, I know where that was" and pull it back in. So now we've added this, and we see the cache device is here. Knowing that, let's run that random read test again and see what happens. Currently a little bit of data is going in there just from the frequent writes of the running VM; remember, the L2ARC is a more basic cache, and it's going, "all right, I'll put some of this data in here." If we run the test it's going to populate a lot faster, so we'll start seeing the writes populate in, but we're probably not going to see much change at all in performance. As a matter of fact, the IOPS are roughly the same; if we scroll back, it's not substantially different, and the reason is that it's still pulling from the ARC. We have to fully exceed the ARC in order for the L2ARC to start being effective. There's still at least some data being swapped out and populated in here, and that data will stay there, and this can be beneficial in certain workloads. For example, you may need a cache if you have lots of large files that many people are requesting; if a large file is bigger than the ARC can hold, it's likely to get successful hits from the L2ARC. But overall, the reality is that the performance difference between RAM and a block storage device is massive, so if you can, better for your performance if not necessarily for your budget, put more memory in the system and it will cache better. The caching still works, the adaptive cache still puts some data in the L2ARC, but it's going to take a little while before it exceeds the memory.
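For completeness, the CLI equivalent of adding, inspecting, and later removing a cache vdev looks roughly like this (again, the pool and device names are placeholders, and the sysctl path is for TrueNAS Core / FreeBSD):

    # A single, non-redundant device is fine for L2ARC; it only holds copies of pool data
    zpool add lab-pool cache /dev/nvme0n1

    # L2ARC hit/miss counters alongside the ARC ones
    sysctl kstat.zfs.misc.arcstats.l2_hits kstat.zfs.misc.arcstats.l2_misses

    # Cache (and log) vdevs can be removed from the pool at any time
    zpool remove lab-pool /dev/nvme0n1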
So we have to exhaust the ARC. I don't really have a demo prepared for that, but I guess I can try one more test where we just make the file big. Let's go ahead and do this and see where it goes: instead of a 10 gig file, let's say 25 gigs, and run the test again with the 25 gig file. Since it's a different amount of data, and it's supposed to be a random file it creates, hopefully it will exceed the ARC and actually start pulling a little more from disk, and we'll see some read operations. Let's scroll this up a little bit, but you should see some more read operations on this cache device, because as you can see from the earlier tests, right now it's just not getting read operations. So this larger file did result in a change: we have 169 seconds here, so a little bit slower than our earlier run; if we scroll up, there's the 295. If we look at this under XCP-ng, we can see the test we did here, then here is the writing out of that large file, and back over here at the end it peaked again, probably when it was pulling from the ARC. The first part was very random, and that is probably more realistic, but once we got to the part where it was repetitive and the ARC was going "nope, I have this" and was able to send all the data, we see it ramping back up, because it was asking for repetitive data.

Now, as I said at the beginning, I'll leave links to many of the other topics I've talked about around ZFS, planning, storage pools, ZFS being a COW, and the whole list of articles I have over in my forums, because there's still a lot more to understand. But I hope this video gave you a better understanding than you had before of how some of this works. It is a complicated topic; one of the reasons it's misunderstood is its complexity, and understanding file system performance is a little tricky, but it's still what makes ZFS such an awesome system. I just love the fact that we can do this on the fly, no rebooting or anything, just "hey, let's add these block devices, remove them," and understand a little better how they work. I really encourage playing with this in your own lab; that's how I gained a better understanding of it before I started doing this commercially, and it's one of the reasons I like spending time talking about it, because I think we need more people in this industry who understand it. It's just a fascinating topic, and I imagine a lot of you, if you've made it to the end of this video, found it an interesting topic as well. As always, if you want to have a more in-depth discussion about this, head over to my forums, and I'll see you in the next video. Thanks, and thank you for making it all the way to the end of this video. If you've enjoyed the content, please give us a thumbs up. If you would like to see more content from this channel, hit the subscribe button and the bell icon. If you'd like to hire us for a project, head over to lawrencesystems.com and click the hire-us button right at the top. To help this channel out in other ways, there's a join button here for YouTube and a Patreon page where your support is greatly appreciated. For deals, discounts, and offers, check out our affiliate links in the description of all of our videos, including a link to our shirt store, where we have a wide variety of shirts, and designs come out, well, randomly, so check back frequently. And finally, our forums at forums.lawrencesystems.com are where you can have a more in-depth discussion about this video and other tech topics
covered on this channel. Thanks again for watching, and I look forward to hearing from you.
Info
Channel: Lawrence Systems
Views: 78,219
Keywords: LawrenceSystems, zfs file system, l2arc, l2arc truenas, l2arc vs slog, l2arc hit ratio, l2arc vs zil, l2arc tuning, l2arc metadata, zfs l2arc, zfs slog, zfs slog vs l2arc, zfs slog ssd, zfs slog drive, zfs slog optane, zfs slog device, zfs slog nvme, zfs file system explained, ZFS Cache, ZFS Write cache, ZFS Read Cache
Id: M4DLChRXJog
Length: 25min 8sec (1508 seconds)
Published: Wed Mar 16 2022