Hello everyone! Everyone's favorite cluster is
back and it's bigger than ever! These guys here are my $250 hyper-converged Proxmox and Ceph cluster, and they're getting a new best friend today: some spinning rust. So today we are going to talk about a number of more advanced topics in the Proxmox and RADOS Block Device world, such as mixing solid-state drives with hard drives and changing the failure domain. This thing is the only node that has hard drives, so if you want to store data on hard drives, we're going to have to tolerate the fact that if this node goes down, we lose those pools. We're also going to talk about SSD-accelerated pools for virtual machines using separate metadata and data pools. I'm still just focusing on RBD, or RADOS Block Device, for this video; that's the underlying storage you use for virtual machines and containers. We're not going to talk about CephFS, the file system, yet, but I'll get to it eventually, I promise. So come along with me on this journey as we explore the ins and outs of Proxmox pools, metadata, data, all of that fun stuff.

Since the last video, I reinstalled everything, so I'm now running Proxmox 7.3, and I added big store to the cluster; that is the HP MicroServer. I already have the three thin clients, so that makes four. I am going to speed-run setting up the Ceph cluster.
If you haven't watched my previous video, go up there and watch it; I show in detail in that video how to set up Ceph. So we're back to roughly where we were at the end of the last video: I have my four nodes, they are in a Proxmox cluster, and I've installed Ceph on top. I'm using Ceph 17 this time, because it came out since my last video, and I've added all my OSDs. My three thin client nodes each have their 128 gig flash drive, so that's OSD 0, 1, and 2, and then my big store has four hard drives, which are two terabytes each, plus the almost-half-terabyte SSD. So that's all I've done so far: I've added the OSDs and I added the manager, which is here, and we're going to use the manager today to play with things.
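If you're following along from scratch, here is roughly what that speed-run boils down to with the pveceph commands; this is just a sketch of the steps from the previous video, and the network range and device path are placeholders for your own setup:

    pveceph install                      # on every node: installs the Ceph packages
    pveceph init --network 10.0.0.0/24   # once, on one node; placeholder cluster network
    pveceph mon create                   # on each node that should run a monitor
    pveceph mgr create                   # at least one manager daemon
    pveceph osd create /dev/sdX          # once per drive you want to turn into an OSD

To use the Ceph management GUI (the dashboard) like I do below, you also need the ceph-mgr-dashboard package installed and the module turned on with ceph mgr module enable dashboard.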
In the previous video, when I showed you how to create pools, we came here: we went to Proxmox -> Ceph -> Pools -> Create and created our pool with a size of three. But there's more to it than just what Proxmox gives us, so instead of creating our pools in Proxmox, we're going to use the Ceph management GUI instead, because it has a lot more options.
We see basically the same information here: I have four nodes running monitors and one running a manager; you don't need a manager on every node. I'm not doing file systems yet. If you go to Pools, we have the manager pool, and this is where we're going to go on the Ceph side to create pools. We can click Create and we see a lot more options: we can say erasure coded or replicated, we can add applications, we can choose different settings. This is how we're going to manage our pools now that we're doing fancier things.
Now is probably a good time to introduce failure domains. You can think of the failure domain as the level at which Ceph will implement the redundancy we've requested. In a traditional RAID or ZFS system, you're probably used to building a single system and designing for the failure of individual drives; in Ceph, that corresponds to the OSD failure domain, since one OSD is usually equivalent to one drive. This tells Ceph that it needs to put the redundant data on separate OSDs, so a failed OSD doesn't cause us to lose data. The next level up from OSD, and Ceph's default, is host. This tells Ceph that we not only need to keep redundant data on separate drives, but on entirely separate hosts. Configuring this means we can lose an entire server, or more depending on what our redundancy rules are, and continue operating. This is quite powerful for high-availability systems, because now we can have highly available storage just as we would have highly available virtual machines. If you build larger clusters or need more fine-grained failure domains, you're free to create more, such as a shared chassis, a rack, an aisle, etc., but that's out of the scope of this video.
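If you want to see the hierarchy Ceph is actually working with, the CRUSH map of hosts and OSDs that those failure domains refer to, you can dump it from any node; the output will obviously reflect your own cluster:

    ceph osd tree             # shows the CRUSH hierarchy: root -> host -> osd, plus each OSD's device class (hdd/ssd)
    ceph osd crush rule ls    # lists the CRUSH rules currently defined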
So the first thing we're going to do is add a rule that forces data to be stored only on SSDs. For whatever reason we want to store our VM disks on SSDs; maybe we want our boot drives on SSDs and our data drives on hard drives, whatever. So we create a pool and give it a name, whatever we want (I'm calling mine ProxLabSSD). Pool type is going to be replicated, which is the same as we were doing before, with a replicated size of three. This gives us the same configuration we had in Proxmox in the last video, which means three copies of all the data will be kept, and a minimum of two of them must be available for the pool to keep serving I/O. Next we have to add an application, and our application is usually going to be RBD, which is RADOS Block Device; that is what we use for virtual machine disks, anything that emulates a hard drive. Then we're going to create a new rule here, instead of just using the replicated rule, and we'll call this the replicated SSD rule. Failure domain can be either OSD or host, and for device class we could let Ceph decide, or we can say SSD. With this rule in place, whatever we put on this pool using this rule must be stored on separate hosts and must be stored on SSDs. There we go: Create Pool. So now we have a pool being created; it's replicated, for RBD.
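For the command-line inclined, the same thing can be sketched roughly like this from any node; the pool and rule names are just the ones I'm using here, and the PG count is a placeholder (the autoscaler can manage it for you):

    # CRUSH rule: replicate across hosts, but only onto OSDs whose device class is ssd
    ceph osd crush rule create-replicated replicated_ssd_rule default host ssd
    # create the pool with that rule, set the usual 3/2 replication, and tag it for RBD
    ceph osd pool create ProxLabSSD 32 32 replicated replicated_ssd_rule
    ceph osd pool set ProxLabSSD size 3
    ceph osd pool set ProxLabSSD min_size 2
    ceph osd pool application enable ProxLabSSD rbd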
Then we come over into Proxmox and add that as storage: Datacenter -> Storage -> Add -> RBD. It already found it, because it's the only one here, ProxLabSSD, and we're going to let it be used for disk images and containers, and let's go.
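Adding it from the shell instead of the GUI would look roughly like this; pvesm is Proxmox's storage manager, and the storage ID is just whatever label you want Proxmox to show:

    # register the Ceph pool as Proxmox storage for VM disks (images) and containers (rootdir)
    pvesm add rbd ProxLabSSD --pool ProxLabSSD --content images,rootdir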
Now, you might have noticed that our storage efficiency isn't great so far. Storage efficiency is the percentage of our raw disk capacity that we are actually able to use. With our three-times-replicated pools we are keeping three copies of all of our data; this is fantastic for redundancy, but unfortunately it means our storage efficiency is only 33 percent. You might compare this to using RAID 1 with three disks in a mirror. Thankfully, Ceph has a solution to this: erasure coding.
You can think of it similarly to RAID 5 and the higher RAID levels, although it's much more flexible in Ceph. With erasure coding, we take the original data block and split it up into multiple shards; then, using an algorithm known as Reed-Solomon coding, we compute additional shards containing error-correcting information. As long as we still have the original number of shards in good condition, in any combination of original shards and error-correcting shards, we can recompute the entire original data block. The number of shards the data is broken into is known as K, and the number of additional error-correcting shards is known as M. The total storage efficiency is K / (K + M). We require at least K + M separate nodes in our chosen failure domain, and we can tolerate the failure of up to M of those nodes. In larger clusters we can use very wide erasure codes, such as 10+4, to get good storage efficiency and still tolerate the failure of four nodes. In small clusters we can use the minimum code of 2+1 to get 66 percent storage efficiency and the ability to tolerate the failure of a single node.
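To make that formula concrete, here are the numbers for the codes just mentioned:

    2+1:  efficiency = 2 / (2 + 1)  = 66%, needs at least 3 nodes in the failure domain, tolerates 1 failure
    10+4: efficiency = 10 / (10 + 4) = 71%, needs at least 14 nodes, tolerates 4 failures
    and for comparison, 3x replication: 1 / 3 = 33%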
So we're going to create an erasure coded pool now. We're going to call it ProxLabErasure and make it an erasure coded pool. We need to check EC Overwrites if we're using either RBD or CephFS; if you're using RGW, the RADOS Gateway, then you don't need EC overwrites. Our application is going to be RBD again, and we're going to make a new profile. If I want to create this on just my hard drives, the most I can do would be a three plus one, which means I can tolerate one drive failure, or I could do a two plus two, which means I can tolerate two drive failures. I'm going to do a three plus one, because I like to live dangerously, so let's get on with this. Failure domain I'm going to say is OSD, because all of my hard drives are in the same node, and device class we're going to say HDD, which means we have four OSDs and we're using a two plus one erasure code. Let's go.
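The CLI version of that profile and pool looks roughly like this; the k and m values below are just the 3+1 example discussed above, and the profile and pool names are mine, so swap in whatever you actually chose:

    # erasure code profile: shards spread across OSDs (not hosts), restricted to HDD-class OSDs
    ceph osd erasure-code-profile set proxlab_hdd_profile k=3 m=1 crush-failure-domain=osd crush-device-class=hdd
    # create the pool with that profile, allow partial overwrites (needed for RBD/CephFS), tag it for RBD
    ceph osd pool create ProxLabErasure 32 32 erasure proxlab_hdd_profile
    ceph osd pool set ProxLabErasure allow_ec_overwrites true
    ceph osd pool application enable ProxLabErasure rbd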
So what happens if we try to add this guy into Proxmox? We go to Datacenter -> Storage -> Add -> RBD, and well, it lets us pick it. What happens if we try to use it? It's a trick question, but you'll see why I'm asking: pick the same settings as the other one, Ubuntu desktop, VirtIO block with discard, yes... and we got an error, and the error says, blah blah blah, unable to create RBD: operation not supported. So why is that?
If you go to the documentation, it says erasure coded pools do not support omap, so to use them with RBD you must instruct them to store their data in an erasure coded pool and their metadata in a replicated pool; this means using the erasure coded pool as the data pool. So that's what we're going to do. Any pool will work as our metadata pool, it just has to be replicated, and any pool will work as our data pool, with any settings, as long as it has allow_ec_overwrites turned on if it's erasure coded.
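This split is the same thing the rbd command line exposes directly; as a sketch, creating an image by hand with its metadata in the replicated pool and its data in the EC pool would look like this (the image name is just an example):

    # image header and metadata (omap) live in ProxLabSSD; the data objects land in ProxLabErasure
    rbd create --size 10G --data-pool ProxLabErasure ProxLabSSD/test-image
    rbd info ProxLabSSD/test-image    # shows the data_pool the image is using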
So if I go back here where we had the error, well, we've got to delete it, because it's not working. We'll add a new one for RBD, ProxLabErasure, but I need to pick a pool here that is actually going to store the metadata, because whatever I pick here in the GUI is what's going to store my metadata, not my data. So I'm going to pick the SSD pool; I already have VMs in that pool, and it's fine for these to share, I just need to pick a pool that is replicated. So we're going to pick the SSD pool and add it. Now, if we were to just use this as is, it would still be storing our data on the SSD pool, which is not what we want, so we need to go into the config file and tell Proxmox where the data pool is, where it should actually put the data, which is our ProxLabErasure pool. So go to the shell, and we're going to edit a file: /etc/pve/storage.cfg. This file lists all of the storage we have in Proxmox, so I have my NAS, which I use for ISO images, I have local, which comes in by default, I have ProxLabSSD, which we added, and I have ProxLabErasure, which we added.
Here we see the pool field is ProxLabSSD, and if we go back to the Ceph documentation, it told us we need to have a data pool, and the data pool needs to be our erasure coded pool, which in this case is named ProxLabErasure. So I'm going to copy that name, and we're going to add a new field to this file: tab over and say data-pool ProxLabErasure, and save it. Now, every time we store something on ProxLabErasure, it's going to store the metadata about the existence of the volume and so on in our SSD pool, but the actual blocks of data are going to get stored in our erasure coded pool.
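For reference, the finished entry in /etc/pve/storage.cfg ends up looking roughly like this; the content line reflects whatever you ticked in the GUI, and any other fields Proxmox wrote for you should be left alone:

    rbd: ProxLabErasure
            content images,rootdir
            pool ProxLabSSD
            data-pool ProxLabErasure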
So if I try to create that VM 301 again... looks like there were no errors. I hope you guys enjoyed that walkthrough of fancy pools in Ceph, at least fancy for RADOS Block Device. For the curious, I ran through the full Ubuntu desktop installer on both of my virtual machines, the one on the replicated pool and the one on the erasure coded pool. Both of them came out to 11.4 gigabytes of VM storage space, which for the replicated pool meant it took three times that, and for the erasure coded pool, as expected, it took one and a half times that. I also looked at how much metadata it was using, and the metadata was only two kilobytes, so you don't need much space in your metadata pool, at least not for RBD.
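If you want to check those numbers on your own cluster, the stored-versus-raw-used figures are visible from the command line; a quick sketch:

    ceph df detail         # per-pool STORED vs USED shows the replication / erasure coding overhead
    rbd du -p ProxLabSSD   # provisioned vs actual usage of the RBD images registered in a pool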
Next up in this series, I'm planning on diving into CephFS, the file system, and man, is that a big topic, much more complicated than RADOS Block Device; that's why we did RBD first. There is one topic I know I'm going to get questions about, so I'm going to address it right away, and that is cache tiering. Yes, Ceph does support cache tiering of pools, and yes, you can use it for RBD, but the documentation suggests that it's not a great use case, and it gives some reasons why if you read it. So I'm going to leave cache tiering for a future video, probably on RADOS Gateway or CephFS. As always, don't forget to like and subscribe for all my future videos, especially that CephFS content I've mentioned that's coming up eventually... someday... when I get around to it. I have a Discord server down below, which you can find if you'd like to chat with me about any of this. I have a website here too, feel free to read it, and as always, I will see you on the next adventure.