All about POOLS | Proxmox + Ceph Hyperconverged Cluster Fancy Configurations for RBD

Captions
Hello everyone! Everyone's favorite cluster is back, and it's bigger than ever. These guys here are my $250 hyperconverged Proxmox and Ceph cluster, and they're getting a new best friend today: some spinning rust.

So today we're going to talk about some more advanced topics in the Proxmox and RADOS Block Device world, such as mixing solid-state drives with hard drives and changing the failure domain. This machine is the only node that has hard drives, so if we want to store data on hard drives, we have to accept that if this node goes down, we lose all of those pools. We're also going to look at SSD-accelerated pools for virtual machines using separate metadata and data pools. I'm still focusing on RBD, the RADOS Block Device, for this video; that's the underlying storage you use for virtual machines and containers. We're not going to talk about CephFS, the file system, yet, but I'll get to it eventually, I promise. So come along with me on this journey as we explore the ins and outs of Proxmox pools, metadata, data, and all of that fun stuff.

Since the last video, I reinstalled everything, so I'm now running Proxmox 7.3, and I added "big store" to the cluster; that's the HP MicroServer. I already have the three thin clients, which makes four nodes. I'm going to speed-run setting up the Ceph cluster; if you haven't watched my previous video, go watch that first, because it shows in detail how to set up Ceph. So we're back to roughly where we were at the end of the last video: I have my four nodes in a Proxmox cluster, I've installed Ceph on top (Ceph 17 this time, since it came out after my last video), and I've added all my OSDs. My three thin client nodes each have their 128 GB flash drive, so that's OSDs 0, 1, and 2, and big store has four hard drives of two terabytes each plus a roughly half-terabyte SSD. That's all I've done so far: I've added the OSDs and the manager, which is here, and we're going to use the manager today to play with things.

In the previous video, when I showed you how to create pools, we went to Proxmox -> Ceph -> Pools -> Create and made our pool with a size of three. But there's more to it than what Proxmox gives us, so instead of creating our pools in Proxmox, we're going to use the Ceph management GUI, because it has a lot more options. We see basically the same information here: I have four nodes running monitors and one running a manager (you don't need a manager on every node), and I'm not doing file systems. If we go to Pools, we see the built-in manager pool, and this is where we'll create pools on the Ceph side. Click Create and there are a lot more options: erasure coded or replicated, applications, and various other settings. This is how we're going to manage our pools when we're doing fancier things.

Now is probably a good time to introduce failure domains. You can think of the failure domain as the level at which Ceph will implement the redundancy we've requested. In a traditional RAID or ZFS system, you're probably used to building a single system and designing for the failure of individual drives. In Ceph, that corresponds to the OSD failure domain, since one OSD is usually equivalent to one drive. It tells Ceph to put the redundant data on separate OSDs, so a failed OSD doesn't cause us to lose data.
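As a point of reference (not shown in the video), you can see the hierarchy Ceph builds out of hosts and OSDs, along with each OSD's device class, using a couple of read-only commands from any node; the exact output will depend on your cluster. The failure domains discussed below are just levels in this same tree.

    # Show the CRUSH hierarchy: root -> hosts -> OSDs, with each OSD's device class (ssd/hdd)
    ceph osd tree
    # List the device classes Ceph has detected in this cluster
    ceph osd crush class ls
    # List the OSDs belonging to a particular device class, e.g. the SSDs
    ceph osd crush class ls-osd ssd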
The next level up from OSD, and Ceph's default, is host. This tells Ceph that we not only need to keep redundant data on separate drives, but on entirely separate hosts. Configuring this means we can lose an entire server, or more depending on our redundancy rules, and keep operating. That's quite powerful for high-availability systems, because now we can get highly available storage just as we would highly available virtual machines. If you build larger clusters or need more fine-grained failure domains, you're free to create more, such as a shared chassis, a rack, an aisle, and so on, but that's out of scope for this video.

The first thing we're going to do is add a rule that forces data to be stored only on SSDs. For whatever reason we want our VM disks on SSDs; maybe we want our boot drives on SSDs and our data drives on hard drives, whatever. So we create a pool and give it whatever name we want. The pool type is going to be replicated, the same as we were doing before, with a replicated size of three. This gives us the same configuration we had in Proxmox in the last video, which means all the data will be kept in three copies and a minimum of two of them will be required. Next we have to add an application, and our application is usually going to be RBD, the RADOS Block Device; that's what we use for virtual machine disks, anything that emulates a hard drive. Then, instead of just using the default replicated rule, we create a new CRUSH rule, which we'll call the "replicated SSD rule". The failure domain can be either OSD or host, and for the device class we can either let Ceph decide or say SSD. With this rule in place, whatever we put on a pool using it must be stored on separate hosts and must be stored on SSDs. There we go, and create pool.

So now we have the pool, a replicated pool for RBD, and we come over to Proxmox and add it as storage: Datacenter -> Storage -> Add -> RBD. It already found it, since it's the only one here, ProxLabSSD, and we'll let it be used for disk images and containers. And away we go.
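For reference, the same CRUSH rule and pool can also be created from the command line. This is a minimal sketch of the rough equivalent of the dashboard steps above, not the video's exact commands; the rule name replicated_ssd is my own placeholder for the video's "replicated SSD rule", and the PG count is just a starting point for the autoscaler.

    # Replicated CRUSH rule: host failure domain, placement restricted to the ssd device class
    ceph osd crush rule create-replicated replicated_ssd default host ssd
    # Create the pool (32 PGs as a starting point) using that rule
    ceph osd pool create ProxLabSSD 32 32 replicated replicated_ssd
    # Keep three copies, and require at least two to be available for I/O
    ceph osd pool set ProxLabSSD size 3
    ceph osd pool set ProxLabSSD min_size 2
    # Tag the pool for RBD so the dashboard and tooling know what it's used for
    ceph osd pool application enable ProxLabSSD rbd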
You might have noticed that our storage efficiency isn't great so far. Storage efficiency is the percentage of our raw disk capacity that we're actually able to use. With our three-times-replicated pools we keep three copies of all of our data, which is fantastic for redundancy, but unfortunately it means our storage efficiency is only 33 percent. You might compare this to a RAID 1 with three disks in a mirror. Thankfully, Ceph has a solution to this: erasure coding. You can think of it similarly to RAID 5 and the higher RAID levels, although it's much more flexible in Ceph. With erasure coding we take the original data block and split it into multiple shards, then, using an algorithm known as Reed-Solomon coding, we compute additional shards containing error-correcting information. As long as we still have the original number of shards in good condition, in any combination of original and error-correcting shards, we can recompute the entire original data block. The number of shards the data is broken into is known as K, and the number of additional error-correcting shards is known as M. The total storage efficiency is K / (K + M). We require at least K + M separate nodes in our chosen failure domain, and we can tolerate the failure of up to M of those nodes. In larger clusters we can use very wide erasure codes, such as 10+4, to get good storage efficiency and still tolerate the failure of four nodes; in small clusters we can use the minimum code of 2+1 to get 66 percent storage efficiency and the ability to tolerate the failure of a single node.

So now we're going to create an erasure coded pool. We'll call it ProxLabErasure and make it an erasure coded pool. We need to check EC overwrites if we're using either RBD or CephFS; if you're using RGW, the RADOS Gateway, then you don't need EC overwrites. Our application is going to be RBD again, and we're going to make a new erasure code profile. If I want to create this on just my hard drives, the most I can do is a 3+1, which means I can tolerate one drive failure, or I could do a 2+2, which means I can tolerate two drive failures. I'm going to do a 3+1, because I like to live dangerously, so let's get on with it. For the failure domain I'm going to say OSD, because all of my hard drives are in the same node, and for the device class we'll say HDD, which means we have four OSDs and we're using a 2+1 erasure code. Let's go.

So what happens if we try to add this pool to Proxmox? We go to Datacenter -> Storage -> Add -> RBD, and it lets us pick it. But what happens if we try to use it? It's a trick question, but let's pick the same settings as the other one: an Ubuntu desktop VM, VirtIO block with discard. Yes, we get an error, and the error says, blah blah blah, "unable to create RBD: operation not supported". So why is that? If you go to the documentation, it says erasure coded pools do not support omap, so to use them with RBD you must instruct it to store the data in an erasure coded pool and the metadata in a replicated pool. This means using the erasure coded pool as the data pool.

So that's what we're going to do. Any pool will work as our metadata pool as long as it's replicated, and any pool will work as our data pool, with any settings, as long as it has allow EC overwrites turned on if it's erasure coded. So back where we had the error, we have to delete that storage, because it's not working, and add a new RBD storage, ProxLabErasure. But the pool I pick here in the GUI is the one that will store my metadata, not my data, so I'm going to pick the SSD pool. I already have VMs in that pool, and it's fine for these to share; I just need to pick a pool that is replicated. So we pick the SSD pool and add it. Now, if we were to use this as-is, it would still store our data on the SSD pool, which is not what we want, so we need to go into the config file and tell Proxmox where the data pool is, where it should actually put the data, which is our ProxLabErasure. Go to the shell and edit /etc/pve/storage.cfg. This file lists all of the storage we have in Proxmox: I have my NAS, which I use for ISO images, I have local, which comes in by default, I have ProxLabSSD, which we added, and I have ProxLabErasure, which we just added. Here we see the pool field is ProxLabSSD, and if we go back to the Ceph documentation, it told us we need a data pool, and the data pool needs to be our erasure coded pool, which in this case is named ProxLabErasure. So I'm going to copy that name, tab over, and add a new field to this entry: data-pool ProxLabErasure.
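The finished entry in /etc/pve/storage.cfg ends up looking roughly like this. The content and krbd lines are typical of what the GUI writes and may differ on your system; the important part is that pool points at the replicated metadata pool while data-pool points at the erasure coded pool.

    rbd: ProxLabErasure
            content images,rootdir
            krbd 0
            pool ProxLabSSD
            data-pool ProxLabErasure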
Save the file, and now every time we store something on the ProxLabErasure storage, the metadata about the existence of the volume and so on goes to our SSD pool, but the actual blocks of data get stored in our erasure coded pool. So if I try to create that VM 301 again... looks like there were no errors.

I hope you enjoyed that walkthrough of fancy pools in Ceph, at least fancy for the RADOS Block Device. For the curious, I ran the full Ubuntu desktop installer on both of my virtual machines, one on the replicated pool and one on the erasure coded pool. Both came out to 11.4 gigabytes of VM storage space, which for the replicated pool meant three times that in raw capacity, and for the erasure coded pool, as expected, one and a half times that. I also looked at how much metadata it was using, and it was only two kilobytes, so you don't need much space in your metadata pool, at least not for RBD.

Next up in this series I'm planning to dive into CephFS, the file system, and man, is that a big topic, much more complicated than RADOS Block Device; that's why we did RBD first. There is one topic I know I'm going to get questions about, so I'll address it right away, and that is cache tiering. Yes, Ceph does support cache tiering of pools, and yes, you can use it with RBD, but the documentation suggests it's not a great use case and gives some reasons why, if you read it. So I'm going to leave cache tiering for a future video, probably on RADOS Gateway or CephFS. As always, don't forget to like and subscribe for all my future videos, especially that CephFS content I mentioned that's coming up eventually... someday... when I get around to it. I have a Discord server linked down below if you'd like to chat with me about any of this, I have a website, feel free to read it, and as always, I will see you on the next adventure.
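A small postscript, not in the video: if you want to check numbers like these on your own cluster, the per-pool stored-versus-raw usage and the per-image usage can be read with standard commands. The pool name here is the one from this setup.

    # Per-pool view: STORED is the logical data, USED is raw capacity after replication or erasure coding
    ceph df detail
    # RBD images are listed under the pool that holds their metadata,
    # even when their data lives in an erasure coded data pool
    rbd ls --pool ProxLabSSD
    rbd du --pool ProxLabSSD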
Info
Channel: apalrd's adventures
Views: 24,040
Id: nyhIqewyDBk
Length: 14min 14sec (854 seconds)
Published: Thu Dec 29 2022