Hello everyone! Everyone's favorite cluster is
back and it's bigger than ever! These guys here are my $250 hyper-converged Proxmox and Ceph cluster, and they're getting a new best friend today: some spinning rust. So today we are going to talk about a number of more advanced topics in the Proxmox and RADOS Block Device world, such as mixing solid-state drives with hard drives and changing the failure domain. This thing is the only node that has hard drives, so if you want to store data on hard drives, we're going to have to tolerate the fact that if this node goes down, we lose those pools. We're also going to talk about SSD-accelerated pools for virtual machines using separate metadata and data pools. I'm still just focusing on RBD, or RADOS Block Device, for this video; that's the underlying storage you use for virtual machines and containers. We're not going to talk about CephFS, the file system, yet, but I'll get to it eventually, I promise. So come along with me on this journey as we explore the ins and outs of Proxmox pools, metadata, data, all of that fun stuff.

Since the last video, I reinstalled everything, so I'm now running Proxmox 7.3, and I added big store to the cluster; that is the HP MicroServer. I already have the three thin clients, so that makes four. I am going to speed-run setting up the Ceph cluster.
If you haven't watched my previous video, go up there and watch it; I show in detail in that video how to set up Ceph. So we're back to roughly where we were at the end of the last video: I have my four nodes, they are in a Proxmox cluster, and I've installed Ceph on top. I'm using Ceph 17 this time, because it came out since my last video, and I've added all my OSDs. My three thin client nodes each have their 128 gig flash drive, so that's OSD 0, 1, and 2, and then my big store has four hard drives, which are two terabytes each, plus the almost-half-terabyte SSD. So that's all I've done so far: I've added the OSDs and I added the manager, which is here, and we're going to use the manager today to play with things.
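If you're following along from scratch, here is roughly what that speed-run boils down to with the pveceph commands; this is just a sketch of the steps from the previous video, and the network range and device path are placeholders for your own setup:

    pveceph install                      # on every node: installs the Ceph packages
    pveceph init --network 10.0.0.0/24   # once, on one node; placeholder cluster network
    pveceph mon create                   # on each node that should run a monitor
    pveceph mgr create                   # at least one manager daemon
    pveceph osd create /dev/sdX          # once per drive you want to turn into an OSD

To use the Ceph management GUI (the dashboard) like I do below, you also need the ceph-mgr-dashboard package installed and the module turned on with ceph mgr module enable dashboard.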
In the previous video, when I showed you how to create pools, we came here: we went to Proxmox -> Ceph -> Pools -> Create and created our pool with a size of three. But there's more to it than just what Proxmox gives us, so instead of creating our pools in Proxmox, we're going to use the Ceph management GUI instead, because it has a lot more options.
We see basically the same information here: I have four nodes running monitors and one running a manager; you don't need a manager on every node. I'm not doing file systems yet. If you go to Pools, we have the manager pool, and this is where we're going to go on the Ceph side to create pools. We can click Create and we see a lot more options: we can say erasure coded or replicated, we can add applications, we can choose different settings. This is how we're going to manage our pools now that we're doing fancier things.
Now is probably a good time to introduce failure domains. You can think of the failure domain as the level at which Ceph will implement the redundancy we've requested. In a traditional RAID or ZFS system, you're probably used to building a single system and designing for the failure of individual drives; in Ceph, that corresponds to the OSD failure domain, since one OSD is usually equivalent to one drive. This tells Ceph that it needs to put the redundant data on separate OSDs, so a failed OSD doesn't cause us to lose data. The next level up from OSD, and Ceph's default, is host. This tells Ceph that we not only need to keep redundant data on separate drives, but on entirely separate hosts. Configuring this means we can lose an entire server, or more depending on what our redundancy rules are, and continue operating. This is quite powerful for high-availability systems, because now we can have highly available storage just as we would have highly available virtual machines. If you build larger clusters or need more fine-grained failure domains, you're free to create more, such as a shared chassis, a rack, an aisle, etc., but that's out of the scope of this video.
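If you want to see the hierarchy Ceph is actually working with, the CRUSH map of hosts and OSDs that those failure domains refer to, you can dump it from any node; the output will obviously reflect your own cluster:

    ceph osd tree             # shows the CRUSH hierarchy: root -> host -> osd, plus each OSD's device class (hdd/ssd)
    ceph osd crush rule ls    # lists the CRUSH rules currently defined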
So the first thing we're going to do is add a rule that forces data to be stored only on SSDs. For whatever reason we want to store our VM disks on SSDs; maybe we want our boot drives on SSDs and our data drives on hard drives, whatever. So we create a pool and give it a name, whatever we want (I'm calling mine ProxLabSSD). Pool type is going to be replicated, which is the same as we were doing before, with a replicated size of three. This gives us the same configuration we had in Proxmox in the last video, which means three copies of all the data will be kept, and a minimum of two of them must be available for the pool to keep serving I/O. Next we have to add an application, and our application is usually going to be RBD, which is RADOS Block Device; that is what we use for virtual machine disks, anything that emulates a hard drive. Then we're going to create a new rule here, instead of just using the replicated rule, and we'll call this the replicated SSD rule. Failure domain can be either OSD or host, and for device class we could let Ceph decide, or we can say SSD. With this rule in place, whatever we put on this pool using this rule must be stored on separate hosts and must be stored on SSDs. There we go: Create Pool. So now we have a pool being created; it's replicated, for RBD.
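For the command-line inclined, the same thing can be sketched roughly like this from any node; the pool and rule names are just the ones I'm using here, and the PG count is a placeholder (the autoscaler can manage it for you):

    # CRUSH rule: replicate across hosts, but only onto OSDs whose device class is ssd
    ceph osd crush rule create-replicated replicated_ssd_rule default host ssd
    # create the pool with that rule, set the usual 3/2 replication, and tag it for RBD
    ceph osd pool create ProxLabSSD 32 32 replicated replicated_ssd_rule
    ceph osd pool set ProxLabSSD size 3
    ceph osd pool set ProxLabSSD min_size 2
    ceph osd pool application enable ProxLabSSD rbd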
Then we come over into Proxmox and add that as storage: Datacenter -> Storage -> Add -> RBD. It already found it, because it's the only one here, ProxLabSSD, and we're going to let it be used for disk images and containers, and let's go.
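Adding it from the shell instead of the GUI would look roughly like this; pvesm is Proxmox's storage manager, and the storage ID is just whatever label you want Proxmox to show:

    # register the Ceph pool as Proxmox storage for VM disks (images) and containers (rootdir)
    pvesm add rbd ProxLabSSD --pool ProxLabSSD --content images,rootdir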
Now, you might have noticed that our storage efficiency isn't great so far. Storage efficiency is the percentage of our raw disk capacity that we are actually able to use. With our three-times-replicated pools we are keeping three copies of all of our data; this is fantastic for redundancy, but unfortunately it means our storage efficiency is only 33 percent. You might compare this to using RAID 1 with three disks in a mirror. Thankfully, Ceph has a solution to this: erasure coding.
You can think of it similarly to RAID 5 and the higher RAID levels, although it's much more flexible in Ceph. With erasure coding, we take the original data block and split it up into multiple shards; then, using an algorithm known as Reed-Solomon coding, we compute additional shards containing error-correcting information. As long as we still have the original number of shards in good condition, in any combination of original shards and error-correcting shards, we can recompute the entire original data block. The number of shards the data is broken into is known as K, and the number of additional error-correcting shards is known as M. The total storage efficiency is K / (K + M). We require at least K + M separate nodes in our chosen failure domain, and we can tolerate the failure of up to M of those nodes. In larger clusters we can use very wide erasure codes, such as 10+4, to get good storage efficiency and still tolerate the failure of four nodes. In small clusters we can use the minimum code of 2+1 to get 66 percent storage efficiency and the ability to tolerate the failure of a single node.
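To make that formula concrete, here are the numbers for the codes just mentioned:

    2+1:  efficiency = 2 / (2 + 1)  = 66%, needs at least 3 nodes in the failure domain, tolerates 1 failure
    10+4: efficiency = 10 / (10 + 4) = 71%, needs at least 14 nodes, tolerates 4 failures
    and for comparison, 3x replication: 1 / 3 = 33%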
So we're going to create an erasure coded pool now. We're going to call it ProxLabErasure and make it an erasure coded pool. We need to check EC Overwrites if we're using either RBD or CephFS; if you're using RGW, the RADOS Gateway, then you don't need EC overwrites. Our application is going to be RBD again, and we're going to make a new profile. If I want to create this on just my hard drives, the most I can do would be a three plus one, which means I can tolerate one drive failure, or I could do a two plus two, which means I can tolerate two drive failures. I'm going to do a three plus one, because I like to live dangerously, so let's get on with this. Failure domain I'm going to say is OSD, because all of my hard drives are in the same node, and device class we're going to say HDD, which means we have four OSDs and we're using a two plus one erasure code. Let's go.
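The CLI version of that profile and pool looks roughly like this; the k and m values below are just the 3+1 example discussed above, and the profile and pool names are mine, so swap in whatever you actually chose:

    # erasure code profile: shards spread across OSDs (not hosts), restricted to HDD-class OSDs
    ceph osd erasure-code-profile set proxlab_hdd_profile k=3 m=1 crush-failure-domain=osd crush-device-class=hdd
    # create the pool with that profile, allow partial overwrites (needed for RBD/CephFS), tag it for RBD
    ceph osd pool create ProxLabErasure 32 32 erasure proxlab_hdd_profile
    ceph osd pool set ProxLabErasure allow_ec_overwrites true
    ceph osd pool application enable ProxLabErasure rbd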
So what happens if we try to add this guy into Proxmox? We go to Datacenter -> Storage -> Add -> RBD, and well, it lets us pick it. What happens if we try to use it? It's a trick question, but you'll see why I'm asking: pick the same settings as the other one, Ubuntu desktop, VirtIO block with discard, yes... and we got an error, and the error says, blah blah blah, unable to create RBD: operation not supported. So why is that?
If you go to the documentation, it says erasure coded pools do not support omap, so to use them with RBD you must instruct them to store their data in an erasure coded pool and their metadata in a replicated pool; this means using the erasure coded pool as the data pool. So that's what we're going to do. Any pool will work as our metadata pool, it just has to be replicated, and any pool will work as our data pool, with any settings, as long as it has allow_ec_overwrites turned on if it's erasure coded.
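This split is the same thing the rbd command line exposes directly; as a sketch, creating an image by hand with its metadata in the replicated pool and its data in the EC pool would look like this (the image name is just an example):

    # image header and metadata (omap) live in ProxLabSSD; the data objects land in ProxLabErasure
    rbd create --size 10G --data-pool ProxLabErasure ProxLabSSD/test-image
    rbd info ProxLabSSD/test-image    # shows the data_pool the image is using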
So if I go back here where we had the error, well, we've got to delete it, because it's not working. We'll add a new one for RBD, ProxLabErasure, but I need to pick a pool here that is actually going to store the metadata, because whatever I pick here in the GUI is what's going to store my metadata, not my data. So I'm going to pick the SSD pool; I already have VMs in that pool, and it's fine for these to share, I just need to pick a pool that is replicated. So we're going to pick the SSD pool and add it. Now, if we were to just use this as is, it would still be storing our data on the SSD pool, which is not what we want, so we need to go into the config file and tell Proxmox where the data pool is, where it should actually put the data, which is our ProxLabErasure pool. So go to the shell, and we're going to edit a file: /etc/pve/storage.cfg. This file lists all of the storage we have in Proxmox, so I have my NAS, which I use for ISO images, I have local, which comes in by default, I have ProxLabSSD, which we added, and I have ProxLabErasure, which we added.
Here we see the pool field is ProxLabSSD, and if we go back to the Ceph documentation, it told us we need to have a data pool, and the data pool needs to be our erasure coded pool, which in this case is named ProxLabErasure. So I'm going to copy that name, and we're going to add a new field to this file: tab over and say data-pool ProxLabErasure, and save it. Now, every time we store something on ProxLabErasure, it's going to store the metadata about the existence of the volume and so on in our SSD pool, but the actual blocks of data are going to get stored in our erasure coded pool.
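For reference, the finished entry in /etc/pve/storage.cfg ends up looking roughly like this; the content line reflects whatever you ticked in the GUI, and any other fields Proxmox wrote for you should be left alone:

    rbd: ProxLabErasure
            content images,rootdir
            pool ProxLabSSD
            data-pool ProxLabErasure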
So if I try to create that VM 301 again... looks like there were no errors. I hope you guys enjoyed that walkthrough of fancy pools in Ceph, at least fancy for RADOS Block Device. For the curious, I ran through the full Ubuntu desktop installer on both of my virtual machines, the one on the replicated pool and the one on the erasure coded pool. Both of them came out to 11.4 gigabytes of VM storage space, which for the replicated pool meant it took three times that, and for the erasure coded pool, as expected, it took one and a half times that. I also looked at how much metadata it was using, and the metadata was only two kilobytes, so you don't need much space in your metadata pool, at least not for RBD.
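If you want to check those numbers on your own cluster, the stored-versus-raw-used figures are visible from the command line; a quick sketch:

    ceph df detail         # per-pool STORED vs USED shows the replication / erasure coding overhead
    rbd du -p ProxLabSSD   # provisioned vs actual usage of the RBD images registered in a pool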
Next up in this series, I'm planning on diving into CephFS, the file system, and man, is that a big topic, much more complicated than RADOS Block Device; that's why we did RBD first. There is one topic I know I'm going to get questions about, so I'm going to address it right away, and that is cache tiering. Yes, Ceph does support cache tiering of pools, and yes, you can use it for RBD, but the documentation suggests that it's not a great use case, and it gives some reasons why if you read it. So I'm going to leave cache tiering for a future video, probably on RADOS Gateway or CephFS. As always, don't forget to like and subscribe for all my future videos, especially that CephFS content I've mentioned that's coming up eventually... someday... when I get around to it. I have a Discord server down below, which you can find if you'd like to chat with me about any of this. I have a website here too, feel free to read it, and as always, I will see you on the next adventure.