$250 Proxmox Cluster gets HYPER-CONVERGED with Ceph! Basic Ceph, RADOS, and RBD for Proxmox VMs

Video Statistics and Information

Captions
So you guys loved my hundred-dollar, three-node Proxmox cluster, these Dell Wyse 5060 thin clients, but each of them only has a 16 GB SSD built in, and at the time they only had 4 GB of RAM. Today we're going to fix both of those problems, the low RAM and the storage cost, and we're going to do it using 128 GB flash drives. And just for the clickbait: it's hyper-converged! It's all the new rage in the enterprise, and we're really going to do it, we're going to hyper-converge these guys using Ceph, the cluster file system. So let's head off on this adventure and see what we can learn about Ceph.

It's true, I did pay a hundred dollars for these three thin clients; they were $35 each with a 10% discount for buying in bulk. But each of them only has a 16 GB SSD built in, and at the time they only had 4 GB of RAM, so it wasn't super useful as a cluster: you could run maybe one virtual machine per node, and with 16 GB of storage you weren't going to fit much at all once you install Proxmox itself. For that video I used shared storage on a NAS, and I didn't include that in my price; it's in the room behind us and it costs about a thousand dollars, hard drive prices being what they are these days. So today we're going to fix both of those problems, the low RAM and the storage cost. These little suckers right here: 128 GB for like 15 bucks, not a bad deal. I mean, they're pretty slow, they're USB 3, and they're probably not reliable either, but hey, it hits our price target. And as you saw in another video, I upgraded the RAM on these guys to 12 GB each. I spent $75 on the RAM upgrade and $45 on the flash drives, so we're going to call this the $250 cluster.

Now, Ceph is a really big topic, like it's really, really big, it literally scales out into petabytes. We're not going to scale into petabytes today. Today we're limiting our focus to Proxmox's implementation of Ceph, and using Ceph to store the virtual machine images for this three-node Proxmox cluster, with a three-node Ceph cluster running on the same hardware. Of course, I did this for 250 bucks using this junk; you could do it with nicer hardware and follow the same steps I did.

So I've got the three nodes again, and we're going to cluster them as we did in the last video, so I'm going to create the cluster. Okay, now these nodes are clustered, so we can install Ceph. I'm going to install Ceph on the first node here. If we click on Ceph, it says "Ceph is not installed on this node, would you like to install it now?" We're going to say yes. Because we don't have a Ceph cluster at all yet, this wizard is going to walk us through configuring the cluster on this Proxmox cluster. We clearly don't have anything installed, so we're going to choose Pacific, which is the latest version, and start that. It's going to install Ceph, then let us configure our cluster, and once we've configured it, it'll copy that configuration to everything else. Okay, now this is done, so we click Next.

Now it's going to let us configure the network. Ceph uses two networks, what it calls the public network and the cluster (private) network. The cluster network is the network Ceph uses to communicate with itself, so if Ceph has to rebalance data across the cluster, it uses the cluster network. The public network is not necessarily public; it's the network that Ceph clients use to access data in the Ceph system. If you have a Proxmox cluster, Proxmox itself is considered a client of Ceph, so Proxmox goes across the public network to access data on Ceph, and when an OSD gets a request for data that requires replication, it might go to the other nodes via the cluster network for replication purposes.
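If you'd rather do this step from a shell than the wizard, Proxmox ships a pveceph tool that does the same things. This is a minimal sketch, assuming a single 192.168.1.0/24 subnet carrying both the public and cluster traffic (which is how a one-NIC thin client would be set up); substitute your own subnet and check the pveceph man page for the exact option names on your Proxmox version:

# install the Ceph packages on this node (same as the GUI wizard)
pveceph install

# write the cluster-wide ceph.conf; with one NIC, both networks point at the same subnet
pveceph init --network 192.168.1.0/24 --cluster-network 192.168.1.0/24

# create the first monitor on this node
pveceph mon create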
In this case we only have one network, so we're going to set them both the same, and we have a couple of options here. Next we need to choose the monitor node. In Ceph there are two primary daemons we deal with: the monitor and the OSD. The monitor stores the global state of the cluster, called the cluster map, and the monitor is redundant: you can run as many copies as you want, but in general you should run three. If you have a bigger cluster you can run more, but you don't need more than three. Proxmox is going to set it up on pve1 by default, and we'll add more later. This should configure our initial cluster with one monitor and no drives.

Now, before I go and add drives, I'm going to install Ceph on all of the other nodes. So I'll click Ceph on pve2 and install it, and kick that off on pve3 as well. Okay, it says Pacific installed successfully, so we click Next, and it says the configuration is already initialized; because we already configured our Ceph cluster, Proxmox just imported that, so we're done. Now we need to install the monitor daemon on pve2, and, once pve3 is done installing, on pve3 as well. So go here, click Monitor, and create a monitor on pve2, and Proxmox will do that. Let's see, pve3 is done installing, so we can create a monitor on pve3 as well. Now we have monitors on all three nodes. The Ceph monitors are separate from Proxmox's own cluster system: once you cluster with Proxmox, the Ceph cluster is running on the same hardware, but it's its own cluster with its own cluster management and its own synchronization. The monitors are how Ceph handles that synchronization, and they sort themselves out on their own; you don't have to worry about quorum or anything like that as long as you have at least three monitors. The manager is a little bit different: it essentially provides statistics, the GUI, and things like that for Ceph, so the manager is not critical. You can create more than one if you want, but you don't have to.

Ceph also has its own dashboard GUI, and if we like we can install that. I'm going to install it on pve1 with apt install ceph-mgr-dashboard. Then we have to enable the dashboard with ceph mgr module enable dashboard, and add a self-signed certificate; Ceph has a tool for that, ceph dashboard create-self-signed-cert. There we go. Next, Ceph wants us to create an account in its own user system, so ceph dashboard ac-user-create, and we'll call the user admin. Okay, it needs a file with the password, so we'll create a file temporarily... "password is too weak", ah, there we go, it actually enforces password complexity, which is great for you guys but not great for me trying to test this. Finally, now that we have created an admin account and a self-signed certificate, we have to disable and re-enable the dashboard. Okay, here is what Ceph's dashboard looks like: we have a health warning because we have no OSDs, and we have no capacity, no objects, and no PGs, because we have no disks yet.
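For reference, the dashboard setup narrated above boils down to a handful of commands. A sketch assuming the Pacific packages, a user named admin, and the password sitting temporarily in a file called password.txt (the user name, password, and file name are just placeholders for the example):

# the dashboard is a module of the Ceph manager
apt install ceph-mgr-dashboard

# enable the module and generate a self-signed TLS certificate
ceph mgr module enable dashboard
ceph dashboard create-self-signed-cert

# create an administrator account; recent releases only accept the password from a file
echo 'SomeStrongPassword1!' > password.txt
ceph dashboard ac-user-create admin -i password.txt administrator
rm password.txt

# bounce the module so it picks up the certificate
ceph mgr module disable dashboard
ceph mgr module enable dashboard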
That concludes installing Ceph; now we have to get our disks set up. If you're familiar with ZFS, you know there are a lot of different ways you can configure disks; in Ceph, some of that is not done at the disk level at all. The simplest form at the disk level is a single-disk OSD, which stands for Object Storage Daemon, and we're going to set one of those up on pve1. I'm going to go into the shell on pve1 and find out what drives we have. If we look in /dev for the disk devices, we have sda with partitions 1, 2, and 3, and sdb, and I happen to know sdb is the right one. If you want to make sure, you can do ls -l on /dev/disk/by-id: now we see an ATA 16 GB SATA flash module, which is the internal storage on the Dell Wyse 5060, and it maps to sda, with its three partitions mapping to sda1, 2, and 3. Then there's some other stuff on sdb, probably a previous install of something, but it's the USB SanDisk 3.2 drive, so that's sdb. Now we're going to get rid of everything that was on that disk to make sure it's completely wiped, so you don't have any problems going forward: ceph-volume lvm zap /dev/sdb --destroy. There we go, zapping successful. You'll want to zap all of the drives you're going to use, on all of your systems. Okay, so we've zapped all the drives.
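From the shell, the drive identification and cleanup look roughly like this; a sketch assuming the internal SSD is /dev/sda and the USB flash drive you're giving to Ceph is /dev/sdb, so double-check the device names on each node before zapping anything:

# map model/serial names to /dev/sdX so you wipe the right disk
ls -l /dev/disk/by-id/

# destroy old partitions, LVM metadata, and filesystem signatures on the Ceph disk
ceph-volume lvm zap /dev/sdb --destroy

Run the zap for every disk you plan to hand to Ceph, on every node.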
Now we can go to pve1, scroll down to Ceph, click OSD, and create an OSD. You'll have to do this for every single drive in your cluster, individually. First off, the disk: this is where the actual data is stored. In this case the only free disk it found is /dev/sdb on pve1, at 123.06 GB, so that's the disk we're going to use, our 128 GB flash drive. There are two other disks you could potentially assign, the DB disk and the WAL disk. So what are the DB and the WAL? If you're familiar with ZFS, ZFS has a special device type called a SLOG, which is used for synchronous writes: it stores the ZFS intent log so synchronous writes can complete with lower latency, but it does not improve the throughput of the pool. A similar thing exists in Ceph, except instead of the pool as a whole managing synchronous writes, each individual OSD is responsible for synchronizing its own writes. So if you're using slow spinning rust for your data disk and you care a lot about synchronous write performance, you could take a flash device, even a small one, and allocate it as the WAL disk, and that would help speed up synchronous writes.

The DB disk is kind of similar. In Ceph there's no central database of where something is stored; instead that is computed from the pool information and a thing called the crush map, which the monitors store and distribute to all the clients. The crush map contains a list of all of the OSDs in the system, so every single drive in the entire cluster is in the crush map, and from the crush map a client can compute which drives in the system should store its data (the unit of placement is called a placement group). But then the drive has to figure out where that data lives on the actual disk, so each OSD keeps a database of where data is stored on the drive itself: clients know which drive to go to based on the crush map, but the drive has to figure out which blocks to put the data on. If you have systems that mix SSDs and hard drives, you can either use the SSDs as separate OSDs entirely, or use them as DB and WAL disks; you're allowed to partition them and use a single SSD as the DB or WAL for more than one hard drive if you want. The rule of thumb is that the DB disk should be between two and four percent of the size of the data disk: if your data disk were, say, 10 terabytes, you would want around 400 gigabytes available for the DB disk, depending on your use case, and whether you're using RBD or CephFS changes the amount of metadata a little bit. The WAL disk does not have to be very big, just several gigabytes.

In this case I'm going to keep everything on the OSD data disk, because I don't have anything faster to store the DB or the WAL on. Next is the device class. This is just a keyword that's added to the crush map, so you can say it's an HDD, SSD, or NVMe, and if you manage your own Ceph cluster you can add your own classes too, but usually you just pick one of these. Later on we have the option of saying that a certain pool should always be stored on a certain device class, and that can be used to speed up access to certain types of data without having to create more than one Ceph cluster: you could keep your VM disks on SSD or NVMe while your large bulk file storage lives on HDDs, or something like that. In this case, because it's a USB drive, I'm going to call it an HDD. Again, there's a warning here: don't use hardware RAID. Pass every single disk through to the host and create a separate OSD for each one; unless you're using something like an SSD as a DB or WAL disk, every single disk should have its own OSD.

So we create that. Now we have one OSD: it's named osd.0, its class is hdd, it is up, and it is in. We still don't have enough OSDs to make a full cluster, so let's go to the other nodes and do the same thing. Now we have all three OSDs. By default Ceph names them osd followed by a number, and it always chooses the lowest available number: if you're replacing a drive, you delete the existing OSD, which leaves a hole where the number used to be, and when you add the new drive back in it claims the number from the previous drive, which disturbs the crush map the least. You'll notice this view is a bit of a tree: pve3 has osd.2, pve2 has osd.1, pve1 has osd.0. Ceph has a concept of redundancy groups, and by default the hierarchy is that the OSD is the lowest level of redundancy, above that is the host, and there's nothing configured above that. If you like, you can define your own levels: a group for each rack, a group for each aisle, a group for each data center, and that builds a hierarchy of which data center, aisle, and rack each OSD is in. Then you can say how you want data distributed across those failure domains. By default, Ceph is configured with two failure domains, the OSD level and the node level, and it wants replication across the node level, which means we can deal with the failure of any one node and this cluster will still be fine. If you have a whole bunch of storage on one node and not a lot on the others, it might have a hard time allocating placement groups, because it can't fulfill the rule that data must be kept on separate nodes; you have the option of changing that, but I'm not going to do that today.

Now, if we come back to the dashboard, we see we have three OSDs and one placement group, everything is happy, and we have 343.8 GB total in the system, which is great, and we're not doing anything special. Proxmox has its own GUI here as well if you view Ceph: three OSDs in and up, all PGs active and clean, two managers, and no metadata servers; metadata servers are only used for CephFS, and today we're only going to do a RADOS block device (RBD). There we go, we have added three OSDs, which is the minimum for a cluster.
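If you prefer the CLI to the web UI, the OSDs can also be created with pveceph. A sketch, assuming /dev/sdb is the freshly zapped data disk; the --crush-device-class, --db_dev, and --wal_dev options are the ones I'd reach for to tag the class or offload metadata onto a faster device, but verify the exact option names on your pveceph version:

# plain single-disk OSD, tagged as an HDD in the crush map
pveceph osd create /dev/sdb --crush-device-class hdd

# hypothetical variant with a separate SSD partition holding the DB (roughly 2-4% of the data disk)
# pveceph osd create /dev/sdb --crush-device-class hdd --db_dev /dev/sdc1

# inspect the resulting crush tree and per-OSD usage
ceph osd tree
ceph osd df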
Now we can learn about pools. By default we have one pool, which Ceph uses to store its own health metrics; that's not very useful to us. So what is a pool? If you're familiar with ZFS, you know you can arrange data in a hierarchy of datasets, and those datasets usually contain data like a file system. Ceph is kind of similar in that you have pools and pools contain data, but each pool can have a different rule for how its data should be stored. So unlike in ZFS, it's possible to create one pool whose data has triple redundancy, where everything is triplicated across the cluster, and at the same time create another pool with no redundancy, or with erasure coding, which is similar to RAID.

The first pool we're going to create is going to store VM images, and we're going to use replication, which means the data is stored multiple times across the cluster. So we go here and click Create. What do we want to call it? Let's call it ceph-replicate; this is going to become the name of the storage in Proxmox as well. A size of 3 tells Ceph to replicate all of the data three times, and a min size of 2 says we're allowed to operate on the data as long as at least two of the copies are valid. With a three-node cluster and a size of 3, all of the data will be copied to all of the nodes: if one node drops out, we still have two copies and the cluster is still happy; if a second node drops out, we're down to one, the pool becomes read-only, and it doesn't allow access anymore. This is usually a good setup for virtual machine images, a size of 3 and a min size of 2, and if you're going to use replicated pools I recommend you stick with these numbers. If you need more data security you can increase the size, but remember it's going to try to replicate across different hosts, so if you have a lot of drives in a small number of hosts, you might have to change the crush failure domain so Ceph is allowed to store copies on multiple OSDs within a host instead of requiring separate hosts. So we click Create, and it gets added as a storage to Proxmox.
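The same pool can also be created from the shell. A sketch assuming the pveceph wrapper and the pool name ceph-replicate used above; --add_storages is what asks Proxmox to register the pool as a VM disk storage, but confirm the option names on your version:

# replicated pool: three copies, stays writable as long as two copies are valid
pveceph pool create ceph-replicate --size 3 --min_size 2 --add_storages

# double-check the replication settings afterwards
ceph osd pool get ceph-replicate size
ceph osd pool get ceph-replicate min_size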
Now we should be able to install a virtual machine on the ceph-replicate pool. For simplicity, I'm going to add a Samba store where I keep ISOs, so I've added my NAS and there are a whole bunch of ISO images here. We're going to create a virtual machine, install it, and see what happens to Ceph as we do that. Here we get to choose our Ceph pool, so we'll use the replicated pool, give it 32 GB of disk space with SSD emulation and discard, our favorite flags, and, I don't know, two cores? Two cores seems fine. Okay, so I have Ubuntu Jammy Jellyfish installing itself, and now we can watch what happens on the Ceph side. We've got 65 placement groups now, and a tiny amount of IO going on, writing some megabytes per second, not a lot. One of the downsides of Ceph is that it's not necessarily designed to be as fast as something like ZFS; it's designed to scale much more broadly. ZFS is a scale-up file system: you build a bigger and bigger server with a ton of resources that handles a ton of data very quickly. With Ceph you scale out: you add more servers, and by adding more servers you increase the throughput of the system, but a lot of data is still going over the network connection, and the fact that I have one-gigabit Ethernet here is really going to limit things.

If we look at the health page, we notice the install went so slowly that we actually got into a health warning state. Basically, what happened is that our drives, or our network, were so slow that Ceph wasn't able to replicate things as fast as the VM was writing data. It's trying to keep three copies of everything, but sometimes it gets into a state where it only has two; it allows the IO to complete as long as at least two copies are finished, and if the third copy takes a long time, it marks that data as degraded and has to recover it. So what can happen is that if one of your hard drives is very slow, or it fails, you'll get into a state where Ceph says it's degraded, but the data is still fine and active; it just doesn't have as many copies as the rules say it should, and as long as it's above the min size it'll still let the VMs operate.

Just to demonstrate that, I'm going to intentionally take one of the OSDs out. I'll choose pve2's OSD, so I run ceph osd out osd.1, because osd.1 is on pve2, and this simulates the OSD failing. Now we get a health warning; it says one PG is inactive, and we have two OSDs that are up and in and one that is up and out. Out means it is not part of the cluster. The cluster is going to try to figure out how it can recover, and because it lost this one drive, what it wants to do is move all of that data to other drives. But because we only have three drives and it wants three replicas, it sits in a persistent state where it doesn't have enough OSDs to replicate onto. Nonetheless, the VM continues operating. So let's bring that drive back in with ceph osd in osd.1. Now the OSD is back in, and Ceph has to rebalance all the data that got written while the OSD was gone. So if you have a three-node cluster and you temporarily lose a node, the cluster keeps running, but as soon as that node comes back up you're going to see a ton of IO across the cluster to rebalance, so that all the data regains the level of redundancy it should have and everything that should have been stored on that server while it was down gets caught up. In this case we didn't lose any data, and we didn't miss much replication either, because it was only out for a few seconds.

So we've just scratched the surface of what Ceph can do. Like I said earlier, Ceph is big, really, really big, built for big, big data, but that doesn't mean we can't use it in the home lab. We get some real advantages: high availability of the storage, redundancy configurable at the per-pool level, which is something I didn't do in this episode but I'm sure I'll touch on later, and nice integration in Proxmox; I'm sure there are many other advantages too. Ceph also has a wonderful file system, and between Ceph and ZFS you can pretty much cover all of your storage needs. So what's next for this cluster? Today I was only able to fit in a really basic replicated storage pool. I'd like to go into more detail on erasure coding and the different ways you can configure Ceph just for block devices, and I'd also like to get into CephFS, the file system, but I did not have time for that today; Ceph is a big, big thing, and it takes a lot of time to explain. So hopefully you'll like and subscribe and stay with me so we can do more of this project in the future. Thanks for watching, bye!
Info
Channel: apalrd's adventures
Views: 57,719
Id: Vd8GG9twjRU
Length: 22min 57sec (1377 seconds)
Published: Thu May 05 2022