How to Replace CEPH OSD inside Proxmox Cluster | Proxmox Home Server Series | Proxmox Home Lab

Video Statistics and Information

Captions
hello everyone and thank you so much for watching. This is me, Mr P, and welcome back to another episode of the Proxmox Home Server Series. In this video I will show you what you need to do to replace a drive inside your Ceph pool in a Proxmox cluster, so let's begin.

Each node has an additional drive attached to it, and that drive is used inside the Ceph pool. The Ceph pool consists of three drives: OSDs 0, 1 and 2. If I click on Ceph inside any of the nodes, I can see that the Ceph pool is healthy, everything is great, all the managers and monitors have been detected, and everything is working. I also have one LXC container running on node one which is constantly pinging Cloudflare's DNS server.

So what do you do if one of the drives fails? If I click on Ceph and then choose OSDs, let's say my node one OSD drive has started to misbehave. A couple of days ago this happened on my main Proxmox cluster, where one of the two-terabyte SSD drives started to show very high apply/commit latency. The other two drives were showing between 5 and 7 milliseconds, while the drive I had to replace was showing 500 milliseconds. That was an indication that the drive was struggling to apply and commit reads and writes, so I had to replace it. We are going to do the same thing here and replace the node one OSD drive.

The first thing you need to do to start replacing this drive is to instruct Ceph to migrate the data off it. I have one LXC container running, and if I click on Resources I can see the container's root disk is located in Ceph and is 8 GB in size. That means this disk is spread across the three OSD drives, so I need to instruct Ceph to migrate all the data from this drive onto the other two. To do that, I select the drive I want to remove and click Out. That instructs Ceph to migrate the data off this drive. If I go to Ceph right now, you can see the rebalancing happening; how long it takes depends on how much data you have on the drive. This one is already finished, as the disk is only 8 GB in size and only about 1.3 GB of it is actually used. The data has been migrated off osd.0, which belongs to node one, onto the other two. Again, how long this takes depends on how much data you have inside the Ceph pool.

Once I can see the data has been migrated, the Ceph pool is still showing as healthy even though I instructed the data to be moved off one OSD drive. I click back into the OSD list, select the drive and press Stop; that basically disconnects this drive from the Ceph pool. Now if I click on Ceph, I get a warning about degraded data redundancy: one third of the pool is not working properly, everything is in yellow, and the list shows two OSDs in and one down and out. Ceph is telling me the pool is degraded and I need to replace this drive as soon as possible.

I go back into the OSD list, select the drive and under More I choose Destroy. That not only stops the OSD and deletes the data on the drive, it completely removes the drive from the Ceph pool. You need to do that, otherwise you will get stuck with a drive that still shows up in the list but no longer exists. So More, Destroy, make sure the cleanup tick box is checked and click Remove. Ceph is now destroying this drive and making sure it is definitely out.
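Everything above is done through the Proxmox web UI, but the same steps can also be performed from the shell of any cluster node. The lines below are a rough sketch of the equivalent commands, assuming the failing OSD is osd.0 as in this example; adjust the ID to match your own pool.

    # check apply/commit latency to spot the struggling OSD
    ceph osd perf

    # tell Ceph to migrate the data off the failing OSD (same as clicking Out)
    ceph osd out osd.0

    # watch the rebalance until the cluster reports healthy again
    ceph -s

    # optional: confirm the OSD can be removed without losing redundancy
    ceph osd safe-to-destroy osd.0

    # stop the OSD daemon (same as clicking Stop)
    systemctl stop ceph-osd@0

    # destroy the OSD and wipe the disk (same as More -> Destroy with cleanup ticked)
    pveceph osd destroy 0 --cleanup 1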
As you can see, pve3 and pve2 still contain OSDs, while pve1 has no drive attached anymore. If I click back on Ceph, it is now complaining even more, because the OSD count is two while my default size is supposed to be three, so I need to replace this drive.

First I need to shut down the node. I select the LXC container, right-click and migrate it to pve2. Because this is an LXC container, Proxmox HA first has to shut it down, then migrate it and start it again, and because the container disk is very small, all of that happens in just a couple of seconds. If I click on the console I need to log in again, since the container was restarted, and it continues pinging my Cloudflare DNS server. So now pve1 is ready to be shut down. The same applies to you: even though I have all of this virtualised, you simply shut down the node and replace the drive. I select node one, open the shell, type shutdown now and press Enter, and now I wait for the node to be completely off so the drive can be replaced.

Because node one is down, I need to access my Proxmox cluster via node two, so I go to node two and log into the cluster with my username and password; let me make the window a bit bigger. So pve3 and pve2 are active and pve1 is offline. The drive is now being replaced, and then I start the node again. The node is starting, so any second this red icon should change into a green one; let's wait a moment for that to happen. And here we are, node one is back up and running. If I expand it, click on node one and scroll down to Disks, I can see that the new disk has been attached.

If I click on Ceph, it is still complaining that things are not working properly, and it is even telling me that a clock skew was detected on nodes pve2 and pve3. That is simply because one of the nodes just went offline, so you need to wait a couple of seconds for everything to go green. As you can see, while I was talking they all got synced, so the nodes can see each other and they are all in sync, but the OSDs are still down by one, so I need to add the new drive.

I click on node one, expand Ceph, click on OSD and then click Create: OSD. This automatically detects the free drive, and I just click Create. What happens now is that Ceph takes the fresh drive, sets everything up and includes it in the Ceph pool. Give it a second or so and it should show up; if it doesn't, just click the Reload button. And here we are, osd.0 is showing up in the list. Right now it says in but down, because Ceph is still preparing the drive; it is not so much powering it up as writing all the configuration needed to properly include it in the Ceph pool. After a couple of seconds, click Reload again and here we are, back up and running.

So the drive is in, and as you can see its apply/commit latency is 59 while the others are around five. That is fine, because this is a new drive and a lot of data is being pushed to it from the other two right now, so this number will be a bit higher for a while. If I reload, you can see it now shows basically no latency at all and all three look exactly the same. And now if I click on Ceph, I can see my Ceph pool is healthy and everything is working fine. If I open node two and click on the LXC container's console, it is still pinging Cloudflare.
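For reference, the migration, shutdown and OSD creation can also be done from the command line. This is only a sketch: the container ID (100) and the device path (/dev/sdb) are placeholders, so substitute the IDs and device from your own setup.

    # move the running container off the node before shutting it down
    # (restart mode stops it, migrates it and starts it on the target node)
    pct migrate 100 pve2 --restart

    # on the node being emptied: power it off so the physical drive can be swapped
    shutdown now

    # after the node is back up with the new disk, create the replacement OSD on it
    pveceph osd create /dev/sdb

    # confirm the new OSD joins the pool and watch the backfill latency settle
    ceph osd tree
    ceph osd perf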
Now I can go back and migrate this LXC container to node one, if that is where it is supposed to be, or I can keep it on node two. I am just going to migrate it back to node one: select pve1 and migrate. Proxmox HA shuts down the LXC container and starts the migration; the migration finishes and Proxmox starts the container again a second later. Let me click on the console again, log in as root with the password, and this LXC container continues functioning exactly as it did before.

And that's it. My Ceph pool is all green, all happy, all healthy, the OSD drive has been replaced with a fresh new one, and all the OSDs are showing roughly the same latency. Thank you so much for watching, I hope you enjoyed this video. I would be very happy if you consider subscribing or clicking the like button. Let me know what you think about this content, and I'll see you in the next one. Goodbye.
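If you prefer the shell for the last step as well, the move back and the final health checks look roughly like this, again assuming the placeholder container ID 100:

    # migrate the container back to its original node
    pct migrate 100 pve1 --restart

    # final checks: the pool should report HEALTH_OK with all three OSDs up and in
    ceph -s
    ceph osd df tree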
Info
Channel: MRP
Views: 1,537
Keywords: Proxmox home server, proxmox home lab, proxmox home server series, ceph, ceph cluster, proxmox ceph, how to replace CEPH drive, how to replace ceph drive in a pool, proxmox ceph cluster
Id: 6L5SlMSGg1c
Length: 8min 59sec (539 seconds)
Published: Mon Mar 11 2024