OSD Replacement Procedures
Check which disks need to be put back in:
- To see which osds are down, check with `ceph osd tree down out`:
```
[09:28][root@p06253939p44623 (production:ceph/beesly/osd*24) ~]# ceph osd tree down out
 ID  CLASS WEIGHT     TYPE NAME                     STATUS REWEIGHT PRI-AFF
 -1        5589.18994 root default
 -2        4428.02979     room 0513-R-0050
 -6         917.25500         rack RA09
 -7         131.03999             host p06253939j03957
430           5.45999                 osd.430        down        0 1.00000
-19         131.03999             host p06253939s09190
 24           5.45999                 osd.24         down        0 1.00000
405           5.45999                 osd.405        down        0 1.00000
 -9         786.23901         rack RA13
-11         131.03999             host p06253939b84659
101           5.45999                 osd.101        down        0 1.00000
-32         131.03999             host p06253939u19068
577           5.45999                 osd.577        down        0 1.00000
-14         895.43903         rack RA17
-34         125.58000             host p06253939f99921
742           5.45999                 osd.742        down        0 1.00000
-22         125.58000             host p06253939h70655
646           5.45999                 osd.646        down        0 1.00000
659           5.45999                 osd.659        down        0 1.00000
718           5.45999                 osd.718        down        0 1.00000
-26         131.03999             host p06253939v20205
650           5.45999                 osd.650        down        0 1.00000
-33         131.03999             host p06253939w66726
362           5.45999                 osd.362        down        0 1.00000
654           5.45999                 osd.654        down        0 1.00000
```
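If you only want the ids of the down osds (for example to loop over them later), a small optional convenience one-liner is enough; it is not part of the procedure:

```bash
# Optional: print only the down osd names from the tree output
ceph osd tree down out | grep -oP 'osd\.[0-9]+'
```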
- Check the tickets for the machines in ServiceNow. The tickets that interest us are the ones named:
- If the repair service replaced the disk(s), it will be noted in the ticket, and you can continue with the next step.
On the OSD host:
LVM formatting using ceph-volume:
- Simple format: osd as logical volume of one disk
Below is sample output from listing the disks in LVM fashion. Notice that each osd has exactly one device (disk), and that these devices do not use any ssds for a performance boost.
(`ceph-volume lvm list` takes some time to complete.)
```
[13:55][root@p05972678e21448 (production:ceph/erin/osd*30) ~]# ceph-volume lvm list

====== osd.335 ======

  [block]    /dev/ceph-d0befafb-3e7c-4ffc-ab25-ef01c48e69ac/osd-block-c2fd8d2e-8a38-42b7-a03c-9285f2b973ba

      type                      block
      osd id                    335
      cluster fsid              eecca9ab-161c-474c-9521-0e5118612dbb
      cluster name              ceph
      osd fsid                  c2fd8d2e-8a38-42b7-a03c-9285f2b973ba
      encrypted                 0
      cephx lockbox secret
      block uuid                PXCHQW-4aXo-isAR-NdYU-3FQ2-E18Q-whJa92
      block device              /dev/ceph-d0befafb-3e7c-4ffc-ab25-ef01c48e69ac/osd-block-c2fd8d2e-8a38-42b7-a03c-9285f2b973ba
      vdo                       0
      crush device class        None
      devices                   /dev/sdw

====== osd.311 ======

  [block]    /dev/ceph-b7be4aa7-0c7f-4786-a214-66116420a2cc/osd-block-1bfad506-c450-4116-8ba5-ac356be87a9e

      type                      block
      osd id                    311
      cluster fsid              eecca9ab-161c-474c-9521-0e5118612dbb
      cluster name              ceph
      osd fsid                  1bfad506-c450-4116-8ba5-ac356be87a9e
      encrypted                 0
      cephx lockbox secret
      block uuid                O5fYcf-aGW8-NVWC-lr5G-BsuC-Yx3H-WZl24a
      block device              /dev/ceph-b7be4aa7-0c7f-4786-a214-66116420a2cc/osd-block-1bfad506-c450-4116-8ba5-ac356be87a9e
      vdo                       0
      crush device class        None
      devices                   /dev/sdt
```
This is an example of an osd that uses an ssd for its metadata. It has a db part in which the metadata is stored.
```
[14:04][root@p06253939e35392 (production:ceph/dwight/osd*24) ~]# ceph-volume lvm list

====== osd.29 ======

  [block]    /dev/ceph-f06dffa2-a9d8-47da-9af2-b4d4f9260557/osd-block-dff889e7-5db5-4c5e-9aab-151e8ad17b48

      type                      block
      osd id                    29
      cluster fsid              dd535a7e-4647-4bee-853d-f34112615f81
      cluster name              ceph
      osd fsid                  dff889e7-5db5-4c5e-9aab-151e8ad17b48
      db device                 /dev/sdac3
      encrypted                 0
      db uuid                   9762cd49-8f1c-4c29-88ca-ff78f6bdd35c
      cephx lockbox secret
      block uuid                HuzwbL-mVvi-Ubve-1C5D-fjeh-dmZq-ivNNnY
      block device              /dev/ceph-f06dffa2-a9d8-47da-9af2-b4d4f9260557/osd-block-dff889e7-5db5-4c5e-9aab-151e8ad17b48
      crush device class        None
      devices                   /dev/sdk

  [  db]     /dev/sdac3

      PARTUUID                  9762cd49-8f1c-4c29-88ca-ff78f6bdd35c

====== osd.88 ======

  [block]    /dev/ceph-14f3cd75-764b-4fa4-96a3-c2976d0ad0e5/osd-block-f19541f6-42b2-4612-a700-ec5ac8ed4558

      type                      block
      osd id                    88
      cluster fsid              dd535a7e-4647-4bee-853d-f34112615f81
      cluster name              ceph
      osd fsid                  f19541f6-42b2-4612-a700-ec5ac8ed4558
      db device                 /dev/sdab6
      encrypted                 0
      db uuid                   f0b652e1-0161-4583-a50b-45a0a2348e9a
      cephx lockbox secret
      block uuid                cHqcZG-wsON-P9Lw-4pTa-R1pd-GUwR-iqCMBg
      block device              /dev/ceph-14f3cd75-764b-4fa4-96a3-c2976d0ad0e5/osd-block-f19541f6-42b2-4612-a700-ec5ac8ed4558
      crush device class        None
      devices                   /dev/sdu

  [  db]     /dev/sdab6

      PARTUUID                  f0b652e1-0161-4583-a50b-45a0a2348e9a
```
One approach is to take an ssd and partition it simply, attaching each partition to one osd. If the ssd fails, all the osds that use it are rendered useless, so each of those osds has to be replaced. The ssd may also be formatted through LVM; in that case the metadata database part will look like this:
```
  [  db]     /dev/cephrocks/cache-b0bf5753-07b7-40eb-bf3a-cc6f35de1d85

      type                      db
      osd id                    220
      cluster fsid              e7681812-f2b2-41d1-9009-48b00e614153
      cluster name              ceph
      osd fsid                  81f9ed48-d27d-44b6-9ac0-f04799b5d0d5
      db device                 /dev/cephrocks/cache-b0bf5753-07b7-40eb-bf3a-cc6f35de1d85
      encrypted                 0
      db uuid                   wZnT18-ZcHh-jcKn-pHic-LCrL-Gpqh-Srq1VL
      cephx lockbox secret
      block uuid                z8CwSn-Iap5-3xDX-v0NA-9smI-EHx1-EGxdAR
      block device              /dev/ceph-6bb8b94b-4974-44b0-ae6c-667896807328/osd-a59a8661-c966-443b-9384-b2676a3d42d8
      vdo                       0
      crush device class        None
      devices                   /dev/md125
```
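To find every osd whose metadata database lives on a failed ssd, you can filter the `ceph-volume lvm list` output. A minimal sketch, assuming the failed ssd is `/dev/sdac` (adjust to your device) and that the listing is saved to `~/ceph-volume.out` as recommended below:

```bash
# Keep only the osd headers and the "db device" lines, then show the header
# immediately preceding any db device that sits on the failed ssd /dev/sdac.
ceph-volume lvm list > ~/ceph-volume.out
grep -e '== osd\.' -e 'db device' ~/ceph-volume.out | grep -B1 '/dev/sdac'
```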
Replacement procedure: one disk per osd
`ceph-volume lvm list` is slow, so save its output to `~/ceph-volume.out` and work with that file instead. (A consolidated sketch of the whole sequence is given after the step list.)
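For example, save the listing once at the start and grep the file afterwards (osd.335 here is just the osd from the example output above):

```bash
# Save the slow listing once, then work against the file
ceph-volume lvm list > ~/ceph-volume.out
# e.g. show the entry for a given osd id
grep -A 15 '== osd\.335 ==' ~/ceph-volume.out
```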
1. Check that the ssd device exists and that it has failed.
2. Check whether it is used as a metadata database for osds, or as a regular osd.
    - If it is a metadata database:
        - Locate all osds that use it (`ceph-volume lvm list` + grep).
        - Follow the rest of this procedure for each affected osd.
    - If it is a regular osd:
        - Treat it as a normal single-osd replacement.
3. Mark out the osd: `ceph osd out $OSD_ID`
4. Destroy the osd: `ceph osd destroy $OSD_ID --yes-i-really-mean-it`
5. Stop the osd daemon: `systemctl stop ceph-osd@$OSD_ID`
6. Unmount the filesystem: `umount /var/lib/ceph/osd/ceph-$OSD_ID`
7. If the osd uses a metadata database (on an ssd):
    - If it is a regular partition, remove the partition.
    - If it is an LVM logical volume, remove it with `ceph-volume lvm zap`, e.g. for `/dev/cephrocks/cache-b0bf5753-07b7-40eb-bf3a-cc6f35de1d85`:

    ```
    ceph-volume lvm zap /dev/cephrocks/cache-b0bf5753-07b7-40eb-bf3a-cc6f35de1d85 --destroy
    ```
8. If `ceph-volume` fails to list the defective devices or to zap the disks, you can get the information you need through `lvs -o devices,lv_tags | grep type=block` and use `vgremove` instead to remove the osd block.
9. If you cannot get any information through `lvs` about the defective devices, list the working osds and `umount` the unused directories with:

    ```
    $ umount `ls -d -1 /var/lib/ceph/osd/* | grep -v -f <(grep -Po '(?<=osd\.)[0-9]+' ceph-volume.out)`
    ```
10. Wait until the devices have been replaced. Skip this step if they have already been replaced.
11. If the osd's metadata database lives on a separate device (an ssd) and that device is formatted through LVM, prepare it. For naming we use ``cache-`uuid -v4` ``. Recreate the LV you removed at step 7 with `lvcreate --name $name -l 100%FREE $VG`. LVM has three layers: PVs, the physical devices (e.g. /dev/sda); VGs, the volume groups that contain one or more physical devices; and LVs, the "partitions" of a VG. For simplicity we use one PV per VG and one LV per VG. If you have more than one LV per VG, size each one accordingly when you recreate it, e.g. with `-l 25%VG` for 4 LVs per VG.
12. Recreate the OSD using ceph-volume, reusing a destroyed osd's id from the same host:

    ```
    $ ceph-volume lvm create --bluestore --data /dev/sdXXX --block.db (VG/LV or ssd partition) --osd-id XXX
    ```
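For orientation, here is a minimal consolidated sketch of the steps above for a single osd whose metadata database is an LVM volume on an ssd. Every id, device and VG/LV name below is illustrative; take the real values from `~/ceph-volume.out` before running anything.

```bash
OSD_ID=430                 # hypothetical failed osd
DATA_DEV=/dev/sdw          # hypothetical failed data disk
DB_VG=cephrocks            # hypothetical VG holding the db LVs on the ssd

ceph osd out $OSD_ID
ceph osd destroy $OSD_ID --yes-i-really-mean-it
systemctl stop ceph-osd@$OSD_ID
umount /var/lib/ceph/osd/ceph-$OSD_ID

# Remove the old db LV (only if the db lives on an LVM-formatted ssd);
# the real LV path comes from the "db device" field in ceph-volume.out.
ceph-volume lvm zap /dev/$DB_VG/cache-OLD-UUID --destroy

# ... wait here until the repair service has replaced the failed hardware ...

# Recreate the db LV with the same naming scheme, then recreate the osd
# reusing the destroyed osd's id.
DB_LV=cache-`uuid -v4`
lvcreate --name $DB_LV -l 100%FREE $DB_VG
ceph-volume lvm create --bluestore --data $DATA_DEV --block.db $DB_VG/$DB_LV --osd-id $OSD_ID
```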
Replacement procedure: two disks striped (raid 0) per osd
1. Run this script with the defective device: `ceph-scripts/tools/lvm_stripe_remove.sh /dev/sdXXX` (it does not take a list of devices).
2. The script reports the cleanup it performed; you need the 2nd and 3rd lines of its output, which are the two disks that made up the failed osd, and the last line, which is the osd id.
3. If the script fails at any point, you can open it and follow the steps manually, as it is documented.
4. If you have more than one osd to replace, repeat steps 1 and 2 for each one; step 5 can be done at the end for all of them.
5. Once all of the replacement disks are in place and working, pass the set of disks from step 1 to this script:
   (it does an `ls` internally, so you can use wildcards if you are bored of writing '/dev/sdX' every time)
6. It will output a list of commands to be executed in order. Run all of them EXCEPT the `ceph-volume create` one. Append `--osd-id XXX`, with the id of the destroyed osd, to the `ceph-volume create` line, and then run that command.
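As a purely illustrative example (the real create line comes from the script's output; the VG/LV name and osd id below are hypothetical), the last step looks like this:

```bash
# Command printed by the script (do NOT run it as-is):
#   ceph-volume lvm create --bluestore --data <VG/LV assembled from the two new disks>
# Re-run it by hand with the destroyed osd's id appended:
ceph-volume lvm create --bluestore --data ceph-stripe-vg/osd-data-lv --osd-id 646
```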