User's Guide
This chapter provides minimal Ceph user documentation.
Block Storage and CephFS documentation is available at the following URLs:
- https://clouddocs.web.cern.ch/clouddocs/details/volumes.html
- https://clouddocs.web.cern.ch/clouddocs/file_shares/index.html
Which Storage Service is Right for Me?
I need an extra drive for my OpenStack VM:
- See block storage
I need a POSIX filesystem shared across a small number of servers:
- See CephFS
I need storage which is accessible from lxplus, lxbatch, or the WLCG:
- See EOS
I need to share my files or collaborate with colleagues:
- See CERNbox
I need HTTP accessible cloud storage for my application:
- See S3
I need to distribute static software or data globally:
- See CVMFS
I need to archive data to tape:
- See CASTOR
Using Block Storage
Block storage is accessible via OpenStack VMs as documented here: https://clouddocs.web.cern.ch/clouddocs/details/volumes.html
Using CephFS
CephFS is made available via OpenStack Manila. See https://clouddocs.web.cern.ch/file_shares/index.html for more info.
Using S3 or SWIFT
S3/Swift is made available via OpenStack. See https://clouddocs.web.cern.ch/object_store/README.html for more info.
Configure aws cli
The aws s3api is useful for doing advanced s3 operations, e.g. dealing with object versions. The following explains how to set this up with our s3.cern.ch endpoint.
Setting up aws
All of the information required to set up aws-cli can be found in the existing .s3cfg file already used for S3 (s3cmd).
$> yum install awscli
$> aws configure
AWS Access Key ID [None]: <your access key>
AWS Secret Access Key [None]: <your secret key>
Default region name [None]:
Default output format [None]:
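If you already use s3cmd, the same credentials can be fed to aws-cli non-interactively; a minimal sketch assuming the standard access_key/secret_key entries in ~/.s3cfg:
$> aws configure set aws_access_key_id "$(awk -F' = ' '/^access_key/ {print $2}' ~/.s3cfg)"
$> aws configure set aws_secret_access_key "$(awk -F' = ' '/^secret_key/ {print $2}' ~/.s3cfg)"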
Testing
$> aws --endpoint-url=http://s3.cern.ch s3api list-buckets
{
"Buckets": [
{
"Name": <bucket1>,
"CreationDate": <timestamp>
},
{
....
}
],
"Owner": {
"DisplayName": <owner>,
"ID": <owner id>
}
}
Delete all object versions
We provide a script to help users make sure all versions of their objects are deleted.
Usage:
$> ./s3-delete-all-object-versions.sh -b <bucket> [-f]
-b: bucket name to be cleaned up
-f: if omitted, the script will simply display a summary of actions. Add -f to execute them.
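The script essentially wraps the aws s3api version listing and deletion calls; a minimal sketch of the underlying operations (bucket and key names are placeholders):
# dry run: list all object versions and delete markers in the bucket
$> aws --endpoint-url=http://s3.cern.ch s3api list-object-versions --bucket <bucket>
# what the script executes with -f, for every listed version
$> aws --endpoint-url=http://s3.cern.ch s3api delete-object --bucket <bucket> --key <object_key> --version-id <version_id>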
Useful links
Restic Cephfs Backup (Automatic)
This guide covers the basic architecture and operations of the distributed backup system based on restic.
All the puppet configuration is under the following hostgroup structure:
ceph/restic/
ceph/restic/agent
ceph/restic/agent/backup
The code of the different scripts resides in the following git repository:
Architecture
These are the actual components of the current system and their role:
cephrestic-backup-NN (cephrestic-backup.cern.ch)
Stateless nodes and the actual workers of the system. These nodes each contain a restic agent, which is always running and checks for new backup jobs every 5 seconds. When a job is found, the agent handles the backup, copying files from cephfs to s3.
cback-switch
This daemon runs every hour at a random minute on every agent and changes the status of Completed backups after 24 hours so they become Pending (see the Operating section). The daemon does the same for the prune mechanism, marking as Pending all the jobs with no prune in the last week.
S3 Storage
This is where we store the backups. Each user has their own bucket, named cboxback-<user_name> (cboxbackproj-svc_account for projects). Every restic agent has the utility s3cmd installed and configured, so we can list the existing buckets:
s3cmd ls
CAUTION: Needless to say, deleting the S3 bucket will delete all backup data and snapshot information. The backup won't fail; instead a fresh backup will be triggered. So take care while operating on the bucket directly, and if needed disable the related backup job with cback backup disable <id>.
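Before operating on a backup bucket, it can help to inspect its contents first, e.g.:
s3cmd ls s3://cboxback-<user_name>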
Configuration
The basic configuration of the backup system is done by config files managed by puppet through hiera.
These config files reside in /etc/cback/cback-<type-of-agent>-config.json.
The available configuration parameters are explained in each data hostgroup (hiera) file.
Command Line Interface (cback)
For operating the system there is a command line tool called cback. This tool is available on any of the backup, prune or restore agents. The tool is still in development, so always check cback --help to see the current commands.
Operating Backup
- Check backup status
cback backup status
These are the possible backup statuses:
- Enabled: Only enabled jobs will be taken into account by the backup agents.
- Pending: The backup is ready to be backed up. Any available agent will pick up this job whenever it is free, unless the job is disabled.
- Running: The job is running at that moment. Check cback backup status to see which agent is taking care of the job.
- Failed: There was a problem with that backup. Check cback backup status <user_name | job_id> to see what went wrong.
- Completed: The last backup was successful. This is not a permanent state; after the default 24 hours, the status will be changed back to Pending.
Only the jobs that are Enabled + Pending, and whose prune status is not Running, will be processed by the backup agents. The command cback backup reset <id> will set the status to Pending.
- Check the status of a particular user or backup id:
cback backup status rvalverd
- List all backups
cback backup ls
- List all backups by status:
cback backup ls [failed|completed|running|pending|disabled]
- Enable / disable a backup job
cback backup enable|disable <backup_id>
NOTE: This command does not stop a running backup. If the backup is currently running, it will run to completion but won't be picked up for subsequent backups.
- Reset a backup job (changes the status to Pending)
cback backup reset <backup_id>
- Add a new backup job
cback backup add <user_name> <instance> <path> [--bucket-prefix=<prefix>] [--bucket-name=<name>] [--enable]
Example:
cback backup add rvalverd cvmfs /cephfs-flax/volumes/_nogroup/234234 --enable
This will add a new backup job and will store the specified path in a bucket called cephback-rvalverd. The bucket will be created automatically on the first run of the backup.
NOTE 1: By default, the <user_name> argument is used to generate the name of the bucket, concatenating it with the bucket prefix (by default cephback-). If user_name is rvalverd, the bucket will be named cephback-rvalverd.
NOTE 2: It's possible to add more than one backup per user as long as the path is different.
NOTE 3: If the instance does not exist, it will be created automatically. This field is only used for categorizing the jobs, so it does not need to match an existing ceph instance and is not used in the actual backup logic.
NOTE 4: All backup jobs are added as Pending + Disabled by default unless the --enable flag is set, which will add the backup as Pending + Enabled. The --enable flag will also set prune to Enabled + Pending.
NOTE 5: If --bucket-prefix is not specified, the default (cephback-) will be used. This is configurable through Puppet.
NOTE 6: If --bucket-name is specified, its value will be used instead of any other combination.
NOTE 7: The S3 repository will be created automatically by the backup agent on the first run of the backup.
- Delete a backup job. An interactive shell will be presented to delete the backup metadata and also the S3 bucket contents if needed. Use it with care: no recovery is possible. It is not possible to delete backups in Running status.
cback backup delete <backup_id>
Restoring a backup
Currently, refer to the restic documentation in order to recover the data.
For operating the repository using restic you need to:
- Source the environment configuration:
source /etc/cback/restic_env
- Get the url of the backup to operate:
cback backup status <user_name | backup_id>
- Run normal restic commands:
restic -r s3:s3.cern.ch/cephback-rvalverd snapshots|restore|find ...
Refer to restic help for all available options.
Scaling the System
Vertically:
- You can run as many processes as you wish on any agent by spawning a new process:
systemctl start cback-<type_of_agent>@<new_agent_id>
For example:
If we have only one agent on cephrestic-backup-01, we can do the following to have two:
[rvalverd@cephrestic-backup-01]$ systemctl start cback-backup@2
The number of agents to run on each machine is not (currently) managed by puppet, so changes are persistent. If an agent crashes, it won't be restarted by puppet. This will be addressed in future versions of the system.
Horizontally:
You need to spawn a new machine in the required hostgroup:
- backup agent:
ceph/restic/agent/backup
- prune agent:
ceph/restic/agent/prune
- restore agent:
ceph/restic/agent/restore
For example, to add a new backup agent N (assuming we currently have N-1):
[rvalverd@aiadm09 ~]$ eval `ai-rc "IT Ceph Storage Service"`
ai-bs --landb-mainuser ceph-admins --landb-responsible ceph-admins --nova-flavor m2.large --cc7 -g ceph/restic/agent/backup --foreman-environment qa cephrestic-backup-N.cern.ch
Adding the node to the load balanced alias:
openstack server set --property landb-alias=cephrestic-backup--load-N- cephrestic-backup-N
Once the installation is done and puppet has run, you need to log in to the machine and start the daemon (this will be done automatically in a future version of the system):
[rvalverd@cephrestic-backup-N]$ systemctl start cback-backup@1
After that, the agent should start pulling jobs
Using the log system
The log of any agent can be found at /var/log/cback/cback-<type_agent>.log
You can grep for the job_id for convenience, for example:
cat /var/log/cback/cback-backup.log | grep 3452
Operating with the backup repository using upstream Restic
As the system uses the upstream version of restic, the backup repository can be managed directly. Restic is installed on all backup agents.
- First, you need to source the configuration:
source /etc/cback/restic_env
NOTE: If that file is not available, you can export the contents of /etc/sysconfig/restic_env
And then, you can refer to restic documentation about how to use the tool.
- Here is an example of how to list the available snapshots of one backup:
restic -r s3:s3.cern.ch/cephback-rvalverd snapshots
For convenience, or for long debugging sessions, you can also set the repository information as an environment variable:
export RESTIC_REPOSITORY=s3:s3.cern.ch/cephback-rvalverd
This way you don't need to specify the -r flag every time.
- Here is another example about how to mount the repository as a filesystem (read-only):
restic -r s3:s3.cern.ch/cephback-rvalverd mount /mnt
Data backup with Restic (manual)
This document describes how to back up your block storage or CephFS with restic. Here we describe backing up to S3, but the tool supports several other backends as well.
Restic/S3 Setup
export RESTIC_REPOSITORY=s3:s3.cern.ch/<my_backup_repo>
export RESTIC_PASSWORD_FILE=<secret_path_of_a_file_with_the_repo_pass_inside>
export AWS_ACCESS_KEY_ID=<s3_access_key>
export AWS_SECRET_ACCESS_KEY=<s3_secret_access_key>
Restic Download / Install
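The download/install details are not reproduced here; as a sketch, on CERN CentOS-like machines restic is usually available from the distribution/EPEL repositories, or you can grab an upstream binary release:
yum install restic
# or download a pre-built binary from https://github.com/restic/restic/releases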
Initialize Backup Repository
restic init
Backup
restic backup <my_share>
NOTE: By default, restic places its cache files in $HOME/.cache. If you want to specify another path for the cache, you can use the --cache-dir <dir> flag.
Restore
There are two options: directly using the restic restore command, or mounting the backup repository and copying the files from it.
Directly
- List backup snapshots
restic snapshots
- Restore the selected snapshot
restic restore <snapshot_id> --target <target_path>
NOTE: you can use restic find to look for specific files inside a snapshot.
Using the mount option
- You can browse your backup repository using fuse
restic mount /mnt/<my_repo>
NOTE: You can run restic snapshots to see the correlation between the snapshot id and the folder.
Delete a snapshot
- List snapshots
restic snapshots
- Forget snapshot
restic forget <snapshot_id>
Interesting flags for restic forget
-l, --keep-last n keep the last n snapshots
-H, --keep-hourly n keep the last n hourly snapshots
-d, --keep-daily n keep the last n daily snapshots
-w, --keep-weekly n keep the last n weekly snapshots
-m, --keep-monthly n keep the last n monthly snapshots
-y, --keep-yearly n keep the last n yearly snapshots
--keep-tag taglist keep snapshots with this taglist (can be specified multiple times) (default [])
- Clean the repo (this will delete all forgotten snapshots)
restic prune
- All-in-one
restic forget <snapshot_id> --prune
Check the repository for inconsistencies
restic check
Crontab job setup
mm hh dom m dow restic backup <my_share>
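For example, a concrete crontab entry (schedule, env file and log path are illustrative) that runs a nightly backup at 02:00, assuming the exports above are kept in a file such as $HOME/.restic_env:
0 2 * * * . $HOME/.restic_env && restic backup <my_share> >> $HOME/restic-backup.log 2>&1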
More info
restic --help
Operator's Guide
Create a Ceph Test Cluster
Create CEPH Cluster
Prepare the hostgroups
- Log in to Foreman and create the following hostgroups:
  - <my_cluster>, selecting ceph as the parent hostgroup.
  - For the monitors, create hostgroup mon and select ceph/<my_cluster> as the parent group.
  - For the osds, create hostgroup osd and select ceph/<my_cluster> as the parent group.
  - For the metadata servers, create hostgroup mds and select ceph/<my_cluster> as the parent.
- Do the puppet configuration:
  - Clone the repo it-puppet-hostgroup-ceph
  - Create the manifests and data files accordingly for the new cluster (use the configuration of another cluster as a base)
  - Remember to create a new uuid for the cluster and put it in /code/hostgroup/ceph/<my_cluster>.yaml
  - Commit, push, do a merge request, etc.
First Monitor Configuration
1) Create one virtual machine for the first monitor following this guide.
2) Create a mon bootstrap key (from any previous ceph cluster):
ssh root@ceph<existing-cluster>-mon-XXXX ceph-authtool --create-keyring /tmp/keyring.mon --gen-key -n mon. --cap mon 'allow *'
From aiadm (you may need to ask for permissions to get access to the tbag folder):
mkdir ~/private/tbag/<my_cluster>
cd ~/private/tbag/<my_cluster>
scp root@ceph<existing_cluster>-mon-XXXX:/tmp/keyring.mon .
tbag set --hg ceph/<my_cluster>/mon keyring.mon --file keyring.mon
3) Now run puppet on the first mon:
puppet agent -t -v
4) Now copy the admin keyring to tbag (from aiadm):
scp root@<first_mon>:/etc/ceph/keyring .
tbag set --hg ceph/<my_cluster> keyring --file keyring
5) Now create an MGR bootstrap key on the first mon:
ceph auth get-or-create-key client.bootstrap-mgr mon 'allow profile bootstrap-mgr'
ceph auth get client.bootstrap-mgr > /tmp/keyring.bootstrap-mgr
From aiadm:
scp root@<first_mon>:/tmp/keyring.bootstrap-mgr .
tbag set --hg ceph/<my_cluster> keyring.bootstrap-mgr --file keyring.bootstrap-mgr
6) Now create an OSD bootstrap key on the first mon:
ceph auth get-or-create-key client.bootstrap-osd mon 'allow profile bootstrap-osd'
ceph auth get client.bootstrap-osd > /tmp/keyring.bootstrap-osd
From aiadm:
scp root@<first_mon>:/tmp/keyring.bootstrap-osd .
tbag set --hg ceph/<my_cluster> keyring.bootstrap-osd --file keyring.bootstrap-osd
Add more Monitors and OSD's
- Follow step 1) to add more mons and osds. Everything should install correctly.
- Prepare and activate the OSDs:
/root/ceph-scripts/ceph-disk/ceph-disk-prepare-all
NOTE: To set up an OSD on the same machine as the monitor:
- mkdir /data/a (for example)
- chown ceph:ceph -R /data
- ceph-disk prepare --filestore /data/a (ignore the deprecation warnings)
- ceph-disk activate /data/a
Creating a CEPH cluster
Table of Contents
- Introduction
- Puppet configuration
- Creating monitor hosts
- Creating manager hosts
- Creating osd hosts
- Creating the first pool
- Finalize cluster configuration
- RBD Clusters
- CephFS Clusters
- S3 Clusters
Follow the instructions below to create a new CEPH cluster at CERN.
Prerequisites
- Access to aiadm.cern.ch
- Proper GIT configuration
- Member of ceph administration e-groups
- OpenStack environment configured, link
Introduction - Hostgroups
First, we have to create the hostgroups in which we want to build our cluster.
The hostgroups provide a layer of abstraction for automatically configuring a
cluster using Puppet. The top-level group, called ceph, ensures that each
machine in this hostgroup has ceph installed, configured and running. The first
sub-hostgroup ensures that each machine will communicate with machines in the
same sub-hostgroup, forming a cluster. These machines will have specific
configuration defined later in this guide. The second sub-hostgroup ensures
that each machine will act according to its corresponding role in the cluster.
For example, we first create our cluster's hostgroup, with the name provided by your task.
[user@aiadm]$ ai-foreman addhostgroup ceph/{hg_name}
As each cluster has its own features, the 2 basic sub-hostgroups for a ceph
cluster are the mon and osd.
[user@aiadm]$ ai-foreman addhostgroup ceph/{hg_name}/mon
[user@aiadm]$ ai-foreman addhostgroup ceph/{hg_name}/osd
These sub-hostgroups will contain the monitors and the osd hosts.
If the cluster has to use CephFS and/or Rados gateway we need to create the
appropriate sub-hostgroups.
[user@aiadm]$ ai-foreman addhostgroup ceph/{hg_name}/mds #for CephFS
[user@aiadm]$ ai-foreman addhostgroup ceph/{hg_name}/radosgw #for the rados gateway
Creating a configuration for your new cluster
Go to gitlab.cern.ch and search for it-puppet-hostgroup-ceph. This repository
contains the configuration for all the machines under the ceph hostgroup. Clone
the repository, create a new branch based on qa, and go to it-puppet-hostgroup-ceph/code/manifests.
From there, you will create the {hg_name}.pp file and the {hg_name} folder.
The {hg_name}.pp should contain the following code (replace {hg_name} with the cluster's name):
class hg_ceph::{hg_name} {
include hg_ceph::include::base
}
This will load the basic configuration for ceph on each machine. The
{hg_name} folder should contain the *.pp files for the appropriate second-level
sub-hostgroups.
The files under your cluster's folder will have the following basic format:
File {role}.pp:
class hg_ceph::{hg_name}::{role} {
include hg_ceph::classes::{role}
}
The include will use a configuration template located in
it-puppet-hostgroup-ceph/code/manifests/classes.
The roles are: mon, mgr, osd, mds and radosgw. It is good to run mon and mgr together, so we usually create the following class, e.g.:
class hg_ceph::{hg_name}::mon {
include hg_ceph::classes::mon
include hg_ceph::classes::mgr
}
The above code will configure machines in "ceph/{hg_name}/mon" to act as
monitors and mgrs together. After you are done creating the needed files
for your task, your "code/manifests" path should look like this:
# Using kermit as {hg_name}
kermit.pp
kermit/mon.pp
kermit/osd.pp
# Optional, only if requested by the JIRA ticket
kermit/mds.pp
kermit/radosgw.pp
Create a YAML configuration file for the new hostgroup in
it-puppet-hostgroup-ceph/data/hostgroup/ceph
with name {hg_name}.yaml. This
file contains all the basic configuration parameters that are common to all
the nodes in the cluster.
ceph::conf::fsid: d3c77094-4d74-4acc-a2bb-1db1e42bb576
ceph::params::release: octopus
lbalias: ceph{hg_name}.cern.ch
hg_ceph::classes::mon::enable_lbalias: false
hg_ceph::classes::mon::enable_health_cron: true
hg_ceph::classes::mon::enable_sls_cron: true
Where:
- ceph::conf::fsid can be generated with the uuid tool (see the example below);
- lbalias is the alias the mons are part of.
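For instance, a fresh fsid can be generated on aiadm or any Linux machine with uuidgen (equivalent to the uuid tool mentioned above):
uuidgen
# -> prints a new UUID, e.g. d3c77094-4d74-4acc-a2bb-1db1e42bb576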
Git add the new files, commit and push your branch. BEFORE you push,
do a git pull --rebase origin qa to avoid any conflicts with your request.
The command line will provide a link to submit a merge request.
@dvanders is currently the administrator of the repo, so you should assign him the task to check your request and eventually merge it.
Creating your first monitor node
Follow the instructions to create exactly one monitor here.
DO NOT ADD more than one machine to the ceph/{hg_name}/mon hostgroup,
otherwise your first monitor will always deadlock and you will need to remove
the others and rebuild the first one again.
With TBag authentication
Once we are able to log in to the node, we will need to create the keys to be
able to bootstrap new nodes to the cluster. We will first have to create the
initial key, so mons can be created in our new cluster.
[root@ceph{hg_name}-mon-...]$ ceph-authtool --create-keyring /tmp/keyring.mon --gen-key -n mon. --cap mon 'allow *'
Login to aiadm, copy the key from the monitor host and store it on tbag.
[user@aiadm]$ mkdir -p ~/private/tbag/{hg_name}
[user@aiadm]$ cd ~/private/tbag/{hg_name}
[user@aiadm]$ scp {mon_host}:/tmp/keyring.mon .
[user@aiadm]$ tbag set --hg ceph/{hg_name}/mon keyring.mon --file keyring.mon
Log in to your mon host and run puppet agent -t; repeat until you see a running ceph-mon process.
Run the following to disable some warnings and enable some features for ceph:
[root@ceph{hg_name}-mon-...]$ ceph mon enable-msgr2
[root@ceph{hg_name}-mon-...]$ ceph osd set-require-min-compat-client luminous
[root@ceph{hg_name}-mon-...]$ ceph config set mon auth_allow_insecure_global_id_reclaim false
Note that enable-msgr2 will need to be run again after all mons have been created.
We will need to repeat this procedure for the mgr, osd, mds, rgw and rbd-mirror depending on what we need:
[root@ceph{hg_name}-mon-...]$ ceph auth get-or-create-key client.bootstrap-mgr mon 'allow profile bootstrap-mgr'
[root@ceph{hg_name}-mon-...]$ ceph auth get client.bootstrap-mgr > /tmp/keyring.bootstrap-mgr
[root@ceph{hg_name}-mon-...]$ ceph auth get-or-create-key client.bootstrap-osd mon 'allow profile bootstrap-osd'
[root@ceph{hg_name}-mon-...]$ ceph auth get client.bootstrap-osd > /tmp/keyring.bootstrap-osd
# Optional, only if the cluster uses CephFS
[root@ceph{hg_name}-mon-...]$ ceph auth get-or-create-key client.bootstrap-mds mon 'allow profile bootstrap-mds'
[root@ceph{hg_name}-mon-...]$ ceph auth get client.bootstrap-mds > /tmp/keyring.bootstrap-mds
# Optional, only if the cluster uses a Rados Gateway
[root@ceph{hg_name}-mon-...]$ ceph auth get-or-create client.bootstrap-rgw mon 'allow profile bootstrap-rgw' -o /tmp/keyring.bootstrap-rgw
# Optional, only if the cluster uses a rbd-mirror
[root@ceph{hg_name}-mon-...]$ ceph auth get client.bootstrap-rbd-mirror -o /tmp/keyring.bootstrap-rbd-mirror
Login to aiadm, copy the keys from the monitor host and store them with tbag.
Make sure you don't have any excess keys in the /tmp folder (5 max: mon/mgr/osd/mds/rgw).
We don't need to provide the specific subgroup for each key, because that would cause confusion;
"ceph/{hg_name}" is enough.
[user@aiadm]$ cd ~/private/tbag/{hg_name}
[user@aiadm]$ scp {mon_host}:/tmp/keyring.* .
[user@aiadm]$ scp {mon_host}:/etc/ceph/keyring .
[user@aiadm]$ for file in *; do tbag set --hg ceph/{hg_name} $file --file $file; done
# Make sure to copy all the generated keys on `/mnt/projectspace/tbag` of `cephadm.cern.ch` as well:
[user@aiadm]$ scp -r . root@cephadm:/mnt/projectspace/tbag/{hg_name}
Now create the other monitors using the same procedure as the first one, using ai-bs.
The other monitors will be configured automatically.
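As a sketch, the ai-bs invocation for an additional mon mirrors the backup-agent example earlier in this guide; the flavor, OS and Foreman environment flags below are assumptions to adapt to your cluster:
[user@aiadm]$ eval `ai-rc "IT Ceph Storage Service"`
[user@aiadm]$ ai-bs -g ceph/{hg_name}/mon --landb-mainuser ceph-admins --landb-responsible ceph-admins --nova-flavor m2.large --cc7 --foreman-environment qa ceph{hg_name}-mon-XXXX.cern.ch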
Creating manager hosts
The procedure is very similar to the one for the creation of mons:
- Create new VMs;
- Add them to the ceph/{hg_name}/mgr hostgroup;
- Set the right roger state for the new VMs;
Instructions for the creation of mons still hold here, with the necessary changes for mgrs.
As stated above, in some cases it is necessary to colocate mons and mgrs. If so, there is no need to create new machines for mgrs; simply include the mgr class in the mon manifest:
class hg_ceph::{hg_name}::mon {
include hg_ceph::classes::mon
include hg_ceph::classes::mgr
}
Creating osd hosts
The OSD hosts will usually be given to you to be prepared by formatting the disks
and adding them to the cluster. The tool used to format the disks is ceph-volume.
The provisioning happens with lvm. Make sure your disks are empty; run pvs and
vgs to check if they have any lvm data.
We can safely ignore the system disks in case they are used with lvm. On every
host, run ceph-volume lvm zap {disk} --destroy to zap the disks and remove any
lvm data. If your hosts contain only one type of disk (HDD or SSD) for
OSDs, we can run the following command to provision our OSDs:
# The device glob works like with the ls command; if we need to create OSDs from /dev/sdc to /dev/sdz we can try this
[root@cephdataYY-...]$ ceph-volume lvm batch /dev/sd[c-z]
You will be prompted to check the OSD creation plan and, if you agree with the
proposed changes, you can input yes to create the OSDs. If you are trying to
automate this task you can pass the --yes parameter to the ceph-volume lvm batch
command. In case you have SSDs to back the HDDs, to create hybrid OSDs using
SSD block.DB and HDD block.data you will have to run the above command per SSD:
# 2 SSDs sda sdb 4HDDs sdc sdd sde sdf
[root@cephdataYY-...]$ ceph-volume lvm batch /dev/sda /dev/sdc /dev/sdd
[root@cephdataYY-...]$ ceph-volume lvm batch /dev/sdb /dev/sde /dev/sdf
The problem with the current lvm batch implementation is that it creates a single
volume group for the block.DB part. Therefore, when an SSD fails, the whole set
of OSDs in the host becomes corrupted. So, in order to minimize the cost, we run batch per SSD.
Run ceph osd tree to check whether the OSDs are placed correctly in the tree.
If the OSDs are not placed as expected, check the crush settings with grep ^crush /etc/ceph/ceph.conf: you
will need to remove the line containing something like update crush on start
and restart the OSDs of that host.
You can also create/move/delete buckets with (examples):
ceph osd crush add-bucket CK13 rack
ceph osd crush move CK13 room=0513-R-0050
ceph osd crush move 0513-R-0050 root=default
ceph osd crush move cephflash21a-ff5578c275 rack=CK13
Now you are one step away from having a functional cluster.
The next step is to create a pool so that we can use the storage of our cluster.
Creating the first pool
A pool in ceph is the root namespace of an object store system. A pool has its
own data redundancy schema and access permissions. In case cephfs is used, two
pools are created, one for data and one for metadata; to support openstack,
various pools are created for storing images, volumes and shares.
To create a pool we first have to understand what type of data redundancy we
should use: replicated or EC. If the task already defines what should happen,
then you can go to the ceph documentation:
BEFORE you create a pool, you first need to create a CRUSH rule that matches
your cluster's schema.
You can get the schema by running ceph osd tree | less.
As an example, the meredith cluster runs with 4+2 EC and the failure domain is rack. Create the required erasure-code-profile with:
[root@cephmeredithmon...]$ ceph osd erasure-code-profile ls
default
[root@cephmeredithmon...]$ ceph osd erasure-code-profile set jera_4plus2
[root@cephmeredithmon...]$ ceph osd erasure-code-profile get jera_4plus2
crush-device-class=
crush-failure-domain=host
crush-root=default
jerasure-per-chunk-alignment=false
k=2
m=1
plugin=jerasure
technique=reed_sol_van
w=8
[root@cephmeredithmon...]$ ceph osd erasure-code-profile set jera_4plus2 k=4 m=2 crush-failure-domain=rack --force
[root@cephmeredithmon...]$ ceph osd erasure-code-profile get jera_4plus2
crush-device-class=
crush-failure-domain=rack
crush-root=default
jerasure-per-chunk-alignment=false
k=4
m=2
plugin=jerasure
technique=reed_sol_van
w=8
NEVER modify an existing profile: that would change the data placement on disk!
Here we use the --force flag only because the new jera_4plus2 profile is not used yet.
Now create a CRUSH rule with the defined profile:
[root@cephmeredithmon...]$ ceph osd crush rule create-erasure rack_ec jera_4plus2
created rule rack_ec at 1
[root@cephmeredithmon...]$ ceph osd crush rule ls
replicated_rule
rack_ec
[root@cephmeredithmon...]$ ceph osd crush rule dump rack_ec
{
"rule_id": 1,
"rule_name": "rack_ec",
"ruleset": 1,
"type": 3,
"min_size": 3,
"max_size": 6,
"steps": [
{
"op": "set_chooseleaf_tries",
"num": 5
},
{
"op": "set_choose_tries",
"num": 100
},
{
"op": "take",
"item": -1,
"item_name": "default"
},
{
"op": "chooseleaf_indep",
"num": 0,
"type": "rack"
},
{
"op": "emit"
}
]
}
The last thing left is to calculate the number of PGs to keep the cluster running optimally.
The Ceph developers recommend 30 to 100 PGs per OSD; keep in mind that the data
redundancy schema counts as a multiplier. For example, if you have 100 OSDs you
will need at least 3K to 10K PG replicas in total. The number of PGs must be a power of
two. So, we will use at least 1024(x3) to 2048(x3) PGs in the pool creation
command. Keep in mind that there may be a need for additional pools, such as
"test", which is created on every cluster for the profound reason of testing.
In general the formula is the following:
MaxPGs = \begin{cases}
NumOSDs*100/ReplicationSize &\text{if } replicated \\
NumOSDs*100/(k+m) &\text{if } erasure\ coded
\end{cases}
Then we use the closest power of two, which is less than the above number.
Example on meredith (368 OSDs, EC -- k=4, m=2): MaxPGs=6133 --> MaxPGs=4096
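A quick way to reproduce this arithmetic (numbers from the meredith example; swap in your own OSD count and k+m or replication size):
python3 -c "import math; t=368*100//(4+2); print(t, 2**int(math.log2(t)))"
# -> 6133 4096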
Now, let's create the pools following the upstream documentation Create a pool.
We should have at least one test pool and one data pool:
- Create the test pool. It should always be replicated and not EC:
[root@cephmeredithmon...]$ ceph osd pool create test 512 512 replicated replicated_rule
pool 'test' created
[root@cephmeredithmon...]$ ceph osd pool ls detail
pool 6 'test' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 512 pgp_num 512 last_change 1710 flags hashpspool stripe_width 0 application test
- Create the data pool (named 'rbd_ec_data' here) with EC:
[root@cephmeredithmon...]$ ceph osd pool create rbd_ec_data 4096 4096 erasure jera_4plus2 rbd_ec_data
pool 'rbd_ec_data' created
[root@cephmeredithmon...]$ ceph osd pool ls detail | grep rbd_ec_data
pool 4 'rbd_ec_data' erasure size 6 min_size 5 crush_rule 1 object_hash rjenkins pg_num 4096 pgp_num 4096 autoscale_mode warn last_change 1554 flags hashpspool stripe_width 16384
Finalize cluster configuration
Security Flags on Pools
- Make sure the security flags {nodelete, nopgchange, nosizechange} are set for all the pools
[root@cluster_mon]$ ceph osd pool ls detail
pool 4 'rbd_ec_data' erasure size 6 min_size 5 crush_rule 1 object_hash rjenkins pg_num 4096 pgp_num 4096 last_change 1711 lfor 0/0/1559 flags hashpspool,ec_overwrites,nodelete,nopgchange,nosizechange,selfmanaged_snaps stripe_width 16384 application rbd
...
If not, set the flags with
[root@cluster_mon]$ ceph osd pool set <pool_name> {nodelete, nopgchange, nosizechange} 1
- pg_autoscale_mode should be set to off:
[root@cluster_mon]$ ceph osd pool ls detail
pool 4 'rbd_ec_data' erasure size 6 min_size 5 crush_rule 1 object_hash rjenkins pg_num 4096 pgp_num 4096 last_change 1985 lfor 0/0/1559 flags hashpspool,ec_overwrites,nodelete,nopgchange,nosizechange,selfmanaged_snaps stripe_width 16384 application rbd
If the output shows anything for autoscale_mode, disable autoscaling with
[root@cluster_mon]$ ceph osd pool set <pool_name> pg_autoscale_mode off
- Set the application type for each pool in the cluster
[root@cluster_mon]$ ceph osd pool application enable my_test_pool test
[root@cluster_mon]$ ceph osd pool application enable my_rbd_pool rbd
- If relevant, enable the balancer
[root@cluster_mon]$ ceph balancer on
[root@cluster_mon]$ ceph balancer mode upmap
[root@cluster_mon]$ ceph config set mgr mgr/balancer/upmap_max_deviation 1
The parameter upmap_max_deviation is used to spread the PGs more evenly across the OSDs.
Check with
[root@cluster_mon]$ ceph balancer status
{
"plans": [],
"active": true,
"last_optimize_started": "Tue Jan 12 16:47:48 2021",
"last_optimize_duration": "0:00:00.296960",
"optimize_result": "Optimization plan created successfully",
"mode": "upmap"
}
[root@cluster_mon]$ ceph config dump
WHO MASK LEVEL OPTION VALUE RO
mgr advanced mgr/balancer/active true
mgr advanced mgr/balancer/mode upmap
mgr advanced mgr/balancer/upmap_max_deviation 1
Also, after quite some time spent balancing, the number of PGs per OSD should be even.
Focus on the PGS column of the output of ceph osd df tree:
[root@cluster_mon]$ ceph osd df tree
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS TYPE NAME
-1 642.74780 - 643 TiB 414 GiB 46 GiB 505 KiB 368 GiB 642 TiB 0.06 1.00 - root default
-5 642.74780 - 643 TiB 414 GiB 46 GiB 505 KiB 368 GiB 642 TiB 0.06 1.00 - room 0513-R-0050
-4 27.94556 - 28 TiB 18 GiB 2.0 GiB 0 B 16 GiB 28 TiB 0.06 1.00 - rack CK01
-3 27.94556 - 28 TiB 18 GiB 2.0 GiB 0 B 16 GiB 28 TiB 0.06 1.00 - host cephflash21a-04f5dd1763
0 ssd 1.74660 1.00000 1.7 TiB 1.1 GiB 127 MiB 0 B 1 GiB 1.7 TiB 0.06 1.00 75 up osd.0
1 ssd 1.74660 1.00000 1.7 TiB 1.1 GiB 127 MiB 0 B 1 GiB 1.7 TiB 0.06 1.00 69 up osd.1
2 ssd 1.74660 1.00000 1.7 TiB 1.1 GiB 127 MiB 0 B 1 GiB 1.7 TiB 0.06 1.00 72 up osd.2
3 ssd 1.74660 1.00000 1.7 TiB 1.1 GiB 127 MiB 0 B 1 GiB 1.7 TiB 0.06 1.00 70 up osd.3
Monitoring
Cluster monitoring is offered by:
- Health crons enabled at the hostgroup level (see the YAML file above):
  - enable_health_cron enables sending the email report that checks the current health status and greps in recent ceph.log
  - enable_sls_cron enables sending metrics to filer-carbon that populate the Ceph Health dashboard
- Regular polling performed by cephadm.cern.ch
- Prometheus
- Watcher clients (CephFS) that mount and test FS availability
To enable polling from cephadm, proceed as follows:
- Add the new cluster to it-puppet-hostgroup-ceph/code/manifest/admin.pp. Consider "Admin newclusters" as a reference merge request. (Note: if you are adding a CephFS cluster, you do not need to add it to the ### BASIC CEPH CLIENTS array.)
- Create a client.admin key on the cluster:
[root@ceph{hg_name}-mon-...]$ ceph auth get-or-create client.admin mon 'allow *' mgr 'allow *' osd 'allow *' mds 'allow *'
[client.admin]
        key = <the_super_secret_key>
- Add the key to tbag in the ceph/admin hostgroup (the secret must contain the full output of the command above):
tbag set --hg ceph/admin <cluster_name>.keyring --file <keyring_filename>
tbag set --hg ceph/admin <cluster_name>.admin.secret
Enter Secret: <paste secret here>
- Add the new cluster to it-puppet-module-ceph/data/ceph.yaml, otherwise the clients (cephadm included) will lack the mon hostname. (Consider "Add ryan cluster" as a reference merge request.) Double-check you are using the appropriate port.
- ssh to cephadm and run puppet a couple of times.
- Make sure the files at <cluster_name>.client.admin.keyring and at <cluster_name>.conf exist and show the appropriate content.
- Check the health of the cluster with
[root@cephadm]# ceph --cluster=<cluster_name> health
HEALTH_OK
- Cephadm is also responsible for producing the availability numbers sent to the central IT Service Availability Overview. If the cluster needs to be reported in IT SAO, add it to ceph-availability-producer.py with a relevant description.
To enable monitoring from Prometheus, add the new cluster to prometheus.yaml. Also, the Prometheus module must be enabled on the MGR (Documentation: https://docs.ceph.com/en/octopus/mgr/prometheus/) for metrics to be retrieved:
ceph mgr module enable prometheus
To ensure a CephFS cluster is represented adequately, there are some unique steps we must take:
- Update the it-puppet-module-cephfs README.md and code/data/common.yaml to include the new cluster (consider "add doyle cluster" as a reference merge request).
- Update the it-puppet-hostgroup-ceph watchers definition in code/manfiests/test/cepfs/watchers.pp to ensure the new cluster is mounted by the watchers (consider "watchers.pp: add doyle definition" as an example merge request).
- SSH to one of the watcher nodes (e.g. cephfs-testc9-d81171f572.cern.ch) and run puppet a few times to synchronise the changes.
- Check cat /proc/mounts | grep ceph for an appropriate systemd mount and navigate to one of the directories within / to examine whether the FS is available.
Details on lbalias for mons
We prefer not to use the load-balancing service and lbclient here (https://configdocs.web.cern.ch/dnslb/):
there is no scenario in ceph where we want a mon to disappear from the alias.
We rather use the --load-N- approach to create the alias with all the mons:
- Go to network.cern.ch
- Click on Update information and use the FQDN of the mon machine
  - If prompted, make sure you select the host interface and not the IPMI one
- Add "ceph{hg_name}--LOAD-N-" to the list of IP Aliases under TCP/IP Interface Information
  - Multiple aliases are supported. Use a comma-separated list
- Check the changes are correct and submit the request
Benchmarking
Note: What follows is not proper benchmarking but some quick checks that the cluster works as expected.
Good reading at Benchmarking performance
Rados bench
Start a test on pool 'my_test_pool' with 10s duration and block size 4096 B:
[root@cluster_mon]$ rados bench -p my_test_pool 10 write -b 4096
hints = 1
Maintaining 16 concurrent writes of 4096 bytes to objects of size 4096 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_cephflash21a-a6564a2ee7.cern._1768589
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
0 0 0 0 0 0 - 0
1 16 8752 8736 34.1231 34.125 0.00130825 0.00182201
2 16 16913 16897 32.9995 31.8789 0.00104112 0.00189076
3 15 24678 24663 32.1108 30.3359 0.00139087 0.00194522
4 16 32189 32173 31.4167 29.3359 0.0209055 0.0019863
5 16 39595 39579 30.9187 28.9297 0.0209981 0.00201906
6 16 47263 47247 30.7573 29.9531 0.00138272 0.00203065
7 16 55169 55153 30.7748 30.8828 0.00121337 0.00202973
8 16 63070 63054 30.7855 30.8633 0.00133439 0.00202877
9 15 70408 70393 30.55 28.668 0.00144124 0.00204461
10 11 78679 78668 30.7271 32.3242 0.00162555 0.00203309
Total time run: 10.0178
Total writes made: 78679
Write size: 4096
Object size: 4096
Bandwidth (MB/sec): 30.6793
Stddev Bandwidth: 1.68734
Max bandwidth (MB/sec): 34.125
Min bandwidth (MB/sec): 28.668
Average IOPS: 7853
Stddev IOPS: 431.959
Max IOPS: 8736
Min IOPS: 7339
Average Latency(s): 0.00203504
Stddev Latency(s): 0.00370041
Max latency(s): 0.0702117
Min latency(s): 0.000887922
Cleaning up (deleting benchmark objects)
Removed 78679 objects
Clean up completed and total clean up time :4.93871
RBD bench
Create an RBD image and run some tests on it:
[root@cluster_mon]$ rbd create rbd_ec_meta/enricotest --size 100G --data-pool rbd_ec_data
[root@cluster_mon]$ rbd bench --io-type write rbd_ec_meta/enricotest --io-size 4M --io-total 100G
Once done, delete the image with
[root@cluster_mon]$ rbd ls -p rbd_ec_meta
[root@cluster_mon]$ rbd rm rbd_ec_meta/enricotest
RBD clusters
Create Cinder key for use with OpenStack
All of the above steps lead to a fully functional RBD cluster. The only missing step is to create access keys for OpenStack Cinder so that it can use the provided storage.
The upstream documentation on user management (and OpenStack is a user) is available at User Management
To create the relevant access key for OpenStack use the following command:
$ ceph auth get-or-create client.cinder mon 'profile rbd' osd 'profile rbd pool=volumes' mgr 'profile rbd pool=volumes'
which creates a user named "cinder" allowed to run rbd commands on the pool named "volumes".
Create an Images pool for use with OpenStack Glance
To store Glance images on ceph, a dedicated pool (pg_num may vary) and cephx keys are needed:
$ ceph osd pool create images 128 128 replicated replicated_rule
$ ceph auth get-or-create client.images mon 'profile rbd' mgr 'profile rbd pool=images' osd 'profile rbd pool=images'
CephFS Clusters
Enabling CephFS consists of creating data and metadata pools for CephFS and a new filesystem.
It is also necessary to create metadata servers (either dedicated or colocated with other daemons), else the cluster will show HEALTH_ERR and '1 filesystem offline'. See below for the creation of metadata servers.
Follow the upstream documentation at Create a Ceph File System
Creating metadata servers
Add at least two hosts to ceph/{hg_name}/mds.
MDS daemons can be dedicated (preferable for large, busy clusters) or colocated with other daemons (e.g., on the osd host, assuming enough memory is available).
As soon as one MDS goes active, the cluster health will go back to HEALTH_OK.
It is recommended to have at least 2 nodes running MDSes for failover.
One can also consider having a standby-replay MDS to lower the time needed for a failover.
Create Manila key for use with OpenStack
To provision CephFS File Shares via OpenStack Manila, a dedicated cephx key must be provided to the OpenStack team. Create the key with:
$ ceph auth get-or-create client.manila mon 'allow r' mgr 'allow rw'
S3 Clusters
Creating rgw hosts
To provide object storage, it is necessary to run Ceph Object Gateway daemons (radosgw).
RGWs can run on dedicated machines (by creating new hosts in hostgroup ceph/{hg_name}/rgw) or colocated with existing machines.
In both cases, these classes need to be enabled:
- The radosgw class (radosgw.pp)
- The lb class (lb.pp)
- The traefik class (traefik.pp)
Also, you may want to enable:
- The S3 crons for specific quota and health checks (see include/s3{hourly,daily,weekly}.pp)
- Traefik log ingestion into the MONIT pipelines for ElasticSearch dashboards (see s3-logging).
Always start with one RGW only and iterate over the configuration until it runs.
Some of the required data pools (default.rgw.control, default.rgw.meta, default.rgw.log, .rgw.root)
are automatically created by the RGW at its first run. The creation of some other pools
is triggered by specific actions, e.g., making a bucket will create pool default.rgw.buckets.index,
and pushing the first object will trigger the creation of default.rgw.buckets.data.
It is highly recommended to pre-create all pools so that they have the right crush rule,
pg_num, etc. before data is written to them. If they get auto-created, they will use
the default crush type (replicated), while we typically use erasure coding for object storage.
Use an existing cluster as a reference to configure the pools.
Creating a DNS load-balanced alias
The round-robin based DNS load balancing service is described at DNS Load Balancing.
To create a new load-balanced alias for S3:
- Go to https://aiermis.cern.ch/
- Add LB Alias, specifying whether it needs to be external and the number of hosts to return (Best Hosts)
- Configure hg_ceph::classes::lb::lbalias and the relevant RadosGW configuration params accordingly (rgw dns name, rgw dns s3website name, rgw swift url, ...)
- To support virtual-host-style bucket addressing (i.e., mybucket.s3.cern.ch), talk to the Network Team to have wildcard DNS enabled on the alias
Integration with OpenStack Keystone
RBD Mirroring
Make sure you have included hg_ceph::classes:rbd_mirror and set up the bootstrap-rbd-mirror keyring.
Adding peers to rbd-mirror
You first have to add a rbd-mirror-peer keyring in the hostgroup ceph.
First get to your mon and run the following command:
[root@ceph{hg_name}-mon-...]$ ceph auth get-or-create client.rbd-mirror-peer mon 'profile rbd-mirror-peer' osd 'profile rbd' -o {hg_name}.client.rbd-mirror-peer.keyring
Copy the keyring to aiadm and create the secret:
[user@aiadm]$ tbag set --hg ceph {hg_name}.client.rbd-mirror-peer.keyring --file {hg_name}.client.rbd-mirror-peer.keyring
Now your cluster can participate with the others already registered to mirror your RBD images! You can now add the following data to register peers for your rbd-mirror daemons:
ceph::rbd_mirror:
- peer1
- peer2
- ...
Peering pools
You first have to enable the mirroring of some of your pools: https://docs.ceph.com/en/octopus/rbd/rbd-mirroring/#enable-mirroring. Also check the configuration of those modes in the same page (journaling feature enabled on the RBD images, image snapshot settings, ...).
And then you can add peers like this:
[root@ceph{hg_name}-rbd-mirror-...]$ rbd mirror pool peer add {pool} client.rbd-mirror-peer@{remote_peer}
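To verify that the peer was registered and that mirroring is progressing, the standard rbd mirror status commands can be used (a generic check, not specific to our setup):
[root@ceph{hg_name}-rbd-mirror-...]$ rbd mirror pool info {pool}
[root@ceph{hg_name}-rbd-mirror-...]$ rbd mirror pool status {pool} --verbose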
What to watch?
There are several channels to watch during your Rota shift:
- Emails to ceph-admins@cern.ch:
  - "Ceph Health Warn" mails.
  - SNOW tickets from IT Repair Service.
  - Prometheus Alerts.
- SNOW tickets assigned to Ceph Service:
  - Here is a link to the tickets needing to be taken: Ceph Assigned
- Ceph Internal Mattermost channel
- General information on clusters (configurations, OSD types, HW, versions): Instance Version Tracking ticket
Taking notes
Each action you take should be noted down in a journal, which is to be linked or attached to the minutes of the Ceph weekly meeting the following week (https://indico.cern.ch/category/9250/). Use HackMD, Notepad, ...
Keeping the Team Informed
If you have any questions or take any significant actions, keep your colleagues informed in Mattermost.
Common Procedures
- scsi_blockdevice_driver_error_reported
- CephInconsistentPGs
- Ceph PG Unfound
- CephTargetDown
- SSD Replacement
- MDS Slow Ops
- Large omap Objects
exception.scsi_blockdevice_driver_error_reported
Draining a Failing OSD
The IT Repair Service may ask ceph-admins to prepare a disk to be physically removed.
The scripts needed for the replacement procedure may be found under ceph-scripts/tools/ceph-disk-replacement/.
For failing OSDs in wigner cluster, contact ceph-admins
- watch ceph status <- keep this open in a separate window.
- Log in to the machine with the failing drive and run ./drain-osd.sh --dev /dev/sdX (the ticket should tell which drive is failing)
  - For machines in /ceph/erin/osd/castor: you cannot run the script, ask ceph-admins.
  - If the output is of the following form, take note of the OSD id <id>:
    ceph osd out osd.<id>
  - Else:
    - If the script shows no output: Ceph is unhealthy or the OSD is unsafe to stop, contact ceph-admins
    - Else, if the script shows broken output (especially a missing <id>): contact ceph-admins
- Run ./drain-osd.sh --dev /dev/sdX | sh
- Once drained (can take a few hours), we now want to prepare the disk for replacement:
  - Run ./prepare-for-replacement.sh --dev /dev/sdX
  - Continue if the output is of the following form and the OSD id <id> displayed is consistent with what was given by the previous command:
    systemctl stop ceph-osd@<id>
    umount /var/lib/ceph/osd/ceph-<id>
    ceph-volume lvm zap /dev/sdX --destroy
    (note that the --destroy flag will be dropped in case of a FileStore OSD)
  - Else:
    - If the script shows no output: Ceph is unhealthy or the OSD is unsafe to stop, contact ceph-admins
    - Else, if the script shows broken output (especially a missing <id>): contact ceph-admins
- Run ./prepare-for-replacement.sh --dev /dev/sdX | sh to execute.
- Now the disk is safe to be physically removed.
  - Notify the repair team in the ticket.
Creating a new OSD (on a replacement disk)
When the IT Repair Service has replaced the broken disk with a new one, we have to format that disk with BlueStore to add it back to the cluster:
- watch ceph status <- keep this open in a separate window.
- Identify the osd id to use on this OSD:
  - Check your notes from the drain procedure above.
  - Cross-check with ceph osd tree down <-- look for the down osd on this host; it should match your notes.
- Run ./recreate-osd.sh --dev /dev/sdX and check that the output is according to the following:
  - On the beesly cluster:
    ceph-volume lvm zap /dev/sdX
    ceph osd destroy <id> --yes-i-really-mean-it
    ceph-volume lvm create --osd-id <id> --data /dev/sdX --block.db /dev/sdY
  - On the gabe cluster:
    ceph-volume lvm zap /dev/sdX
    ceph-volume lvm zap /dev/ceph-block-dbs-[0-6a-f]+/osd-block-db-[0-6a-z-]+
    ceph osd destroy <id> --yes-i-really-mean-it
    ceph-volume lvm create --osd-id <id> --data /dev/sdX --block.db ceph-block-dbs-[0-6a-f]+/osd-block-db-[0-6a-z-]+
  - On the erin cluster:
    - Regular case:
      ceph-volume lvm zap /dev/sdX
      ceph osd destroy <id> --yes-i-really-mean-it
      ceph-volume lvm create --osd-id <id> --data /dev/sdX --block.db /dev/sdY
    - ceph/erin/castor/osd: the script cannot be run, contact ceph-admins.
- If the output is satisfactory, run ./recreate-osd.sh --dev /dev/sdX | sh
See OSD Replacement for many more details.
CephInconsistentPGs
Familiarize yourself with the Upstream documentation
Check ceph.log on a ceph/*/mon machine to find the original "cluster [ERR]" line.
The inconsistent PGs generally come in two types:
- deep-scrub: stat mismatch, solution is to repair the PG.
  Here is an example on ceph/flax:
2019-02-17 16:23:05.393557 osd.60 osd.60 128.142.161.220:6831/3872729 56 : cluster [ERR] 1.85 deep-scrub : stat mismatch, got 149749/149749 objects, 0/0 clones, 149749/149749 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 135303283738/135303284584 bytes, 0/0 hit_set_archive bytes.
2019-02-17 16:23:05.393566 osd.60 osd.60 128.142.161.220:6831/3872729 57 : cluster [ERR] 1.85 deep-scrub 1 errors
- candidate had a read error, solution follows below.
- Notice that the doc says If read_error is listed in the errors attribute of a shard, the inconsistency is likely due to disk errors. You might want to check your disk used by that OSD. This is indeed the most common scenario.
Handle a failing disk
In this case, a failing disk returns bogus data during deep scrubbing, and ceph will notice that the replicas are not all consistent with each other. The correct procedure is therefore to remove the failing disk from the cluster, let the PGs backfill, then finally to deep-scrub the inconsistent PG once again.
Here is an example on the ceph/erin cluster, where the monitoring has told us that PG 64.657c is inconsistent:
[09:16][root@cepherin0 (production:ceph/erin/mon*1) ~] grep shard /var/log/ceph/ceph.log
2017-04-12 06:34:26.763000 osd.508 128.142.25.116:6924/4070422 4602 : cluster [ERR] 64.657c shard 187:
soid 64:3ea78883:::1568573986@castorns.27153415189.0000000000000034:head candidate had a read error
A shard in this case refers to which OSD has the inconsistent object replica, in this case it's the "osd.187".
Where is osd.187?
[09:16][root@cepherin0 (production:ceph/erin/mon*1) ~]# ceph osd find 187
{
"osd": 187,
"ip": "128.142.25.106:6820\/530456",
"crush_location": {
"host": "p05972678k94093",
"rack": "EC06",
"room": "0513-R-0050",
"root": "default",
"row": "EC"
}
}
On the p05972678k94093 host we first need to find out which /dev/sd* device hosts osd.187.
On BlueStore OSDs we need to check with ceph-volume lvm list or lvs:
[14:38][root@p05972678e32155 (production:ceph/erin/osd*30) ~]# lvs -o +devices,tags | grep 187
osd-block-... ceph-... -wi-ao---- <5.46t /dev/sdm(0) ....,ceph.osd_id=187,....
So we know the failing drive is /dev/sdm; now we can check for disk Medium errors:
[09:16][root@p05972678k94093 (production:ceph/erin/osd*29) ~]# grep sdm /var/log/messages
[Wed Apr 12 12:27:59 2017] sd 1:0:10:0: [sdm] CDB: Read(16) 88 00 00 00 00 00 05 67 07 20 00 00 04 00 00 00
[Wed Apr 12 12:27:59 2017] blk_update_request: critical medium error, dev sdm, sector 90638112
[Wed Apr 12 12:28:02 2017] sd 1:0:10:0: [sdm] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Wed Apr 12 12:28:02 2017] sd 1:0:10:0: [sdm] Sense Key : Medium Error [current]
[Wed Apr 12 12:28:02 2017] sd 1:0:10:0: [sdm] Add. Sense: Unrecovered read error
[Wed Apr 12 12:28:02 2017] sd 1:0:10:0: [sdm] CDB: Read(16) 88 00 00 00 00 00 05 67 07 20 00 00 00 08 00 00
[Wed Apr 12 12:28:02 2017] blk_update_request: critical medium error, dev sdm, sector 90638112
In this case, the disk is clearly failing.
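Optionally, cross-check the drive's SMART data before proceeding (the same smartctl check mentioned later in this guide); the attributes grepped for here are the usual suspects and may differ per vendor:
[root@p05972678k94093 (production:ceph/erin/osd*30) ~]# smartctl -a /dev/sdm | grep -i -E 'reallocated|pending|uncorrect'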
Now check whether that osd is safe to stop:
[14:41][root@p05972678k94093 (production:ceph/erin/osd*30) ~]# ceph osd ok-to-stop osd.187
OSD(s) 187 are ok to stop without reducing availability, provided there are no other concurrent failures or interventions. 182 PGs are likely to be degraded (but remain available) as a result.
Since it is OK, we stop the osd, umount it, and mark it out.
[09:17][root@p05972678k94093 (production:ceph/erin/osd*30) ~]# systemctl stop ceph-osd@187.service
[09:17][root@p05972678k94093 (production:ceph/erin/osd*30) ~]# umount /var/lib/ceph/osd/ceph-187
[09:17][root@p05972678k94093 (production:ceph/erin/osd*29) ~]# ceph osd out 187
marked out osd.187.
ceph status should now show the PG in a state like this:
1 active+undersized+degraded+remapped+inconsistent+backfilling
It can take a few 10s of minutes to backfill the degraded PG.
Repairing a PG
Once the inconsistent PG is no longer "undersized" or "degraded", use the script at ceph-scripts/tools/scrubbing/autorepair.sh to repair the PG and start the scrubbing immediately.
Now check ceph status... You should see the scrubbing+repair started already on the inconsistent PG.
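If the autorepair script is not at hand, the upstream equivalent is roughly the following (PG id from the example above); the repair schedules a deep-scrub that also fixes the inconsistent replica:
[root@cepherin0 (production:ceph/erin/mon*1) ~]# ceph pg repair 64.657c
[root@cepherin0 (production:ceph/erin/mon*1) ~]# ceph pg deep-scrub 64.657c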
Ceph PG Unfound
The PG unfound condition may be due to a race condition when PGs are scrubbed (see https://tracker.ceph.com/issues/51194), leading to the PG being reported as recovery_unfound.
Upstream documentation is available for general unfound objects.
In case of unfound objects, ceph reports a HEALTH_ERR condition
# ceph -s
cluster:
id: 687634f1-03b7-415b-aff9-e21e6bedbe7c
health: HEALTH_ERR
1/282983194 objects unfound (0.000%)
Possible data damage: 1 pg recovery_unfound
Degraded data redundancy: 3/848949582 objects degraded (0.000%), 1 pg degraded
services:
mon: 3 daemons, quorum cephdata20-4675e5a59e,cephdata20-44bdbfa86f,cephdata20-83e1d8a16e (age 4h)
mgr: cephdata20-83e1d8a16e(active, since 11w), standbys: cephdata20-4675e5a59e, cephdata20-44bdbfa86f
osd: 576 osds: 575 up (since 9d), 573 in (since 9d)
data:
pools: 3 pools, 17409 pgs
objects: 282.98M objects, 1.1 PiB
usage: 3.2 PiB used, 3.0 PiB / 6.2 PiB avail
pgs: 3/848949582 objects degraded (0.000%)
1/282983194 objects unfound (0.000%)
17342 active+clean
60 active+clean+scrubbing+deep
6 active+clean+scrubbing
1 active+recovery_unfound+degraded
List the PGs in recovery_unfound state:
# ceph pg ls recovery_unfound
PG OBJECTS DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG STATE SINCE VERSION REPORTED UP ACTING SCRUB_STAMP DEEP_SCRUB_STAMP
1.2d09 17232 3 0 1 72106876434 0 0 3373 active+recovery_unfound+degraded 37m 399723'3926620 399723:23220581 [574,671,662]p574 [574,671,662]p574 2023-01-12T13:27:34.752832+0100 2023-01-12T13:27:34.752832+0100
Check the ceph log (cat /var/log/ceph/ceph.log | grep ERR) for IO errors on the primary OSD of the PG. In this case, the disk backing osd.574 is failing with pending sectors (check with smartctl -a <device>):
2023-01-12T13:07:39.543780+0100 osd.574 (osd.574) 775 : cluster [ERR] 1.2d09 shard 574 soid 1:90b4a201:::rbd_data.0bee1ae64c9012.00000000000032c4:head : candidate had a read error
2023-01-12T13:27:34.752327+0100 osd.574 (osd.574) 776 : cluster [ERR] 1.2d09 deep-scrub 0 missing, 1 inconsistent objects
2023-01-12T13:27:34.752830+0100 osd.574 (osd.574) 777 : cluster [ERR] 1.2d09 repair 1 errors, 1 fixed
2023-01-12T13:27:34.783419+0100 osd.671 (osd.671) 3686 : cluster [ERR] 1.2d09 push 1:90b4a201:::rbd_data.0bee1ae64c9012.00000000000032c4:head v 399702'3923004 failed because local copy is 399709'3924819
2023-01-12T13:27:34.809509+0100 osd.662 (osd.662) 4313 : cluster [ERR] 1.2d09 push 1:90b4a201:::rbd_data.0bee1ae64c9012.00000000000032c4:head v 399702'3923004 failed because local copy is 399709'3924819
2023-01-12T13:27:39.196815+0100 mgr.cephdata20-83e1d8a16e (mgr.340798757) 3573768 : cluster [DBG] pgmap v3244401: 17409 pgs: 1 active+recovery_unfound+degraded, 12 active+clean+scrubbing, 62 active+clean+scrubbing+deep, 17334 active+clean; 1.1 PiB data, 3.2 PiB used, 3.0 PiB / 6.2 PiB avail; 224 MiB/s rd, 340 MiB/s wr, 4.83k op/s; 3/848927301 objects degraded (0.000%); 1/282975767 objects unfound (0.000%)
Before taking any action, make sure that the versions of the object reported as unfound on the other two OSDs are more recent than the lost one:
- List the unfound object:
# ceph pg 1.2d09 list_unfound
{
  "num_missing": 1,
  "num_unfound": 1,
  "objects": [
    {
      "oid": {
        "oid": "rbd_data.0bee1ae64c9012.00000000000032c4",
        "key": "",
        "snapid": -2,
        "hash": 2152017161,
        "max": 0,
        "pool": 1,
        "namespace": ""
      },
      "need": "399702'3923004",
      "have": "0'0",
      "flags": "none",
      "clean_regions": "clean_offsets: [], clean_omap: 0, new_object: 1",
      "locations": []
    }
  ],
  "state": "NotRecovering",
  "available_might_have_unfound": true,
  "might_have_unfound": [],
  "more": false
}
- The missing object is at version
399702
- Last osd map before read error: e399704
2023-01-12T13:07:24.463521+0100 mon.cephdata20-4675e5a59e (mon.0) 2714279 : cluster [DBG] osdmap e399704: 576 total, 575 up, 573 in 2023-01-12T13:07:39.543780+0100 osd.574 (osd.574) 775 : cluster [ERR] 1.2d09 shard 574 soid 1:90b4a201:::rbd_data.0bee1ae64c9012.00000000000032c4:head : candidate had a read error
- The object goes unfound at: e399710
2023-01-12T13:27:30.297813+0100 mon.cephdata20-4675e5a59e (mon.0) 2714933 : cluster [DBG] osdmap e399710: 576 total, 575 up, 573 in 2023-01-12T13:27:39.196815+0100 mgr.cephdata20-83e1d8a16e (mgr.340798757) 3573768 : cluster [DBG] pgmap v3244401: 17409 pgs: 1 active+recovery_unfound+degraded, 12 active+clean+scrubbing, 62 active+clean+scrubbing+deep, 17334 active+clean; 1.1 PiB data, 3.2 PiB used, 3.0 PiB / 6.2 PiB avail; 224 MiB/s rd, 340 MiB/s wr, 4.83k op/s; 3/848927301 objects degraded (0.000%); 1/282975767 objects unfound (0.000%)
- The two copies on 671 and 662 are more recent (399702 vs 399709):

```
2023-01-12T13:27:34.783419+0100 osd.671 (osd.671) 3686 : cluster [ERR] 1.2d09 push 1:90b4a201:::rbd_data.0bee1ae64c9012.00000000000032c4:head v 399702'3923004 failed because local copy is 399709'3924819
2023-01-12T13:27:34.809509+0100 osd.662 (osd.662) 4313 : cluster [ERR] 1.2d09 push 1:90b4a201:::rbd_data.0bee1ae64c9012.00000000000032c4:head v 399702'3923004 failed because local copy is 399709'3924819
```
If copies are more recent than the lost one:
- Set the primary osd (`osd.574`) out
- The `recovery_unfound` object disappears and backfilling starts
- Once backfilled, deep-scrub the PG to check for inconsistencies (see the sketch below)
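A minimal sketch of these final steps, using the OSD and PG from this example (osd.574, PG 1.2d09); check the cluster state between each step:

```sh
# take the failing primary out so the PG backfills from the healthy copies
ceph osd out osd.574

# follow recovery/backfill progress
ceph -s
ceph pg ls recovery_unfound

# once backfilling has finished, deep-scrub the PG to check for inconsistencies
ceph pg deep-scrub 1.2d09
```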
CephTargetDown
This is a special alert raised by Prometheus. It indicates that, for whatever reason, a target node is no longer exposing its metrics or the Prometheus server is not able to pull them. This does not imply that the node is offline, just that the node endpoint is down for Prometheus.
To handle these tickets, first identify the affected target. This information should be in the ticket body.
The following Alerts are in Firing Status:
------------------------------------------------
Target cephpolbo-mon-0.cern.ch:9100 is down
Target cephpolbo-mon-2.cern.ch:9100 is down
Alert Details:
------------------------------------------------
Alertname: TargetDown
Cluster: polbo
Job: node
Monitor: cern
Replica: A
Severity: warning
Then go to the target section in the Prometheus dashboard and cross-check the affected node. There you can find more information about the reason it is down.
This could be caused by the following reasons:
- A node is offline or it's being restarted. Follow the normal procedures for understanding why the node is not online (ping, ssh, console access, SNOW ticket search...). Once the node is back, the target should be marked as UP again automatically.
- If a new target was added recently, possibly there are mistakes in the target definition or some connectivity problems, like the port being blocked.
  - Review the target configuration in `it-puppet-hostgroup-ceph/data/hostgroup/ceph/prometheus.yaml` and refer to the monitoring guide.
  - Make sure that the firewall configuration allows prometheus to scrape the data through the specified port.
- In Ceph, the daemons that expose the metrics are the `mgr`s. It can happen that a mgr hangs and stops exposing the metrics.
  - Check the `mgr` status and eventually restart it (see the sketch below). Don't forget to collect information about the state you found it in for further analysis. If all went well, after 30 seconds the target should be `UP` again in the Prometheus dashboard. To double-check, you can click on the `endpoint` url of the node and see if the metrics are now shown.
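A minimal check-and-restart sketch, assuming you are on the affected mon/mgr host (the exact systemd unit may be `ceph-mgr@<id>` on some deployments; `ceph-mgr.target` restarts all mgr instances on the host):

```sh
# which mgr is active, and is it responding?
ceph -s | grep mgr

# collect state first (logs, mgr status), then restart the local mgr
systemctl restart ceph-mgr.target

# verify the prometheus endpoint is serving metrics again (port 9283)
curl -s http://localhost:9283/metrics | head
```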
SSD Replacement
Draining OSDs attached to a failing SSD
In order to drain the osds attached to a failing SSD, run the following command:
$> cd /root/ceph-scripts/tools/ceph-disk-replacement
$> ./ssd-drain-osd.sh --dev /dev/<ssd>
ceph osd out osd.<osd0>;
ceph osd primary-affinity osd.<osd0> 0;
ceph osd out osd.<osd1>;
ceph osd primary-affinity osd.<osd1> 0;
...
ceph osd out osd.<osdN>;
ceph osd primary-affinity osd.<osdN> 0;
If the output is similar to the one above, it is safe to re-run the commands adding | sh
to actually put out of the cluster all the osds attached to the ssd.
Prepare for replacement
Once the draining has been started, the osds need to be zapped before the ssd can be removed and physically replaced:
$> ./ssd-prepare-for-replacement.sh --dev /dev/<dev> -f
systemctl stop ceph-osd@<osd0>
umount /var/lib/ceph/osd/ceph-<osd0>
ceph-volume lvm zap --destroy --osd-id <osd0>
systemctl stop ceph-osd@<osd1>
umount /var/lib/ceph/osd/ceph-<osd1>
ceph-volume lvm zap --destroy --osd-id <osd1>
...
systemctl stop ceph-osd@<osdN>
umount /var/lib/ceph/osd/ceph-<osdN>
ceph-volume lvm zap --destroy --osd-id <osdN>
Recreate the OSD
TBC
MDS Slow Ops
Check for long ongoing operations on the MDS reporting Slow Ops:
The mon shows SLOW_OPS warning:
ceph health detail
cat /var/log/ceph/ceph.log | grep SLOW
cluster [WRN] Health check failed: 1 MDSs report slow requests (MDS_SLOW_REQUEST)
The affected MDS shows slow request in the logs:
cat /var/log/ceph/ceph-mds.cephcpu21-0c370531cf.log | grep -i SLOW
2022-10-22T09:09:21.473+0200 7fe1b8054700 0 log_channel(cluster) log [WRN] : 30 slow requests, 1 included below; oldest blocked for > 2356.704295 secs
2022-10-22T09:09:21.473+0200 7fe1b8054700 0 log_channel(cluster) log [WRN] : slow request 1924.631928 seconds old, received at 2022-10-22T08:37:16.841403+0200: client_request(client.366059605:743931 getattr AsXsFs #0x10251604c38 2022-10-22T08:37:16.841568+0200 caller_uid=1001710000, caller_gid=0{1001710000,}) currently dispatched
Dump the ongoing ops and check there are some with very long (minutes, hours) age:
ceph daemon mds.`hostname -s` ops | grep age | less
Identify the client with such long ops (age should be >900):
ceph daemon mds.`hostname -s` ops | egrep 'client|age' | less
"description": "client_request(client.364075205:4876 getattr pAsLsXsFs #0x1023f14e5d8 2022-10-16T03:46:40.673900+0200 RETRY=184 caller_uid=0, caller_gid=0{})",
"age": 0.87975248399999995,
"reqid": "client.364075205:4876",
"op_type": "client_request",
"client_info": {
"client": "client.364075205",
Get info on the client:
ceph daemon mds.`hostname -s` client ls id=<THE_ID>
- IP address
- Hostname
- Ceph client version
- Kernel version (in case of a kernel mount)
- Mount point (on the client side)
- Root (aka, the CephFS volume the client mounts)
Evict the client:
ceph tell mds.* client ls id=<THE_ID>
ceph tell mds.* client evict id=<THE_ID>
Large omap objects
On S3 clusters, it may happen to see a `HEALTH_WARN` message reporting `1 large omap objects`. This is very likely due to bucket index(es) being over full. Example:
"user_id": "warp-tests",
"buckets": [
{
"bucket": "warp-tests",
"tenant": "",
"num_objects": 9993106,
"num_shards": 11,
"objects_per_shard": 908464,
"fill_status": "OVER"
}
]
Proceed as follows:
- Check that bucket index(es) being over full is the actual problem: `radosgw-admin bucket limit check`
- If it is not possible to reshard the bucket, tune `osd_deep_scrub_large_omap_object_key_threshold` appropriately (default is 200000; Gabe runs with 500000, read at 42on.com): `ceph config set osd osd_deep_scrub_large_omap_object_key_threshold 300000`
- If it is possible to reshard the bucket, manually reshard any bucket showing `fill_status` `WARN` or `OVER`:
  - keep the number of objects per shard around 50k
  - pick a prime number of shards
  - consider if the bucket will be ever-growing or owners delete objects. If ever-growing, you may reshard to a high number of shards to avoid (or postpone) resharding in the future.

  `radosgw-admin bucket reshard --bucket=warp-tests --num-shards=211`
- Check in `ceph.log` which PG is complaining about the large omap objects and start a deep scrub on it (else the `HEALTH_WARN` won't go away):

```
# zcat /var/log/ceph/ceph.log-20221204.gz | grep -i large
2022-12-03T06:48:37.975544+0100 osd.179 (osd.179) 996 : cluster [WRN] Large omap object found. Object: 9:22f5fbf8:::.dir.a1035ed2-37be-4e7d-892d-46728bc3d046.285532.1.1:head PG: 9.1fdfaf44 (9.344) Key count: 204639 Size (bytes): 60621488
2022-12-03T06:48:39.270652+0100 mon.cephdata22-12f31fcca0 (mon.0) 292373 : cluster [WRN] Health check failed: 1 large omap objects (LARGE_OMAP_OBJECTS)

# ceph pg deep-scrub 9.344
instructing pg 9.344 on osd.179 to deep-scrub
```
Ceph Clusters
Production Clusters
Cluster | Lead | Use-case | Mon host (where?) | Release | Version | OS | Racks | IP Services | Power | SSB Upgrades? |
---|---|---|---|---|---|---|---|---|---|---|
barn | Enrico | Cinder: cp1, cpio1 | cephbarn (hw) | pacific | 16.2.9-1 | RHEL8 | BA09 | S513-A-IP250 | UPS-4/-C | Yes |
beesly | Enrico | Glance Cinder: 1st AZ | cephmon (hw) | pacific | 16.2.9-1 | RHEL8 | CD27-CD30 BA10-BA12 | S513-C-IP152 S513-A-IP38 S513-A-IP63 | UPS-3/-4 UPS-4/-C | Yes |
cta | Roberto | CTA prod | cephcta (hw) | octopus | 15.2.15-0 | RHEL8 | SI36-SI41 | - | No, Julien Leduc | |
dwight | Dan | Testing + Manila: CephFS Testing | cephmond (vm,abc) | pacific | 16.2.9-2 | Alma8 | CE01-CE03 | S513-C-IP501 | Yes + Manila MM | |
doyle | CephFS for DFS Projects | cephdoyls (hw) | quincy | 17.2.6-4 | RHEL9 | CP18, CP19-21, CP22 | S513-C-IP200 | UPS-1 | Yes + Sebast/Giuseppe | |
flax(*) | Abhi | Manila: Meyrin CephFS | cephflax (vm,abc) | pacific | 16.2.9-1 | RHEL8 | BA10,SQ05 CQ18-CQ21 SJ04-SJ07 | S513-A-IP558,S513-V-IP562 S513-C-IP164 S513-V-IP553 | UPS-4/-C,UPS-1 UPS-1 UPS-3 | Yes |
gabe | Enrico | S3 | cephgabe (hw) | pacific | 16.2.13-5 | RHEL8 | SE04-SE07 SJ04-SJ07 | S513-V-IP808 S513-V-IP553 | UPS-1 UPS-3 | Yes |
jim | Enrico | HPC BE (CephFS) | cephjim (vm,abc) | octopus | 15.2.15-2 | RHEL8 | SW11-SW15 SX11-SX15 | S513-V-IP194 S513-V-IP193 | UPS-3 UPS-3 | No, Nils Hoimyr |
kelly | Roberto | Cinder: hyperc + CTA preprod | cephkelly (hyperc) | pacific | 16.2.13-5 | RHEL8 | CQ12-CQ22 | S513-C-IP164 | UPS-1 | Yes + Julien Leduc |
kapoor | Enrico | Cinder: cpio2, cpio3 | cephkapoor (hyperc) | quincy | 17.2.6-4 | RHEL8 | BE10 BE11 BE13 | S513-A-IP22 | UPS-4/-C | Yes |
levinson | Abhi | Manila: Meyrin CephFS SSD A | cephlevinson (hw) | pacific | 16.2.9-1 | RHEL8 | BA03 BA04 BA05 BA07 | S513-A-IP120 S513-A-IP119 S513-A-IP121 S513-A-IP122 | UPS-4/-C | Yes |
meredith | Enrico | Cinder: io2, io3 | cephmeredith (hw) | pacific | 16.2.9-1 | RHEL8 | CK01-23 | S513-C-IP562 | UPS-2 | Yes |
nethub | Enrico | S3 FR + Cinder FR | cephnethub (hw) | pacific | 16.2.13-5 | RHEL8 | HA06-HA09 HB01-HB06 | S773-C-SI180 S773-C-IP200 | EOD104,ESK404 EOD105 (CEPH-1519) | Yes |
pam | Abhi | Manila: Meyrin CephFS B | cephpam (hw) | pacific | 16.2.9-1 | Alma8 | CP16-19 | S513-C-IP200 | UPS-1 | Yes |
poc | Enrico | PCC Proof of Concept (CEPH-1382) | cephpoc (hyperc) | pacific | 16.2.9-2 | RHEL8 | SU06 | S513-V-SI263 | No | |
ryan | Enrico | Cinder: 3rd AZ | cephryan (hw) | pacific | 16.2.9-1 | RHEL8 | CE01-CE03 | S513-C-IP501 | UPS-2 | Yes |
stanley | Zachary | S3 multi-site, Meyrin | cephstanmey (hw) | pacific | 17.2.5 | RHEL8 | CP16-24 | S513-C-IP200 | UPS-1 | No |
stanley | Zachary | S3 multi-site, Nethub | cephstannet (hw) | pacific | 17.2.5 | Alma8 | HB01-HB06 | S773-C-IP200 | EOD105/0E | No |
toby | Enrico | Stretch cluster | cephtoby (hw) | pacific | 16.2.9-1 | RHEL8 | CP16-19 SJ04-07 | S513-C-IP200 S513-V-IP553 | UPS-1 UPS-3 | No |
vance | Enrico | Manila: HPC Theory-QCD | cephvance (hw) | pacific | 16.2.9-1 | Alma8 | CP16-CP17, CP19, CP21, CP23-CP24 | S513-C-IP200 | UPS-1 | No, Nils Hoimyr |
wallace | Enrico | krbd: Oracle DB restore tests | cephwallace (hw) | pacific | 16.2.9-2 | RHEL8 | CP18, CP20, CP22 | S513-C-IP200 | UPS-1 | No, Sebastien Masson |
vault | Enrico | Cinder: 2nd AZ | cephvault (hw) | pacific | 16.2.9-1 | RHEL8 | SE04-SE07 | S513-V-IP808 | UPS-1 | Yes |
Flax locations details:
- MONs: 3x OpenStack VMs, one in each availability zone
- MDSes (CPU servers): 50% in barn, 50% in vault
  - cephcpu21-0c370531cf, SQ05, S513-V-IP562, UPS 1 (EOD1*43)
  - cephcpu21-2456968853, SQ05, S513-V-IP562, UPS 1 (EOD1*43)
  - cephcpu21-46bb400fc8, BA10, S513-A-IP558
  - cephcpu21-4a93514bf3, BA10, S513-A-IP558
  - cephcpu21b-417b05bfee, BA10, S513-A-IP558
  - cephcpu21b-4ad1d0ae5f, SQ05, S513-V-IP562, UPS 1 (EOD1*43)
  - cephcpu21b-a703fac16c, SQ05, S513-V-IP562, UPS 1 (EOD1*43)
  - cephcpu21b-aecbee75a5, BA10, S513-A-IP558
- Metadata pool: Main room, UPS-1 EOD1*43
- Data pool: Vault, UPS-3 EOD3*43
Each production cluster has a designated cluster lead, who is the primary contact and responsible for that cluster.
The user-visible "services" provided by the clusters are documented in our Service Availability probe: https://gitlab.cern.ch/ai/it-puppet-hostgroup-ceph/-/blob/qa/code/files/sls/ceph-availability-producer.py#L19
The QoS provided by each user-visible cluster is described in OpenStack docs. Cinder volumes available on multiple AZs are of standard and io1 types.
s3.cern.ch RGWs
Hostname | Customer | IPv4 | IPv6 | IPsvc VM | IPsvc Real | Runs on | OpenStack AZ | Room | Rack | Power |
---|---|---|---|---|---|---|---|---|---|---|
cephgabe-rgwxl-325de0fb1d | cvmfs | 137.138.152.241 | 2001:1458:d00:13::1e5 | S513-C-VM33 | 0513-C-IP33 | P06636663U66968 | cern-geneva-a | main | CH14 | UPS-3 |
cephgabe-rgwxl-86d4c90cc6 | cvmfs | 137.138.33.24 | 2001:1458:d00:18::390 | S513-V-VM936 | 0513-V-IP35 | P06636688Q51842 | cern-geneva-b | vault | SQ27 | UPS-4 |
cephgabe-rgwxl-8930fc00f8 | cvmfs | 137.138.151.203 | 2001:1458:d00:12::3e0 | S513-C-VM32 | 0513-C-IP32 | P06636663N63480 | cern-geneva-c | main | CH11 | UPS-3 |
cephgabe-rgwxl-8ee4a698b7 | cvmfs | 137.138.44.245 | 2001:1458:d00:1a::24b | S513-C-VM933 | 0513-C-IP33 | P06636663J50924 | cern-geneva-a | main | CH16 | UPS-3 |
cephgabe-rgwxl-3e0d67a086 | default | 188.184.73.131 | 2001:1458:d00:4e::100:4ae | S513-A-VM805 | 0513-A-IP561 | I82006520073152 | cern-geneva-c | barn | BC11 | UPS-4/-C |
cephgabe-rgwxl-652059ccf1 | default | 188.185.87.72 | 2001:1458:d00:3f::100:2bd | S513-A-VM559 | 0513-A-IP559 | I82006525008611 | cern-geneva-a | barn | BC06 | UPS-4/-C |
cephgabe-rgwxl-8e7682cb81 | default | 137.138.158.145 | 2001:1458:d00:14::341 | S513-V-VM35 | 0513-V-IP35 | P06636688R71189 | cern-geneva-b | vault | SQ28 | UPS-4 |
cephgabe-rgwxl-91b6e0d6dd | default | 137.138.77.21 | 2001:1458:d00:1c::405 | S513-C-VM931 | 0513-C-IP33 | P06636663M67468 | cern-geneva-a | main | CH13 | UPS-3 |
cephgabe-rgwxl-895920ea1a | gitlab | 137.138.158.221 | 2001:1458:d00:14::299 | S513-V-VM35 | 0513-V-IP35 | P06636688H41037 | cern-geneva-b | vault | SQ29 | UPS-4 |
cephgabe-rgwxl-9e3981c77a | gitlab | 137.138.154.49 | 2001:1458:d00:13::3a | S513-C-VM33 | 0513-C-IP33 | P06636663J50924 | cern-geneva-a | main | CH16 | UPS-3 |
cephgabe-rgwxl-dbb0bcc513 | gitlab | 188.184.102.175 | 2001:1458:d00:3b::100:2a9 | S513-C-VM852 | 0513-C-IP852 | I78724428177369 | cern-geneva-c | main | EK03 | UPS-2 |
cephgabe-rgwxl-26774321ac | jec-data | 188.185.10.120 | 2001:1458:d00:63::100:39a | S513-V-VM902 | 0513-V-IP402 | I88681450454656 | cern-geneva-a | vault | SP23 | UPS-4 |
cephgabe-rgwxl-a273d35b9d | jec-data | 188.185.19.171 | 2001:1458:d00:65::100:32a | S513-V-VM406 | S513-V-IP406 | I88681458914473 | cern-geneva-b | vault | SP27 | UPS-4 |
cephgabe-rgwxl-d91c221898 | jec-data | 137.138.155.51 | 2001:1458:d00:13::14d | S513-C-VM33 | 0513-C-IP33 | P06636663Y16806 | cern-geneva-a | main | CH15 | UPS-3 |
cephgabe-rgwxl-75569ebe5c | prometheus | 137.138.149.253 | 2001:1458:d00:12::52f | S513-C-VM32 | 0513-C-IP32 | P06636663G98563 | cern-geneva-c | main | CH04 | UPS-3 |
cephgabe-rgwxl-7658b46c78 | prometheus | 188.185.9.237 | 2001:1458:d00:63::100:424 | S513-V-VM902 | 0513-V-IP402 | I88681457779137 | cern-geneva-a | vault | SP24 | UPS-4 |
cephgabe-rgwxl-05386c6cdb | vistar | 188.185.86.117 | 2001:1458:d00:3f::100:2d9 | S513-A-VM559 | 0513-A-IP559 | I82006526449210 | cern-geneva-a | barn | BC05 | UPS-4/-C |
cephgabe-rgwxl-13f36a01c2 | vistar | 137.138.33.10 | 2001:1458:d00:18::1ee | S513-V-VM936 | 0513-V-IP35 | P06636688C41209 | cern-geneva-b | vault | SQ29 | UPS-4 |
cephgabe-rgwxl-6da6da7653 | vistar | 188.184.74.136 | 2001:1458:d00:4e::100:5d | S513-A-VM805 | 0513-A-IP561 | I82006527765435 | cern-geneva-c | barn | BC13 | UPS-4/-C |
Reviewing a Cluster Status
- Check Grafana dashboards for unusual activity, patterns, memory usage:
- https://filer-carbon.cern.ch/grafana/d/000000001/ceph-dashboard
- https://filer-carbon.cern.ch/grafana/d/000000108/ceph-osd-mempools
- https://filer-carbon.cern.ch/grafana/d/uHevna1Mk/ceph-hosts
- For RGWs: https://filer-carbon.cern.ch/grafana/d/iyLKxjoGk/s3-rgw-perf-dumps
- For CephFS: https://filer-carbon.cern.ch/grafana/d/000000111/cephfs-detail
- etc...
- Login to cluster mon and check various things:
  - `ceph osd pool ls detail` - are the pool flags correct? e.g. `nodelete,nopgchange,nosizechange`
  - `ceph df` - assess the amount of free space for capacity planning
  - `ceph osd crush rule ls`, `ceph osd crush rule dump` - are the crush rules as expected?
  - `ceph balancer status` - as expected?
  - `ceph osd df tree` - are the PGs per OSD balanced and a reasonable number, e.g. < 100?
  - `ceph osd tree out`, `ceph osd tree down` - are there any OSDs that are not being replaced properly?
  - `ceph config dump` - is the configuration as expected?
  - `ceph telemetry status` - check from the config whether it is on; if not, enable it

A combined sketch of these checks follows below.
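A minimal review sketch collecting the read-only checks above into one run (to be executed on a mon of the cluster):

```sh
#!/bin/bash
# Quick cluster review: runs the read-only commands from the checklist above
for cmd in \
    "ceph -s" \
    "ceph osd pool ls detail" \
    "ceph df" \
    "ceph osd crush rule ls" \
    "ceph balancer status" \
    "ceph osd df tree" \
    "ceph osd tree out" \
    "ceph osd tree down" \
    "ceph config dump" \
    "ceph telemetry status"; do
    echo "### ${cmd}"
    ${cmd}
done
```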
Clusters' priority
In case of a major incident (e.g., power cuts), revive clusters in the following order:
- Beesly (RBD1, main, UPS-3/4), Flax (CephFS, everywhere), Gabe (S3, vault, UPS-1/3)
- Vault (RBD2, vault, UPS-1), Levinson (CephFS SSD, vault, UPS-1), Meredith (RBD SSD, main, UPS-2)
- Ryan (RBD3, main, UPS-2), CTA (ObjectStore, vault, UPS-1)
- Jim, Dwight, Kelly, Pam (currently unused)
- Barn, Kopano -- should not go down, as they are in critical power
- NetHub -- 2nd network hub, Prevessin, diesel-backed (9/10 racks)
Hardware Specs
Test clusters
Cluster | Use-case | Mon alias | Release | Version | Notes |
---|---|---|---|---|---|
cslab | Test cluster for Network Lab (RQF2068297,CEPH-1348) | cephcslab | pacific | 16.2.9-1 | Binds to IPv6 only; 3 hosts Alma8 + 3 RHEL8 |
miniflax | Mini cluster mimicking Flax | None (ceph/miniflax/mon) | pacific | 16.2.9-2 | |
minigabe | Mini cluster mimicking Gabe (zone groups) | cephminigabe | pacific | 16.2.9-2 | RGW on minigabe-831ffcf9f9; Beast on 8080; RGW DNS: cephminigabe |
octopus | Testing | cephoctopus-1 | pacific | 16.2.9-1 | |
next | RC and Cloud next region testing | cephnext01 | quincy | 17.2.6-4 |
Preparing a new delivery
Flavor per rack
We now want to have flavors per rack for our Ceph clusters; please remind the Ironic/CF people to do that when a new delivery is installed!
Setting root device hints
We set root device hints on every new delivery so that we can be certain that Ironic installs the OS on the right drive (and if the corresponding drive fails the installation also fails).
There are multiple ways to set root device hints (see the OpenStack documentation). For our recent deliveries setting the model is typically sufficient to have only one possible drive for the root device.
To get the model of the drive you have to boot a node and get it from `/sys/class`, for instance: `cat /sys/class/block/nvme0n1/device/model` (you may also ask to get access to Ironic inspection data if it gets more complicated than that).
Then you can set the model on every node of the delivery.
For instance, for delivery `dl8642293` you would do:
export OS_PROJECT_NAME="IT Ceph Ironic"
openstack baremetal node list -f value | \
grep dl8642293 | awk '{print $1}' | \
xargs -L1 openstack baremetal node set --property root_device='{"model": "SAMSUNG MZ1LB960HAJQ-00007"}'
If it looks correct, pipe the output to shell to actually set the root device hints.
Check the root device hints were correctly set with:
export OS_PROJECT_NAME="IT Ceph Ironic"
openstack baremetal node list -f value | \
grep dl8642293 | awk '{print $1}' | \
xargs -L1 openstack baremetal node show -f json | jq .properties.root_device
Ceph Monitoring
About Ceph Monitoring
The monitoring system in Ceph is based on Grafana, using Prometheus as datasource and the native ceph prometheus plugin as
metric exporter. Prometheus node_exporter
is used for node metrics (cpu, memory, etc).
For long-term metric storage, Thanos is used to store metrics in S3 (Meyrin)
Access the monitoring system
- All Ceph monitoring dashboards are available in monit-grafana (Prometheus) and filer-carbon (Graphite - Legacy)
- The prometheus server is configured on the host `cephprom.cern.ch`, hostgroup `ceph/prometheus`
- Configuration files (Puppet):
  - `it-puppet-hostgroup-ceph/code/manifests/prometheus.pp`
  - `it-puppet-hostgroup-ceph/data/hostgroup/ceph/prometheus.yaml`
  - `it-puppet-hostgroup-ceph/data/hostgroup/ceph.yaml`
  - Alertmanager templates: `it-puppet-hostgroup-ceph/code/files/prometheus/am-templates/ceph.tmpl`
  - Alert definition: `it-puppet-hostgroup-ceph/code/files/generated_rules/`
- Thanos infrastructure is under the `ceph/thanos` hostgroup, configured via the corresponding hiera files.

An analogous `qa` infrastructure is also available, with all components replicated (cephprom-qa, thanos-store-qa, etc). This `qa` infra is configured by overriding the puppet environment in `it-puppet-hostgroup-ceph/data/hostgroup/ceph/environments/qa.yaml`.
Add/remove a cluster to/from the monitoring system
- Enable the prometheus mgr module in the cluster: `ceph mgr module enable prometheus`
  NOTE: Make sure that the port `9283` is accepting connections.
- Instances that include the `hg_ceph::classes::mgr` class will be automatically discovered through puppetdb and scraped by prometheus.
  - To ensure that we don't lose metrics during mgr failovers, all the cluster mgr's will be scraped. As a side benefit, we can monitor the online status of the mgr's.
- Run or wait for a puppet run on `cephprom.cern.ch` (see the verification sketch below).
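A quick verification sketch, to be run on one of the cluster's mgr hosts (assumes `curl` is available locally):

```sh
# is the prometheus mgr module enabled?
ceph mgr module ls | grep -i prometheus

# is the exporter answering on port 9283?
curl -s http://localhost:9283/metrics | head
```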
Add/remove a node for node metrics (cpu, memory, etc)
Instances that include the `prometheus::node_exporter` class (anything under the `ceph` top hostgroup) will be automatically discovered through puppetdb and scraped by prometheus.
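A minimal spot-check that a node is exposing node metrics (node_exporter listens on port 9100, as in the alert example above; the hostname is a placeholder):

```sh
curl -s http://cephfoo-mon-0.cern.ch:9100/metrics | head
```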
Add/remove an alert rule to/from the monitoring system
Alerts are defined in yaml
files managed by puppet in:
it-puppet-hostgroup-ceph/files/prometheus/generated_rules
They are organised by service, so add the alert in the appropriate file (e.g. ceph alerts in `alerts_ceph.yaml`). The file `rules.yaml` is used to add recording rules.
There are 3 notification channels currently: e-mail, SNOW ticket and Mattermost message.
Before creating the alert, make sure you test your query in advance, for example using the Explore panel on Grafana. Once the query is working, proceed with the alert definition.
A prometheus alert could look like this:
rules:
- alert: "CephOSDReadErrors"
annotations:
description: "An OSD has encountered read errors, but the OSD has recovered by retrying the reads. This may indicate an issue with hardware or the kernel."
documentation: "https://docs.ceph.com/en/latest/rados/operations/health-checks#bluestore-spurious-read-errors"
summary: "Device read errors detected on cluster {{ $labels.cluster }}"
expr: "ceph_health_detail{name=\"BLUESTORE_SPURIOUS_READ_ERRORS\"} == 1"
for: "30s"
labels:
severity: "warning"
type: "ceph_default"
- `alert`: Mandatory. Name of the alert, which will be part of the subject of the email, head of the ticket and title of the mattermost notification. Try to follow the same pattern as the ones already created, `CephDAEMONAlert`: daemon in uppercase and the rest in camel case.
- `expr`: Mandatory. PromQL query that defines the alert. The alert will trigger if the query returns one or more matches. It's a good exercise to use `promdash` for tuning the query to ensure that it is well formed.
- `for`: Mandatory. The alert will be triggered if it stays active for more than the specified time (e.g. `30s`, `1m`, `1h`).
- `annotations:summary`: Mandatory. Expresses the actual alert in a concise way.
- `annotations:description`: Optional. Allows to specify more detailed information about the alert when the summary is not enough.
- `annotations:documentation`: Optional. Allows to specify the url of the documentation/procedure to follow to handle the alert.
- `labels:severity`: Mandatory. Defines the notification channel to use, based on the following:
  - `warning`/`critical`: Sends an e-mail to ceph-alerts.
  - `ticket`: Sends an e-mail AND creates an SNOW ticket.
  - `mattermost`: Sends an e-mail AND sends a Mattermost message to the ceph-bot channel.
- `labels:type`: Optional. Allows to distinguish alerts created upstream (`ceph_default`) from those created by us (`ceph_cern`). It has no actual implication on the alert functionality.
- `labels:xxxxx`: Optional. You can add custom labels that could be used on the template.
NOTES
- In order for the templating to work as expected, make sure that the labels `cluster` or `job_name` are part of the resulting query. In case the query does not preserve labels (like `count`), you can specify the label and value manually in the `labels` section of the alert definition.
- All annotations, if defined, will appear in the body of the ticket, e-mail or mattermost message generated by the alert.
- Alerts are evaluated against the local prometheus server, which contains metrics for the last 7 days. Take that into account while defining alerts that evaluate longer periods (like `predict_linear`). In such cases, you can create the alert in Grafana using the Thanos-LTMS metric datasource (more on that later in this doc).
- In `grafana` or `promdash` you can access the alerts by querying the metric called `ALERTS`.
- For more information about how to define an alert, refer to the Prometheus Documentation
Create / Link procedure/documentation to Prometheus Alert.
Prometheus alerts are pre-configured to show the procedure needed for handling the alert via the annotation procedure_url
. This is an optional argument that could be configured per alert rule.
Step 1: Create the procedure in case it does not exist yet.
Update the file rota.md
on this repository and add the new procedure. Use this file for convenience, but you can create a new file if needed.
Step 2: Edit the alert rule and link to the procedure.
Edit the alert following instructions above, and add the link to the procedure under the annotations
section, under the key documentation
, for example:
- alert: "CephMdsTooManyStrays"
annotations:
documentation: "http://s3-website.cern.ch/cephdocs/ops/rota.html#cephmdstoomanystrays"
summary: "The number of strays is above 500K"
expr: "ceph_mds_cache_num_strays > 500000"
for: "5m"
labels:
severity: "ticket"
Push the changes and prometheus server will reload automatically picking the new changes. Next time the alert is triggered, a link to the procedure will be shown in the alert body.
Silence Alarms
You can use the alertmanager Web Interface to silence alarms during scheduled interventions. Please always specify a reason for silencing the alarms (a JIRA link or ticket would be a plus). Additionally, for the alerts that generate an e-mail, you will find a link to silence it in the email body.
Alert Grouping
Alert grouping is enabled by default, so if the same alert is triggered in different nodes, we only receive one ticket with all involved nodes.
Modifying AlertManager Templates
Both email and Snow Ticket templates are customizable. For doing that, you need to edit the following puppet file:
it-puppet-hostgroup-ceph/code/files/prometheus/am-templates/ceph.tmpl
You have to use Golang's Template syntax. The structure of the file is as follows:
{{ define "ceph.email.subject" }}
....
{{ end }}
{{ define "ceph.email.body" }}
....
{{ end }}
For reference check the default AlertManager Templates
In case you add templates make sure that you adapt the AlertManager
configuration accordingly:
- name: email
email_configs:
- to: ceph-admins@cern.ch
from: alertmanager@localhost
smarthost: cernmx.cern.ch:25
headers:
Subject: '{{ template "ceph.email.subject" . }}'
html: '{{ template "ceph.email.body" . }}'
Note A restart of AlertManager is needed for the changes to be applied.
Accessing the prometheus dashboard (promdash)
The prometheus dashboard, or `Dashprom`, is a powerful interface that allows to quickly assess the prometheus server status and also provides a quick way of querying metrics. The prometheus dashboard is accessible from this link: Promdash.
- The prometheus dashboard is useful for:
- Checking the status of all targets: Target status
- Check the status of the alerts Alert Status
- For debug purposes, you can execute PromQL queries directly on the dashboard and change the intervals quickly.
- In grafana there is an icon just near the metric definition to view the current query in promdash.
- You can also use the Grafana Explorer.
Note: This will only give you access to the metrics of the last 7 days, refer to the next chapter for accessing older metrics.
Long Term Metric Storage - LTMS
The long term storage metrics are kept in the CERN S3 service using Thanos. The bucket is called `prometheus-storage` and is accessed using the EC2 credentials of Ceph's Openstack project. Accessing these metrics is transparent from Grafana:
- Metrics of the last 7 days are served directly from prometheus local storage
- Older metrics are pulled from S3.
- As the metrics in S3 contain downsampled versions (5m, 1h), this is usually much faster than getting metrics from the local prometheus.
- RAW metrics are also kept, so it is possible to zoom in to the 15-second resolution
Accessing the thanos dashboard
There is a thanos `promdash` version here, from where you can access all historical metrics. This dashboard has some specific thanos features, like deduplication (for use cases with more than one prometheus server scraping the same data) and the possibility of showing downsampled data (thanos stores two downsampled versions of the metrics, with 1h and 5m resolution). This downsampled data is also stored in S3.
Thanos Architecture
You can find more detailed information in Thanos official webpage, but these are the list of active components in our current setup and the high level description of what they do:
Sidecar
- Every time Prometheus dumps the data to disk (by default, every 2 hours), the `thanos-sidecar` uploads the metrics to the S3 bucket. It also acts as a proxy that serves Prometheus's local data.
Store
- This is the storage proxy which serves the metrics stored in S3
Querier
- This component reads the data from the `store(s)` and `sidecar(s)` and answers PromQL queries using the standard Prometheus HTTP API. This is the component monitoring dashboards have to point to (see the example after this list).
Compactor
- This is a detached component which compacts the data in S3 and also creates the downsampled versions.
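For illustration, the Querier can be queried directly over HTTP with the standard Prometheus API; a minimal sketch (the querier hostname and port are placeholders, `dedup` is a Thanos-specific parameter):

```sh
curl -sG 'http://<thanos-querier>:9090/api/v1/query' \
     --data-urlencode 'query=ALERTS{alertstate="firing"}' \
     --data-urlencode 'dedup=true'
```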
Operating the Ceph Monitors (ceph-mon)
Adding ceph-mon daemons (VM, jewel/luminous)
Upstream documentation at http://docs.ceph.com/docs/master/rados/operations/add-or-rm-mons/
Create the machine for the mon
Normally we create ceph-mon's as VMs in the ceph/{hg_name}/mon hostgroup.
Example: Adding a monitor to the ceph/test cluster:
- First, source the IT Ceph Storage Service environment on aiadm: link
- Then create a virtual machine with the following parameters:
- main-user/responsible: ceph-admins (the user of the VM)
- VM Flavor: m2.2xlarge (monitors must withstand heavy loads)
- OS: Centos7 (the preferred OS used in CERN applications)
- Hostgroup: ceph/test/mon (Naming convention for puppet configuration)
- VM name: cephtest-mon- (We use prefix to generate an id)
- Availability zone: usually cern-geneva-[a/b/c]
Example command: (It will create a VM with the above parameters)
$ ai-bs --landb-mainuser ceph-admins --landb-responsible ceph-admins --nova-flavor m2.2xlarge
--cc7 -g ceph/test/mon --prefix cephtest-mon- --nova-availabilityzone cern-geneva-a
--nova-sshkey {your_openstack_key}
This command will create a VM named cephtest-mon-XXXXXXXXXX
in the ceph/test/mon
hostgroup. Puppet will take care of the initialization of the machine
When you deploy a monitor server, you have to choose an availability zone. We tend to use different availability zones to avoid a single point of failure.
Set roger state and enable alarming
Set the appstate
and app_alarmed
parameters if necessary
Example: Get the roger data for the VM cephtest-mon-d8788e3256
$ roger show cephtest-mon-d8788e3256
The output should be something similar to this:
[
{
"app_alarmed": false,
"appstate": "build",
"expires": "",
"hostname": "cephtest-mon-d8788e3256.cern.ch",
"hw_alarmed": true,
"message": "",
"nc_alarmed": true,
"os_alarmed": true,
"update_time": "1506418702",
"update_time_str": "Tue Sep 26 11:38:22 2017",
"updated_by": "tmourati",
"updated_by_puppet": false
}
]
You need to set the machine's state to "production", so it can be used in production.
The following command will set the target VM to production state:
$ roger update --appstate production --all_alarms=true cephtest-mon-XXXXXXXXXX
Now the roger show {host}
should show something like this:
[
{
"app_alarmed": true,
"appstate": "production",
"..."
}
]
We now let puppet configure the machine. This will take some time, as puppet needs about two configuration cycles to apply the desired changes. After the second cycle you can SSH (as root) to the machine to check if everything is ok.
For example you can check the cluster's status with $ ceph -s
You should see the current host in the monitor quorum.
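A quick way to confirm the new mon has joined the quorum (read-only commands):

```sh
ceph mon stat
ceph quorum_status --format json-pretty
```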
Details on lbalias for mons
We prefer not to use load-balancing service and lbclient
here (https://configdocs.web.cern.ch/dnslb/).
There is no scenario in ceph where we want a mon to disappear from the alias.
For a bare metal node
We rather use the `--load-N` approach to create the alias with all the mons:
- Go to `network.cern.ch`
- Click on `Update information` and use the FQDN of the mon machine
  - If prompted, make sure you pick the host interface and not the IPMI one
- Add "ceph{hg_name}--LOAD-N-" to the list of IP Aliases under TCP/IP Interface Information
  - Multiple aliases are supported. Use a comma-separated list
- Check the changes are correct and submit the request
For an OpenStack VM
In the case of a VM, we can't directly set an alias, but we can set a property in OpenStack to the same effect:
- Log onto aiadm or lxplus
- Set your environment variables to the correct tenant, e.g. `eval $(ai-rc 'Ceph Development')`
- Check the vars are what you expect with `env | grep OS`, paying attention to `OS_region`
- Set the alias using openstack with `openstack server set --property landb-alias=CEPH{hg_name}--LOAD-N- {hostname}` (see the check below)
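To verify the property was applied, a quick check (using the same tenant environment):

```sh
openstack server show {hostname} -c properties
```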
Removing a ceph-mon daemon (jewel)
Upstream documentation at http://docs.ceph.com/docs/master/rados/operations/add-or-rm-mons/
Prerequisites
- The cluster must be in `HEALTH_OK` state, i.e. the monitor must be in a healthy quorum.
- You should have a replacement for the current monitor already in the quorum, and there should be enough monitors left so that the cluster stays healthy after one monitor is removed. Normally this means that we should have about 4 monitors in the quorum before starting.
Procedure
- Disable puppet:
$ puppet agent --disable 'decommissioning mon'
- (If needed) remove the DNS alias from this machine and wait until it is so:
- For physical machines, visit http://network.cern.ch → "Update Information".
- For a VM monitor, you can remove the alias from the `landb-alias` property. See [Cloud Docs](https://clouddocs.web.cern.ch/clouddocs/using_openstack/properties.html)
- Check if monitor is ok-to-stop:
$ ceph mon ok-to-stop <hostname>
- Stop the monitor: `$ systemctl stop ceph-mon.target`. You should now get a `HEALTH_WARN` status by running `$ ceph -s`, for example `1 mons down, quorum 1,2,3,4,5`.
. - Remove the monitor's configuration, data and secrets with:
```sh
$ rm /var/lib/ceph/tmp/keyring.mon.*
$ rm -rf /var/lib/ceph/mon/<hostname>
```
- Remove the monitor from the ceph cluster:
```sh
$ ceph mon rm <hostname>
removing mon.<hostname> at <IP>:<port>, there will be 5 monitors
```
- You should now have a `HEALTH_OK` status after the monitor removal.
- (If monitored by prometheus) remove the hostname from the list of endpoints to monitor. See it-puppet-hostgroup-ceph/data/hostgroup/ceph/prometheus.yaml
For machines hosting uniquely the ceph mon
- Move this machine to a spare hostgroup:
  `$ ai-foreman updatehost -c ceph/spare {hostname}`
- Run puppet once:
  `$ puppet agent -t`
- (If physical) Reinstall the server in the `ceph/spare` hostgroup:
```sh
aiadm> ai-installhost p01001532077488
...
1/1 machine(s) ready to be installed
Please reboot the host(s) to start the installation:
ai-remote-power-control cycle p01001532077488.cern.ch
aiadm> ai-remote-power-control cycle p01001532077488.cern.ch
```
Now the physical machine is installed in the ceph/spare
hostgroup.
- (If virtual) Kill the vm with:
$ ai-kill-vm {hostname}
For machines hosting other ceph-daemons
- Move this machine to another hostgroup (e.g., `/osd`) of the same cluster:
  `$ ai-foreman updatehost -c ceph/<cluster_name>/osd {hostname}`
- Run puppet to apply the changes:
$ puppet agent -t
Operating the Ceph Metadata Servers (ceph-mds)
Adding a ceph-mds daemon (VM, luminous)
Upstream documentation here: http://docs.ceph.com/docs/master/rados/deployment/ceph-deploy-mds/
The procedure follows the same pattern as adding a monitor node (create_a_mon) to the cluster.
Make sure you add your mds to the corresponding hostgroup ceph/<cluster>/mds
and prepare
the Puppet code (check other ceph clusters with cephfs as a reference)
Example for the ceph/mycluster
hostgroup:
$ ai-bs --landb-mainuser ceph-admins --landb-responsible ceph-admins \
--nova-flavor m2.2xlarge --cc7 -g ceph/<mycluster>/mds --prefix ceph<mycluster>-mds- \
--nova-availabilityzone cern-geneva-a
Note: When deploying more than one mds, make sure that they are spread across different availability zones.
As written in the upstream documentation, a ceph filesystem needs at least two metadata servers. The first will be the main server that will handle the clients' requests and the second one is the backup. Don't forget also to put the metadata servers into different availability zones, in case some problem occurs to a site.
Because of resource limitations, the flavor of the machines could be m2.xlarge
instead of m2.2xlarge
. In the ceph/<mycluster>
cluster we use 2 m2.2xlarge
main
servers and one m2.xlarge
backup server.
When the machine is available (reachable by the dns service), you can alter its
state into production with roger
.
$ roger update --appstate production --all_alarms=true ceph<mycluster>-mds-XXXXXXXXXX
After 2-3 runs of puppet, the mds should be configured and running.
Using additional metadata servers (luminous)
Upstream documentation here: http://docs.ceph.com/docs/master/cephfs/multimds/
When your cephfs system can't handle the amount of client requests and you notice warnings about the mds or its requests in `ceph status`, you may need to use multiple active metadata servers.
After adding an mds to the cluster, you will notice on `ceph status`, on the mds line, something like the following:
mds: cephfs-1/1/1 up {0=cephironic-mds-716dc88600=up:active}, 1 up:standby-replay, 1 up:standby
The `1 up:standby-replay` is the backup server and the `1 up:standby` that has recently appeared is the mds we just added. To make the standby server active, we need to execute the following:
WARNING: Your cluster may have multiple filesystems, use the right one!
ceph fs set <fs_name> max_mds 2
The name of the ceph filesystem can be retrieved by using $ ceph fs ls
and looking
for the name: <fs_name>
key-value pair.
Now your ceph status message should look like this:
...
mds: cephfs-2/2/2 up {0=cephironic-mds-716dc88600=up:active,1=cephironic-mds-c4fbd7ee74=up:active}, 1 up:standby-replay
...
OSD Replacement Procedures
Check which disks need to be put back in.
- To see which osds are down, check with
ceph osd tree down out
.
[09:28][root@p06253939p44623 (production:ceph/beesly/osd*24) ~]# ceph osd tree down out
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 5589.18994 root default
-2 4428.02979 room 0513-R-0050
-6 917.25500 rack RA09
-7 131.03999 host p06253939j03957
430 5.45999 osd.430 down 0 1.00000
-19 131.03999 host p06253939s09190
24 5.45999 osd.24 down 0 1.00000
405 5.45999 osd.405 down 0 1.00000
-9 786.23901 rack RA13
-11 131.03999 host p06253939b84659
101 5.45999 osd.101 down 0 1.00000
-32 131.03999 host p06253939u19068
577 5.45999 osd.577 down 0 1.00000
-14 895.43903 rack RA17
-34 125.58000 host p06253939f99921
742 5.45999 osd.742 down 0 1.00000
-22 125.58000 host p06253939h70655
646 5.45999 osd.646 down 0 1.00000
659 5.45999 osd.659 down 0 1.00000
718 5.45999 osd.718 down 0 1.00000
-26 131.03999 host p06253939v20205
650 5.45999 osd.650 down 0 1.00000
-33 131.03999 host p06253939w66726
362 5.45999 osd.362 down 0 1.00000
654 5.45999 osd.654 down 0 1.00000
- Check the tickets for the machines in Service Now. The ones of interest are named `[GNI] exception.scsi_blockdevice_driver_error_reported` or `exception.nonwriteable_filesystems`.
  - If the repair service replaced the disk(s), it will be written in the ticket, so you can continue to the next step.
On the OSD:
LVM formatting using ceph-volume
- Simple format: osd as logical volume of one disk
This is a sample output of listing the disks in lvm fashion. You will notice the number of devices (disks) in each osd is one. Also these devices don't use any ssds for performance boost.
(Ceph volume listing takes some time to complete)
[13:55][root@p05972678e21448 (production:ceph/erin/osd*30) ~]# ceph-volume lvm list
===== osd.335 ======
[block] /dev/ceph-d0befafb-3e7c-4ffc-ab25-ef01c48e69ac/osd-block-c2fd8d2e-8a38-42b7-a03c-9285f2b973ba
type block
osd id 335
cluster fsid eecca9ab-161c-474c-9521-0e5118612dbb
cluster name ceph
osd fsid c2fd8d2e-8a38-42b7-a03c-9285f2b973ba
encrypted 0
cephx lockbox secret
block uuid PXCHQW-4aXo-isAR-NdYU-3FQ2-E18Q-whJa92
block device /dev/ceph-d0befafb-3e7c-4ffc-ab25-ef01c48e69ac/osd-block-c2fd8d2e-8a38-42b7-a03c-9285f2b973ba
vdo 0
crush device class None
devices /dev/sdw
===== osd.311 ======
[block] /dev/ceph-b7be4aa7-0c7f-4786-a214-66116420a2cc/osd-block-1bfad506-c450-4116-8ba5-ac356be87a9e
type block
osd id 311
cluster fsid eecca9ab-161c-474c-9521-0e5118612dbb
cluster name ceph
osd fsid 1bfad506-c450-4116-8ba5-ac356be87a9e
encrypted 0
cephx lockbox secret
block uuid O5fYcf-aGW8-NVWC-lr5G-BsuC-Yx3H-WZl24a
block device /dev/ceph-b7be4aa7-0c7f-4786-a214-66116420a2cc/osd-block-1bfad506-c450-4116-8ba5-ac356be87a9e
vdo 0
crush device class None
devices /dev/sdt
This is an example of an osd that uses an ssd for its metadata. It has a db part in which the metadata is stored.
[14:04][root@p06253939e35392 (production:ceph/dwight/osd*24) ~]# ceph-volume lvm list
====== osd.29 ======
[block] /dev/ceph-f06dffa2-a9d8-47da-9af2-b4d4f9260557/osd-block-dff889e7-5db5-4c5e-9aab-151e8ad17b48
type block
osd id 29
cluster fsid dd535a7e-4647-4bee-853d-f34112615f81
cluster name ceph
osd fsid dff889e7-5db5-4c5e-9aab-151e8ad17b48
db device /dev/sdac3
encrypted 0
db uuid 9762cd49-8f1c-4c29-88ca-ff78f6bdd35c
cephx lockbox secret
block uuid HuzwbL-mVvi-Ubve-1C5D-fjeh-dmZq-ivNNnY
block device /dev/ceph-f06dffa2-a9d8-47da-9af2-b4d4f9260557/osd-block-dff889e7-5db5-4c5e-9aab-151e8ad17b48
crush device class None
devices /dev/sdk
[ db] /dev/sdac3
PARTUUID 9762cd49-8f1c-4c29-88ca-ff78f6bdd35c
====== osd.88 ======
[block] /dev/ceph-14f3cd75-764b-4fa4-96a3-c2976d0ad0e5/osd-block-f19541f6-42b2-4612-a700-ec5ac8ed4558
type block
osd id 88
cluster fsid dd535a7e-4647-4bee-853d-f34112615f81
cluster name ceph
osd fsid f19541f6-42b2-4612-a700-ec5ac8ed4558
db device /dev/sdab6
encrypted 0
db uuid f0b652e1-0161-4583-a50b-45a0a2348e9a
cephx lockbox secret
block uuid cHqcZG-wsON-P9Lw-4pTa-R1pd-GUwR-iqCMBg
block device /dev/ceph-14f3cd75-764b-4fa4-96a3-c2976d0ad0e5/osd-block-f19541f6-42b2-4612-a700-ec5ac8ed4558
crush device class None
devices /dev/sdu
[ db] /dev/sdab6
PARTUUID f0b652e1-0161-4583-a50b-45a0a2348e9a
One way is to have an ssd and do simple partitioning. Each partition will be attached to an osd. If the ssd part is broken, e.g. the disk failed, all the osds that use this ssd will be rendered useless, therefore each osd has to be replaced. There is also a chance that the ssd is formatted through lvm, in which case the metadata database part will look like this:
[ db] /dev/cephrocks/cache-b0bf5753-07b7-40eb-bf3a-cc6f35de1d85
type db
osd id 220
cluster fsid e7681812-f2b2-41d1-9009-48b00e614153
cluster name ceph
osd fsid 81f9ed48-d27d-44b6-9ac0-f04799b5d0d5
db device /dev/cephrocks/cache-b0bf5753-07b7-40eb-bf3a-cc6f35de1d85
encrypted 0
db uuid wZnT18-ZcHh-jcKn-pHic-LCrL-Gpqh-Srq1VL
cephx lockbox secret
block uuid z8CwSn-Iap5-3xDX-v0NA-9smI-EHx1-EGxdAR
block device /dev/ceph-6bb8b94b-4974-44b0-ae6c-667896807328/osd-a59a8661-c966-443b-9384-b2676a3d42d8
vdo 0
crush device class None
devices /dev/md125
Replacement procedure: one disk per osd
- `ceph-volume lvm list` is slow; save its output to `~/ceph-volume.out` and work with that file instead.
- Check if the ssd device exists and whether it has failed.
- Check if it is used as a metadata database for osds, or as a regular osd.
  - If it is a metadata database:
    - Locate all osds that use it (lvm list + grep)
    - Follow the procedure below for each affected osd
  - Otherwise, treat it as a regular osd (normal replacement)
- Mark out the osd: `ceph osd out $OSD_ID`
- Destroy the osd: `ceph osd destroy $OSD_ID --yes-i-really-mean-it`
- Stop the osd daemon: `systemctl stop ceph-osd@$OSD_ID`
- Unmount the filesystem: `umount /var/lib/ceph/osd/ceph-$OSD_ID`
- If the osd uses a metadata database (on an ssd):
  - If it is a regular partition, remove the partition
  - If it is an lvm, remove it, e.g. for "/dev/cephrocks/cache-b0bf5753-07b7-40eb-bf3a-cc6f35de1d85":
    `lvremove cephrocks/cache-b0bf5753-07b7-40eb-bf3a-cc6f35de1d85`
- Run `ceph-volume lvm zap /dev/sdXX --destroy`
- In case `ceph-volume` fails to list the defective devices or zap the disks, you can get the information you need through `lvs -o devices,lv_tags | grep type=block` and use `vgremove` instead for the osd block.
- In case you can't get any information through `ceph-volume` or `lvs` about the defective devices, you should list the working osds and `umount` the unused folders with:

  $ umount `ls -d -1 /var/lib/ceph/osd/* | grep -v -f <(grep -Po '(?<=osd\.)[0-9]+' ceph-volume.out)`
- Now you should wait until the devices have been replaced (skip this step if they are already replaced).
- If the osd had its metadata database elsewhere (on an ssd), you should prepare it again in the lvm case. For naming we use cache-`uuid -v4`. Just recreate the lvm you removed at step 7 with: `lvcreate --name $name -l 100%FREE $VG`. LVM has three categories: PVs, which are the physical devices (e.g. /dev/sda); VGs, which are the volume groups that contain one or more physical devices; and LVs, which are the "partitions" of VGs. For simplicity we use 1 PV per VG, and one LV per VG. In case you have more than one LV per VG, when you recreate it use e.g. `25%VG` for 4 LVs per VG instead of `100%FREE`.
- Recreate the OSD using ceph-volume; reuse a destroyed osd's id from the same host (see also the worked sketch below):

  $ ceph-volume lvm create --bluestore --data /dev/sdXXX --block.db (VG/LV or ssd partition) --osd-id XXX
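A worked end-to-end sketch of the one-disk-per-osd replacement, collecting the commands above; all names are placeholders (failed osd.123 backed by /dev/sdx, RocksDB on an LV of the `cephrocks` volume group), adapt them to the host:

```sh
OSD_ID=123        # placeholder: id of the failed osd
DEV=/dev/sdx      # placeholder: data device of the failed osd
DB_LV=cephrocks/cache-b0bf5753-07b7-40eb-bf3a-cc6f35de1d85   # placeholder: its db LV, if any

ceph osd out $OSD_ID
ceph osd destroy $OSD_ID --yes-i-really-mean-it
systemctl stop ceph-osd@$OSD_ID
umount /var/lib/ceph/osd/ceph-$OSD_ID
lvremove $DB_LV                         # only if the db was on an lvm cache volume
ceph-volume lvm zap $DEV --destroy

# once the hardware has been replaced:
NEW_LV="cache-$(uuid -v4)"
lvcreate --name $NEW_LV -l 100%FREE cephrocks              # only if a db LV is needed
ceph-volume lvm create --bluestore --data $DEV --block.db cephrocks/$NEW_LV --osd-id $OSD_ID
```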
Replacement procedure: two disks striped (raid 0) per osd
- Run this script with the defective device: `ceph-scripts/tools/lvm_stripe_remove.sh /dev/sdXXX` (it doesn't take a list of devices)
- The script will report what cleanup it did; you will need the 2nd and 3rd lines, which are the two disks that make up the failed osd, and the last line, which is the osd id.
- In case the script failed, you can open it, as it is documented, and follow the steps manually.
- If you have more than one osd to replace, you can repeat steps 1 and 2, as the 5th step can be done at the end.
- Once all the disks are working, pass the set of disks from step 1 to this script: `ceph-scripts/ceph-volume/striped-osd-prepare.sh /dev/sd[a-f]`. It uses `ls` inside, so you can use wildcards if you are bored of writing '/dev/sdX' all the time.
- It will output a list of commands to be executed in order; run all EXCEPT the `ceph-volume create` one. Add the argument `--osd-id XXX` at the end of the `ceph-volume create` line, with the number of the destroyed osd id, and run the command.
Retrieve metadata information from Openstack
Ceph is tightly integrated with Openstack, and the latter is the main access point to the storage from the user perspective. As a result, Openstack is the main source of information for the data stored on Ceph: project names, project owners, quotas, etc. Some notable exceptions remain, for example local S3 accounts on Gabe and the whole Nethub cluster.
This page collects some example of what it is possible to retrieve from Openstack to know better the storage items we manage.
The magic "services" project
To gain visibility on the metadata stored by Openstack, you need access to the `services` project in Openstack. Typically all members of `ceph-admins` are part of it. `services` is a special project with storage administrator capabilities that allows retrieving various pieces of information on the whole Openstack instance and on existing projects, compute resources, storage, etc...
Use the services
project simply by setting:
OS_PROJECT_NAME=services
Openstack Projects
Get the list of openstack projects with their names and IDs:
[ebocchi@aiadm81 ~]$ OS_PROJECT_NAME=services openstack project list | head -n 10
+--------------------------------------+------------------------------------------------------------------+
| ID | Name |
+--------------------------------------+------------------------------------------------------------------+
| 0000d664-f697-423b-8595-57aea89be355 | Stuff... |
| 0007808b-2f41-41c5-bd7c-3bd1f1f94cb2 | Other stuff... |
| 00100a6d-b71c-415d-9dbc-3f78c2b8372a | Stuff continues... |
| 001d902d-f76e-4222-a5d0-ca6529e8221f | ... |
| 0026e800-f134-4622-b0ef-4a03283a3965 | ... |
| 00292adf-92ad-4815-966c-a9296266b0a0 | ... |
| 004b5668-4ebe-418d-83bc-1cdadf059c85 | ... |
Get details of a project:
[ebocchi@aiadm81 ~]$ OS_PROJECT_NAME=services openstack project show 5d8ea54e-697d-446f-98f3-da1ce8f8b833
+-------------+--------------------------------------+
| Field | Value |
+-------------+--------------------------------------+
| chargegroup | af9298f2-041b-0944-7904-3b41fde4f97f |
| chargerole | default |
| description | Ceph Storage Service |
| domain_id | default |
| enabled | True |
| fim-lock | True |
| fim-skip | True |
| id | 5d8ea54e-697d-446f-98f3-da1ce8f8b833 |
| is_domain | False |
| name | IT Ceph Storage Service |
| options | {} |
| parent_id | default |
| tags | ['s3quota'] |
| type | service |
+-------------+--------------------------------------+
Identify the owner of a project:
[ebocchi@aiadm81 ~]$ OS_PROJECT_NAME=services openstack role assignment list --project 5d8ea54e-697d-446f-98f3-da1ce8f8b833 --names --role owner
+-------+------------------+-------+---------------------------------+--------+--------+-----------+
| Role | User | Group | Project | Domain | System | Inherited |
+-------+------------------+-------+---------------------------------+--------+--------+-----------+
| owner | dvanders@Default | | IT Ceph Storage Service@Default | | | False |
+-------+------------------+-------+---------------------------------+--------+--------+-----------+
Openstack Volumes
List the RBD volumes in a project:
[ebocchi@aiadm81 ~]$ OS_PROJECT_NAME=services openstack volume list --project 5d8ea54e-697d-446f-98f3-da1ce8f8b833
+--------------------------------------+--------------------+-----------+------+---------------------------------------------------------------+
| ID | Name | Status | Size | Attached to |
+--------------------------------------+--------------------+-----------+------+---------------------------------------------------------------+
| 5143d9e4-8470-4ac4-821e-57ef99f24060 | buildkernel | in-use | 200 | Attached to 8afce55e-313f-432c-a764-b0ada783a268 on /dev/vdb |
| c0f1a9f7-8308-412a-92da-afcc20db3c4c | clickhouse-data-01 | available | 500 | |
| 53406846-445f-4f47-b4c5-e8558bb1bbed | cephmirror-io1 | in-use | 3000 | Attached to dfc9a14a-ff4b-490a-ab52-e6c9766205ad on /dev/vdc |
| c2c31270-0b95-4e28-9ac0-6d9876ea7f32 | metrictank-data-01 | in-use | 500 | Attached to fbdff7a0-7b5b-47c0-b496-5a8afcc8e528 on /dev/vdb |
+--------------------------------------+--------------------+-----------+------+---------------------------------------------------------------+
Show details of a volume:
[ebocchi@aiadm81 ~]$ OS_PROJECT_NAME=services openstack volume show c0f1a9f7-8308-412a-92da-afcc20db3c4c
+--------------------------------+-------------------------------------------+
| Field | Value |
+--------------------------------+-------------------------------------------+
| attachments | [] |
| availability_zone | ceph-geneva-1 |
| bootable | false |
| consistencygroup_id | None |
| created_at | 2021-11-04T08:34:51.000000 |
| description | |
| encrypted | False |
| id | c0f1a9f7-8308-412a-92da-afcc20db3c4c |
| migration_status | None |
| multiattach | False |
| name | clickhouse-data-01 |
| os-vol-host-attr:host | cci-cinder-qa-w01.cern.ch@beesly#standard |
| os-vol-mig-status-attr:migstat | None |
| os-vol-mig-status-attr:name_id | None |
| os-vol-tenant-attr:tenant_id | 5d8ea54e-697d-446f-98f3-da1ce8f8b833 |
| properties | |
| replication_status | None |
| size | 500 |
| snapshot_id | None |
| source_volid | None |
| status | available |
| type | io1 |
| updated_at | 2021-11-04T08:35:15.000000 |
| user_id | tmourati |
+--------------------------------+-------------------------------------------+
Show the snapshots for a volume in a project:
[ebocchi@aiadm84 ~]$ OS_PROJECT_NAME=services openstack volume snapshot list --project 79b9e379-f89d-4b3a-9827-632b9bf16e98 --volume d182a910-b40a-4dc0-89b7-890d6fa01efd
+--------------------------------------+-------------------+-------------+-----------+-------+
| ID | Name | Description | Status | Size |
+--------------------------------------+-------------------+-------------+-----------+-------+
| 798d06dc-6af4-420d-89ce-1258104e1e0f | snapv_webstuff03 | | available | 30000 |
+--------------------------------------+-------------------+-------------+-----------+-------+
Watchers preventing images from being deleted
OpenStack colleagues might report problems purging images
[root@cci-cinder-u01 ~]# rbd -c /etc/ceph/ceph.conf --id volumes --pool volumes trash ls
2ccb86bd4fca85 volume-3983f035-a47f-46e8-868c-04d2345c3786
5afa5e5a07b8bc volume-02d959fe-a693-4acb-95e2-ca04b965389b
8df764f0d51e64 volume-eb48e00f-ea31-4d28-91a1-4f8319724da7
99e74530298e95 volume-18fbb3e6-fb37-4547-8d27-dcbc5056c2b2
ebcc84aa45a3da volume-821b9755-dd42-4bf5-a410-384339a2d9f0
[root@cci-cinder-u01 ~]# rbd -c /etc/ceph/ceph.conf --id volumes --pool volumes trash purge
2021-02-17 15:42:46.911 7f674affd700 -1 librbd::image::PreRemoveRequest: 0x7f6744001880 check_image_watchers: image has watchers - not removing
Removing images: 0% complete...failed.
Find out who the watchers are, using the identifier on the left-hand side:
[15:52][root@p05517715y58557 (production:ceph/beesly/mon*2:peon) ~]# rados listwatchers -p volumes rbd_header.2ccb86bd4fca85
watcher=188.184.103.106:0/964233084 client.634461458 cookie=140076936413376
Get in touch with the owner of the machine. The easiest way to fix stuck watchers is to reboot the machine.
Further information (might require untrash) about the volume can be found with
[18:31][root@p05517715y58557 (production:ceph/beesly/mon*2:peon) ~]# rbd info volumes/volume-00067659-3d1e-4e22-a5d7-212aba108500
rbd image 'volume-00067659-3d1e-4e22-a5d7-212aba108500':
size 500 GiB in 128000 objects
order 22 (4 MiB objects)
snapshot_count: 0
id: e8df4c4fe1aa8f
block_name_prefix: rbd_data.e8df4c4fe1aa8f
format: 2
features: layering, striping, exclusive-lock, object-map
op_features:
flags:
stripe unit: 4 MiB
stripe count: 1
and with (no untrash required)
[18:32][root@p05517715y58557 (production:ceph/beesly/mon*2:peon) ~]# rados stat -p volumes rbd_header.e8df4c4fe1aa8f
volumes/rbd_header.e8df4c4fe1aa8f mtime 2020-11-23 10:25:56.000000, size 0
Unpurgeable RBD image in trash
We have seen a case of an image in Beesly's trash that cannot be purged:
# rbd --pool volumes trash ls
5afa5e5a07b8bc volume-02d959fe-a693-4acb-95e2-ca04b965389b
# rbd --pool volumes trash purge
Removing images: 0% complete...failed.
2021-03-10 13:58:42.849 7f78b3fc9c80 -1 librbd::api::Trash: remove:
error: image is pending restoration.
When trying to delete manually, it says there are some watchers, but this is actually not the case:
# rbd --pool volumes trash remove 5afa5e5a07b8bc
rbd: error: image still has watchers2021-03-10 14:00:21.262 7f93ee8f8c80
-1 librbd::api::Trash: remove: error: image is pending restoration.
This means the image is still open or the client using it crashed. Try
again after closing/unmapping it or waiting 30s for the crashed client
to timeout.
Removing image:
0% complete...failed.
# rados listwatchers -p volumes rbd_header.5afa5e5a07b8bc
#
This has been reported upstream. Check:
- ceph-users with subject "Unpurgeable rbd image from trash"
- ceph-tracker https://tracker.ceph.com/issues/49716
The original answer was
$ rados -p volumes getomapval rbd_trash id_5afa5e5a07b8bc key_file
$ hexedit key_file   ## CHANGE LAST BYTE FROM '01' to '00'
$ rados -p volumes setomapval rbd_trash id_5afa5e5a07b8bc --input-file key_file
$ rbd trash rm --pool volumes 5afa5e5a07b8bc
To unstick the image and make it purgeable:
- Get the value for its ID in `rbd_trash`
# rbd -p volumes trash ls
5afa5e5a07b8bc volume-02d959fe-a693-4acb-95e2-ca04b965389b
[09:42][root@p05517715d82373 (qa:ceph/beesly/mon*2:peon) ~]# rados -p volumes getomapval rbd_trash id_5afa5e5a07b8bc key_file
Writing to key_file
- Make a safety copy of the original key_file
# cp -vpr key_file key_file_master
- Edit the key_file with a hex editor and change the last byte from '01' to '00'
# hexedit key_file
- Make sure the edited file contains only that change
# xxd key_file > file
# xxd key_file_master > file_master
# diff file file_master
5c5
< 0000040: 2a60 09c5 d416 00 *`.....
---
> 0000040: 2a60 09c5 d416 01 *`.....
- Set the edited file to be the new value
# rados -p volumes setomapval rbd_trash id_5afa5e5a07b8bc < key_file
- Get it back and check that the last byte is now '00'
# rados -p volumes getomapval rbd_trash id_5afa5e5a07b8bc
value (71 bytes) :
00000000 02 01 41 00 00 00 00 2b 00 00 00 76 6f 6c 75 6d |..A....+...volum|
00000010 65 2d 30 32 64 39 35 39 66 65 2d 61 36 39 33 2d |e-02d959fe-a693-|
00000020 34 61 63 62 2d 39 35 65 32 2d 63 61 30 34 62 39 |4acb-95e2-ca04b9|
00000030 36 35 33 38 39 62 12 05 2a 60 09 c5 d4 16 12 05 |65389b..*`......|
00000040 2a 60 09 c5 d4 16 00 |*`.....|
00000047
- Now you can finally purge the image
# rbd -p volumes trash purge
Removing images: 100% complete...done.
# rbd -p volumes trash ls
#
Undeletable image due to linked snapshots
We had a ticket (RQF2003413) of a user unable to delete a volume because of linked snapshots.
Dump the RBD info available on Ceph using the volume ID (see openstack_info) of the undeletable volume:
[root@cephdata21b-226814ead6 (qa:ceph/beesly/mon*50:peon) ~]# rbd --pool cinder-critical info --image volume-d182a910-b40a-4dc0-89b7-890d6fa01efd
rbd image 'volume-d182a910-b40a-4dc0-89b7-890d6fa01efd':
size 29 TiB in 7680000 objects
order 22 (4 MiB objects)
snapshot_count: 1
id: 457afdd323be829
block_name_prefix: rbd_data.457afdd323be829
format: 2
features: layering
op_features:
flags:
access_timestamp: Fri Mar 25 12:19:12 2022
The snapshot_count field reports 1, which indicates one snapshot exists for the volume.
Now, list the snapshots of the undeletable volume:
[root@cephdata21b-226814ead6 (qa:ceph/beesly/mon*50:peon) ~]# rbd --pool cinder-critical snap ls --image volume-d182a910-b40a-4dc0-89b7-890d6fa01efd
SNAPID NAME SIZE PROTECTED TIMESTAMP
37 snapshot-798d06dc-6af4-420d-89ce-1258104e1e0f 29 TiB yes
In turn, it is possible to create volumes from snapshots. To check if any exist, list the child volumes of the snapshot:
[root@cephdata21b-226814ead6 (qa:ceph/beesly/mon*50:peon) ~]# rbd --pool cinder-critical children --image volume-d182a910-b40a-4dc0-89b7-890d6fa01efd --snap snapshot-798d06dc-6af4-420d-89ce-1258104e1e0f
cinder-critical/volume-b9d0035f-857c-46b6-b614-4480c462d306
This last one is a brand-new volume that still keeps a reference to the snapshot it originates from:
[root@cephdata21b-226814ead6 (qa:ceph/beesly/mon*50:peon) ~]# rbd --pool cinder-critical info --image volume-b9d0035f-857c-46b6-b614-4480c462d306
rbd image 'volume-b9d0035f-857c-46b6-b614-4480c462d306':
size 29 TiB in 7680000 objects
order 22 (4 MiB objects)
snapshot_count: 0
id: 7f8067e3510b0d
block_name_prefix: rbd_data.7f8067e3510b0d
format: 2
features: layering, striping, exclusive-lock, object-map
op_features:
flags:
access_timestamp: Fri Mar 25 12:20:51 2022
modify_timestamp: Fri Mar 25 12:36:48 2022
parent: cinder-critical/volume-d182a910-b40a-4dc0-89b7-890d6fa01efd@snapshot-798d06dc-6af4-420d-89ce-1258104e1e0f
overlap: 29 TiB
stripe unit: 4 MiB
stripe count: 1
The parent field shows the volume comes from a snapshot; the snapshot cannot be deleted because the volume-from-snapshot is implemented as copy-on-write (see overlap: 29 TiB) via RBD layering.
OpenStack can flatten volumes-from-snapshots in case these need to be made independent from the parent. Alternatively, to delete the parent volume, both the volume-from-snapshot and the snapshot must be deleted first.
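If needed, the child volume can also be flattened directly on the Ceph side; a sketch using the child volume from the example above (flattening copies all data still shared with the parent, so it can take a long time for large volumes, and going through the OpenStack flatten is preferable when available since it keeps Cinder's bookkeeping consistent):
rbd flatten cinder-critical/volume-b9d0035f-857c-46b6-b614-4480c462d306
rbd --pool cinder-critical info --image volume-b9d0035f-857c-46b6-b614-4480c462d306    # the 'parent' field should now be gone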
Large omap object warning due to bucket index over limit
Large omap objects trigger HEALTH WARN
messages and can be due to poorly sharded bucket indexes.
The following example reports an over-limit bucket on nethub, detected on 2021/05/21.
- Look for Large omap object found. in the ceph logs (/var/log/ceph/ceph.log):
2021-05-21 04:34:00.879483 osd.867 (osd.867) 240 : cluster [WRN] Large omap object found. Object: 7:7bae080b:::.dir.fe32212d-631b-44fe-8d35-03f5a3551af1.142704632.29:head PG: 7.d01075de (7.de) Key count: 610010 Size (bytes): 198156342
2021-05-21 04:34:11.622372 mon.cephnethub-data-c116fa59b2 (mon.0) 659324 : cluster [WRN] Health check failed: 1 large omap objects (LARGE_OMAP_OBJECTS)
These lines show that:
- The pool suffering from the problem is pool number 7
- The PG suffering is 7.de
- The affected object is a bucket index: the .dir. prefix represents bucket indexes
- The affected bucket has id fe32212d-631b-44fe-8d35-03f5a3551af1.142704632.29 (sadly, there is no way to map it to a name)
To verify this is actually a bucket index, one can also check what pool #7 stores:
[14:21][root@cephnethub-data-98ab89f75a (production:ceph/nethub/mon*27:peon) ~]# ceph osd pool ls detail | grep "pool 7"
pool 7 'default.rgw.buckets.index' replicated size 3 min_size 2 crush_rule 2 object_hash rjenkins pg_num 256 pgp_num 256 last_change 30708 lfor 0/0/2063 flags hashpspool,nodelete,nopgchange,nosizechange stripe_width 0 application rgw
- Run radosgw-admin bucket limit check to see how bucket index sharding is doing. It might take a while, so it is recommended to dump the output to a file.
- Check the output of radosgw-admin bucket limit check and look for buckets with OVER in their "fill_status":
{
"bucket": "cboxbackproj-cboxbackproj-sftnight-lgdocs",
"tenant": "",
"num_objects": 767296,
"num_shards": 0,
"objects_per_shard": 767296,
"fill_status": "OVER 749%"
},
- Check in the radosgw logs (use mco to look through all the RGWs) whether the radosgw process has recently tried to reshard the bucket but did not succeed. Example:
ceph-client.rgw.cephnethub-data-0509dffff2.log-20210514.gz:2021-05-13 12:19:40.316 7fd2ce2a4700 1 check_bucket_shards bucket cboxbackproj-sftnight-lgdocs need resharding old num shards 0 new num sh
ards 18
ceph-client.rgw.cephnethub-data-0509dffff2.log-20210514.gz:2021-05-13 12:20:12.624 7fd2cd2a2700 0 NOTICE: resharding operation on bucket index detected, blocking
ceph-client.rgw.cephnethub-data-0509dffff2.log-20210514.gz:2021-05-13 12:20:12.625 7fd2cd2a2700 0 RGWReshardLock::lock failed to acquire lock on cboxbackproj-sftnight-lgdocs:fe32212d-631b-44fe-8d35-
03f5a3551af1.142705079.19 ret=-16
This only applies if dynamic resharding is enabled:
[14:27][root@cephnethub-data-0509dffff2 (qa:ceph/nethub/traefik*26) ~]# cat /etc/ceph/ceph.conf | grep resharding
rgw dynamic resharding = true
- Reshard the bucket index manually:
radosgw-admin reshard add --bucket cboxbackproj-cboxbackproj-sftnight-lgdocs --num-shards 18
- The number of shards can be inferred from the logs inspected at point 4.
- If dynamic resharding is disabled, a little math is required. Check the bucket stats (radosgw-admin bucket stats --bucket <bucket_name>) and make sure usage --> rgw.main --> num_objects divided by the number of shards does not exceed 100000 (50000 is recommended).
Example:
[14:29][root@cephnethub-data-98ab89f75a (production:ceph/nethub/mon*27:peon) ~]# radosgw-admin bucket stats --bucket cboxbackproj-cboxbackproj-sftnight-lgdocs
{
"bucket": "cboxbackproj-cboxbackproj-sftnight-lgdocs",
[...]
"usage": {
"rgw.main": {
"size": 4985466767640,
"size_actual": 4987395952640,
"size_utilized": 4985466767640,
"size_kb": 4868619891,
"size_kb_actual": 4870503860,
"size_kb_utilized": 4868619891,
"num_objects": 941202
}
},
}
with 941202 / 18 = 52289
5b. Once the bucket has been added for resharding, start the reshard process:
radosgw-admin reshard list
radosgw-admin reshard process
- Check after some time that radosgw-admin bucket stats --bucket <bucket_name> reports the right number of shards and that radosgw-admin bucket limit check no longer shows OVER or WARNING for the re-sharded bucket.
- To clear the HEALTH_WARN message for the large omap object, start a deep scrub on the affected pg:
[14:31][root@cephnethub-data-98ab89f75a (production:ceph/nethub/mon*27:peon) ~]# ceph pg deep-scrub 7.de
instructing pg 7.de on osd.867 to deep-scrub
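Once the deep scrub has completed, a quick check that the warning is gone (plain health check, nothing cluster-specific assumed):
ceph health detail | grep -i omap
If no LARGE_OMAP_OBJECTS line is returned, the warning has cleared.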
Ceph logging [WRN] evicting unresponsive client
This warning shows that a client stopped responding to messages from the MDS. Sometimes it is harmless (perhaps a client disconnected "uncleanly", e.g. a hard reboot); otherwise it could indicate that the client is overloaded or deadlocked on something else.
If the same client appears repeatedly, it may be useful to get in touch with the owner of the client machine (ai-dump <hostname> on aiadm).
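To map the client id from the warning to a hostname/IP, the MDS session list can be inspected; a sketch (assumes the mds tell interface of our Nautilus-era clusters, and <client_id> is the id from the warning):
ceph tell mds.0 session ls > sessions.json        # repeat for each active MDS rank
grep -A 15 '"id": <client_id>' sessions.json      # the client_metadata block contains hostname and IP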
Ceph logging [WRN] clients failing to respond to cache pressure
When the MDS cache is full, it will need to clear inodes from its cache. This normally also means that the MDS needs to ask some clients to also remove some inodes from their cache too.
If the client fails to respond to this cache recall request, then Ceph will log this warning.
Clients stuck in this state for an extended period of time can cause issues -- follow up with the machine owner to understand the problem.
Note: Ceph-fuse v13.2.1 has a bug which triggers this issue -- users should update to a newer client release.
Ceph logging [WRN] client session with invalid root denied
This means that a user is trying to mount a Manila share that either doesn't exist or for which they haven't created a key yet. It is harmless, but if it repeats, get in touch with the user.
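To check whether a key exists for the share and which path it is restricted to, the corresponding cephx client can be inspected (a sketch; the client name is the one appearing in the MDS log line):
ceph auth get client.<share_user>        # the 'caps mds' line shows the path the key is allowed to use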
Procedure to unblock hung HPC writes
An HPC client was stuck like this for several hours:
HEALTH_WARN 1 clients failing to respond to capability release; 1 MDSs
report slow requests
MDS_CLIENT_LATE_RELEASE 1 clients failing to respond to capability release
mdscephflax-mds-2a4cfd0e2c(mds.1): Client hpc070.cern.ch:hpcscid02
failing to respond to capability release client_id: 69092525
MDS_SLOW_REQUEST 1 MDSs report slow requests
mdscephflax-mds-2a4cfd0e2c(mds.1): 1 slow requests are blocked > 30 sec
Indeed there was a hung write on hpc070.cern.ch:
# cat /sys/kernel/debug/ceph/*/osdc
245540 osd100 1.9443e2a5 1.2a5 [100,1,75]/100 [100,1,75]/100
e74658 fsvolumens_393f2dcc-6b09-44d7-8d20-0e84b072ed26/2000b2f5905.00000001
0x400024 1 write
I restarted osd.100 and the deadlocked request went away.
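For reference, the offending OSD can be located and restarted with standard tooling (a sketch; osd.100 is the OSD shown in the osdc output above):
ceph osd find 100                                   # reports the host (and IP) where osd.100 runs
ssh <osd_host> systemctl restart ceph-osd@100       # restart only that OSD daemon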
S3 Operations notes
Note: If you are looking for the old notes related to the infrastructure based on consul and nomad, please refer to the old documentation.
About the architecture
The CERN S3 service (s3.cern.ch) is provided by the gabe cluster and an arbitrary number of radosgw daemons running on VMs. Each node in the ceph/gabe/radosgw
hostgroup also runs a reverse-proxy daemon (Træfik), to spread the load on the VMs running a radosgw
and to route traffic to different dedicated RGWs (cvmfs, gitlab, ...).
A second S3 cluster (s3-fr-prevessin-1.cern.ch) is also available in the Prevessin Nethub (nethub).
Both clusters (as of July 2021) use similar technologies: Ceph, RGWs, Træfik, Logstash, ....
Components
- RadosGW: Daemon handling S3 requests and interacting with the Ceph cluster
- Træfik: Handles HTTP(S) requests from the Internet and spreads the load on
radosgw
daemons. - Logstash: Sidecar process that ships the access logs produced by Træfik to the MONIT infrastructure.
Useful documentation
- Upstream RadosGW documentation: (https://docs.ceph.com/en/nautilus/radosgw/)
- Upstream documentation on the radosgw-admin tool: (https://docs.ceph.com/en/nautilus/man/8/radosgw-admin/)
- Træfik documentation: (https://docs.traefik.io/)
- S3 Script guide: (https://gitlab.cern.ch/ceph/ceph-guide/-/blob/master/src/ops/s3-scripts.md)
Dashboards
- Træfik: http://s3.cern.ch/traefik/ (requires basic auth)
- ElasticSearch for access logs: https://es-ceph.cern.ch/ (from CERN network only)
- Various S3 dashboards (and underlying Ceph clusters) on Filer Carbon
- Buckets rates (and others) on Monit Grafana
Maintenance Tasks
Removal of one Træfik/RGW machine from the cluster
Each machine running Træfik/RGW is:
- Part of the s3.cern.ch alias (managed by lbclient), with Træfik accepting connections on ports 80 and 443 for HTTP and HTTPS, respectively
- A backend RadosGW for all the Træfiks of the cluster, with the Ceph RadosGW daemon accepting connections on port 8080
- To remove a machine from s3.cern.ch, touch /etc/nologin or change the roger status to intervention/disabled (roger update --appstate=intervention <hostname>). This will make lbclient return a negative value and the machine will be removed from the alias.
- To temporarily remove a RadosGW from the list of backends (e.g., for a cluster upgrade), touch /etc/nologin and the RadosGW process will return 503 for requests to /swift/healthcheck. This path is used by the Træfik healthcheck and, if the return code is different from 200, Træfik will stop sending requests to that backend. Wait a few minutes to let in-flight requests complete, then restart the RadosGW process without clients noticing. See Pull Request to implement healthcheck disabling path.
- To permanently remove a RadosGW from the list of backends (e.g., decommissioning), change the Træfik dynamic configuration via puppet in traefik.yaml by removing the machine from the servers list. The new configuration must be applied by all the other nodes. To run puppet everywhere, use mco from aiadm:
[ebocchi@aiadm81 ~]$ mco puppet runonce -T ceph -F hostgroup=ceph/gabe/radosgw/traefik --dt=3
* [ ============================================================> ] 14 / 14
Finished processing 14 / 14 hosts in 114.60 ms
Create a new Træfik/RGW VM
- Spawn a new VM with the script cephgabe-rgwtraefik-create.sh from aiadm
- Wait for the VM to be online and run puppet several times so that the configuration is up to date
- Make sure you have received the email confirming the VM has been added to the firewall set (and so it is reachable from the big Internet)
- Make sure the new VM serves requests as expected (test IPv4 and IPv6, HTTP and HTTPS):
curl -vs --resolve s3.cern.ch:{80, 443}:<ip_address_of_new_VM> http(s)://s3.cern.ch/cvmfs-atlas/cvmfs/atlas.cern.ch/.cvmfspublished --output -
- Add the VM to the Prometheus s3_lb job (see prometheus puppet config) to monitor its availability and collect statistics on failed (HTTP 50*) requests
- Change the roger status to production and enable all alarms. The machine will now be part of the s3.cern.ch alias
- Update the Træfik dynamic configuration via puppet in traefik.yaml by adding the new backend to the servers list. The new configuration must be applied by all the other nodes. To run puppet everywhere, use mco from aiadm (see above).
Change/Add/Remove the backend RadosGWs
- Edit the list of backend nodes in the Træfik dynamic configuration via puppet in traefik.yaml by adding/removing/shuffling around the servers. The new configuration must be applied by all the other nodes. To run puppet everywhere, use mco from aiadm (see above).
- If adding/removing, make sure the list of monitored endpoints by Prometheus is up to date. See prometheus puppet config.
Change Træfik TLS certificate
The certificate is provided by CDA. You should ask them to buy a new one with the correct SANs.
Once the new certificate is provided, copy-paste it on https://tools.keycdn.com/certificate-chain -- it will return a certificate chain with all the required intermediate certificates. This certificate chain is the one to be put in Teigi and used by Træfik. Please split it and check the validity of each certificate with openssl x509 -in <filename> -noout -text
. Typically, the root CA certificate, the intermediate certificate and the private key do not change.
Once validated, it should be put in Teigi under ceph/gabe/radosgw/traefik
:
- s3_ch_ssl_certificate
- s3_ch_ssl_private_key
Next, the certificate must be deployed on all machines via puppet. Mcollective can be of help to bulk-run puppet on all the Træfik machines:
[ebocchi@aiadm81 ~]$ mco puppet runonce -T ceph -F hostgroup=ceph/gabe/radosgw/traefik --dt=3
* [ ============================================================> ] 14 / 14
Finished processing 14 / 14 hosts in 114.60 ms
Last, the certificate must be loaded by Træfik. While the certificate is part of Træfik's dynamic configuration, Træfik does not seem to reload it if the certificate file (distributed via puppet) changes on disk. Puppet will still notify the Træfik service when the certificate file changes (see traefik.pp) to no avail.
Since 2022, a configuration change in Træfik (Traefik: hot-reload certificates when touching (or editing) dynamic file) allows reloading the certificate when the Traefik dynamic configuration file changes. It is sufficient to touch /etc/traefik/traefik.dynamic.conf
to have the certificate reloaded, with no need to drain the machine and restart the Traefik process:
- Make sure the new certificate file is available on the machine (
/etc/ssl/certs/radosgw.crt
) - Tail the logs of the Traefik service:
tail -f /var/log/traefik/service.log
- Touch Traefik's dynamic configuration file:
touch /etc/traefik/traefik.dynamic.conf
- Check the new certificate is in place:
curl -vs --resolve s3.cern.ch:443:<the_ip_address_of_the_machine> https://s3.cern.ch --output /dev/null 2>&1 | grep ^* | grep date
* start date: Mar 1 00:00:00 2022 GMT
* expire date: Mar 1 23:59:59 2023 GMT
The same certificates are also used by the Nethub cluster and distributed via Teigi under ceph/nethub/traefik
:
- s3_fr_ssl_certificate
- s3_fr_ssl_private_key
Quota alerts
There is a daily cronjob that checks S3 user quota usage and sends a list of accounts reaching 90% of their quota. Upon reception of this email, we should get in touch with the user and see if they can (1) free some space by deleting unnecessary data or (2) request more space.
Currently, there are some rgw accounts that come without an associated email address. A way to investigate who owns the account is to log into aiadm.cern.ch and run the following command (in /root/ceph-scripts/tools/s3-accounting/):
./cern-get-accounting-unit.sh --id `./s3-user-to-accounting-unit.py <rgw account id>`
This will give you the username of the owner of the associated OpenStack tenant, together with the contact email address.
Further notes on s3.cern.ch alias
The s3.cern.ch alias is managed by aiermis and/or by the kermis CLI utility on aiadm:
[ebocchi@aiadm81 ~]$ kermis -a s3 -o read
INFO:kermis:[
{
"AllowedNodes": "",
"ForbiddenNodes": "",
"alias_name": "s3.cern.ch",
"behaviour": "mindless",
"best_hosts": 10,
"clusters": "none",
"cnames": [],
"external": "yes",
"hostgroup": "ceph/gabe/radosgw",
"id": 3019,
"last_modification": "2018-11-01T00:00:00",
"metric": "cmsfrontier",
"polling_interval": 300,
"resource_uri": "/p/api/v1/alias/3019/",
"statistics": "long",
"tenant": "golang",
"ttl": null,
"user": "dvanders"
}
]
As of July 2021, the alias returns the 10 best hosts (based on the lbclient score) out of all the machines that are part of the alias, which are typically more. Also, the members of the alias are refreshed every 5 minutes (300 seconds).
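To quickly see which members the alias currently returns, a plain DNS query is enough:
dig +short s3.cern.ch
dig +short -t AAAA s3.cern.ch     # IPv6 members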
Upgrading software
Upgrade mon/mgr/osd
Follow the procedure defined for the other Ceph clusters. In a nutshell:
- Start with mons, then mgrs. OSDs go last.
- If upgrading OSDs, ceph osd set {noin, noout}
- yum update to update the packages (check that the ceph package is actually upgraded)
- systemctl restart ceph-{mon, mgr, osd}
- Always make sure the daemons came back alive and all OSDs re-peered before continuing with the following machine
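A condensed sketch of the per-host OSD upgrade sequence described above (assumes the standard ceph systemd targets; adapt the target to the daemon type being upgraded):
ceph osd set noout && ceph osd set noin       # avoid rebalancing while daemons restart
yum update -y ceph                            # verify the ceph package version actually changed
systemctl restart ceph-osd.target             # or ceph-mon.target / ceph-mgr.target
ceph -s                                       # wait until all OSDs are back and PGs are active+clean
ceph osd unset noin && ceph osd unset noout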
Upgrading RGW
To safely upgrade the RadosGW, touch /etc/nologin
to have it return 503
to the healthcheck probes from Træfik (see more about healthcheck disabling path above). This allows for draining the RadosGW by not sending new requests to it and letting in-flight ones finish gently.
After a few minutes, one can assume there are no more in-flight requests and the RadosGW can be updated and restarted (e.g., systemctl restart ceph-radosgw.target). Make sure the RadosGW came back alive by tailing the log at /var/log/ceph/ceph-client.rgw.*
; it should still return 503
to the Træfik healthchecks. Now remove /etc/nologin
and check the requests flow with 200
.
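To verify the drain/undrain behaviour locally, the healthcheck path can be probed directly on the RadosGW backend port (a sketch; 8080 is the port mentioned above):
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080/swift/healthcheck     # 503 while /etc/nologin exists, 200 once removed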
Upgrading Træfik
To safely upgrade Træfik, the frontend machine must be removed from the load-balanced alias by touching /etc/nologin
(this will also disable the RadosGW due to the healthcheck disabling path -- see above). Wait for some time and make sure no (or little) traffic is handled by Træfik by checking its access logs (/var/log/traefik/access.log
). Some clients (e.g., GitLab, CBack) are particularly sticky and rarely re-resolve the alias to IPs -- there is nothing you can do to push those clients away.
When no (or little) traffic goes through Træfik, update the traefik::version
parameter and run puppet. The new Træfik binary will be installed on the host and the service will be restarted.
Check with curl
that Træfik works as expected. Example:
$ curl -vs --resolve s3.cern.ch:80:188.184.74.136 http://s3.cern.ch/cvmfs-atlas/cvmfs/atlas.cern.ch/.cvmfspublished --output -
* Added s3.cern.ch:80:188.184.74.136 to DNS cache
* Hostname s3.cern.ch was found in DNS cache
* Trying 188.184.74.136:80...
* TCP_NODELAY set
* Connected to s3.cern.ch (188.184.74.136) port 80 (#0)
> GET /cvmfs-atlas/cvmfs/atlas.cern.ch/.cvmfspublished HTTP/1.1
> Host: s3.cern.ch
> User-Agent: curl/7.68.0
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Accept-Ranges: bytes
< Bucket: cvmfs-atlas
< Cache-Control: max-age=61
< Content-Length: 601
< Content-Type: application/x-cvmfs
< Date: Fri, 22 Apr 2022 14:45:27 GMT
< Etag: "b5dbc3633d7bb27d10610f5f1079a192"
< Last-Modified: Fri, 22 Apr 2022 14:11:10 GMT
< X-Amz-Request-Id: tx00000000000000143ffd3-006262bf87-28e3e206-default
< X-Rgw-Object-Type: Normal
<
Ca5b48a4ed8f0ca46b79584104564da32b42a1c45
B1385472
Rd41d8cd98f00b204e9800998ecf8427e
D240
S103476
Gno
Ano
Natlas.cern.ch
{...cut...}
* Connection #0 to host s3.cern.ch left intact
If successful, allow the machine to join the load-balanced pool by removing /etc/nologin
.
S3 radosgw-admin operations
radosgw-admin
is used to manage users, quotas, buckets, indexes, and all other aspects of the radosgw service.
Create a user
End-users get S3 quota from OpenStack (see Object Storage).
In special cases (e.g., Atlas Event Index, CVMFS Stratum 0s, GitLab, Indico, ...), we create users that exist only in Ceph and are not managed by OpenStack. To create a new user of this kind, you need to know the user_id, email address, display name, and (optionally) the quota.
Create the user with:
radosgw-admin user create --uid=<user_id> --email=<email_address> --display-name=<display_name>
To set a quota for the user:
radosgw-admin quota set --quota-scope=user --uid=<user_id> --max-size=<quota>
radosgw-admin quota enable --quota-scope=user --uid=<user_id>
Example:
radosgw-admin user create --uid=myuser --email="myuser@cern.ch" --display-name="myuser"
radosgw-admin quota set --quota-scope=user --uid=myuser --max-size=500G
radosgw-admin quota enable --quota-scope=user --uid=myuser
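To verify the user and its quota afterwards (standard radosgw-admin calls):
radosgw-admin user info --uid=myuser                  # keys, display name, quota settings
radosgw-admin user stats --uid=myuser --sync-stats    # current usage against the quota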
Change user quota
It is sufficient to set the updated quota value for the user:
radosgw-admin quota set --quota-scope=user --uid=<user_id> --max-size=<quota>
Bucket resharding
RGW shards bucket indices over several objects. The default number of shards per index is 32 in our clusters. It is best practice to keep the number of objects per shard below 100000. You can check the compliance across all buckets with radosgw-admin bucket limit check
.
If there is a bucket with "fill_status": "OVER 100.000000%"
then it should be resharded. E.g.
> radosgw-admin bucket reshard --bucket=lhcbdev-test --num-shards=128
tenant:
bucket name: lhcbdev-test
old bucket instance id: 61c59385-085d-4caa-9070-63a3868dccb6.24333603.1
new bucket instance id: 61c59385-085d-4caa-9070-63a3868dccb6.76824996.1
total entries: 1000 2000 ... 8599000 8599603
2019-06-17 09:27:47.718979 7f2b7665adc0 1 execute INFO: reshard of bucket "lhcbdev-test" from "lhcbdev-test:61c59385-085d-4caa-9070-63a3868dccb6.24333603.1" to "lhcbdev-test:61c59385-085d-4caa-9070-63a3868dccb6.76824996.1" completed successfully
SWIFT protocol for quota information
It is convenient to use the SWIFT protocol to retrieve quota information.
- Create the SWIFT user as a subuser:
radosgw-admin subuser create --uid=<user_id> --subuser=<user_id>:swift --access=full
This generates a secret key that can be used on the client side to authenticate with SWIFT.
- On clients, install the
swift
package (provided in the OpenStack Repo on linuxsoft) and retrieve quota information with
swift \
-V 1 \
-A https://s3.cern.ch/auth/v1.0 \
-U <user_id>:swift \
-K <secret_key> \
stat
S3 logging
Access logs from the Træfik reverse-proxy are collected via a side-car process called fluentbit.
It pushes the logs to the Monit Logs infrastructure for later processing (filtering and enrichment) by Logstash running on Monit Marathon.
Eventually, logs are then pushed to HDFS (/project/monitoring/archive/s3/logs
) and to Elasticsearch for storage and visualization.
fluentbit on S3 RadosGWs
Since late April 2022, we use fluentbit on RadosGWs+Træfik frontends as it is much more gentle on memory than Logstash (which we were using previously).
fluentbit tails the log files produced by Træfik (both HTTP access logs and Træfik daemon logs), adds a few fields and context through metadata, and pushes the records to the Monit Logs infrastructure at URI monit-logs-s3.cern.ch:10013/s3
using TLS encryption.
It is installed via puppet (example for Gabe) by using the shared class fluentbit.pp responsible for installation and configuration of the fluentbit service.
fluentbit on the RadosGWs+Træfik frontends is configured to tail two input files, namely the access (/var/log/traefik/access.log
) and the daemon (/var/log/traefik/service.log
) logs of Træfik. Logs from the access (daemon) file are tagged as traefik.access.*
(traefik.service.*
), labelled as s3_access
(s3_daemon
). Before sending to the Monit infrastructure, the message is prepared to define the payload data and metadata (see monit.lua):
- producer is s3 (used to build the path on HDFS) -- must be whitelisted on the Monit infra;
- type defines if the logs are access or daemon (used to build the path on HDFS);
- index_prefix defines the index for the logs (used by Logstash on Monit Marathon and on Elasticsearch).
Logstash on Monit Marathon
Logstash is the tool that reads the aggregated log stream from Kafka, does most of the transformation and writes to Elasticsearch.
This Logstash process runs in a Docker container on the Monit Marathon cluster (see Applications --> storage --> s3logs-to-es).
For debugging purposes, stdout
and stderr
of the container are available on monit-spark-master.cern.ch:5050/ -- They do not work from Marathon.
The Dockerfile, configuration pipeline, etc., are stored in s3logs-to-es.
This Logstash instance:
- removes the additional fields introduced by the Monit infrastructure (metadata unused by us)
- parses the original message as json document
- adds costing information
- adds geographical information of the client IP (geoIP)
- copies a subset of fields relevant for CSIR to a different index
- ...and pushes the results (full logs, and CSIR stripped version) to Elasticsearch
Elasticsearch
We finally have our dedicated Elasticsearch instance managed by the Elasticsearch Service.
There's not much to configure from our side, just a few useful links and the endpoint config repository:
Data is kept for:
- 10 days on fast SSD storage, local to the ES cluster
- another 20 days (30 total) on Ceph storage
- 13 months (stripped-down version, some fields are filtered out -- see below) for CSIR purposes
Indexes on ES must start with ceph_s3
. This is the only whitelisted pattern, and hence the only one allowed.
We currently use different indexes:
- ceph_s3_access: Access logs for Gabe (s3.cern.ch)
- ceph_s3_daemon: Traefik service logs for Gabe
- ceph_s3_access-csir: Stripped down version of Gabe access logs for CSIR, retained for 13 months
- ceph_s3_fr_access: Access logs of Nethub (s3-fr-prevessin-1.cern.ch)
- ceph_s3_fr_daemon: Traefik service logs for Nethub
- ceph_s3_fr_access-csir: Stripped down version of Nethub access logs for CSIR, retained for 13 months
ES is also a data source for Monit grafana dashboards:
- Grafana uses basic auth to ES with user ceph_ro:<password> (the password is stored in Teigi: ceph/gabe/es-ceph_ro-password)
- ES must have the internal user ceph_ro configured with permissions to read ceph* indexes
HDFS
HDFS is solely used as a storage backend to store the logs for 13 months for CSIR purposes. As of July 2021, HDFS stores the full logs (to be verified whether they do not eat too much space on HDFS). To check/read logs on HDFS, you must have access to the HDFS cluster (see prerequisites) and, from lxplus:
source /cvmfs/sft.cern.ch/lcg/views/LCG_99/x86_64-centos7-gcc8-opt/setup.sh
source /cvmfs/sft.cern.ch/lcg/etc/hadoop-confext/hadoop-swan-setconf.sh analytix 3.2 spark3
kinit
hdfs dfs -ls /project/monitoring/archive/s3/logs
CentOS Stream 8 migration
All the information regarding CentOS Stream 8 can be found in this document.
Upgrading from CentOS 8 in place
- Create new CS8 nodes with representative configurations and validate
- Enable the upgrade (top-level hostgroup, sub-hostgroup, etc):
base::migrate::stream8: true
- Follow the instructions:
  - Run Puppet twice.
  - Run distro-sync.
  - Reboot.
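A minimal sketch of the in-place migration steps above, run on the node being migrated (assumes standard Puppet and DNF tooling):
puppet agent -t ; puppet agent -t     # run Puppet twice
dnf distro-sync -y                    # sync installed packages to the CentOS Stream 8 repositories
reboot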
CephFS Backup - cback
CephFS backups are currently added on demand and backed up automatically by our cback orchestrator.
Backup Characteristics
- Stored in S3 Nethub cluster (Prevessin, FR)
- Backups are not consistent (no actual mount freeze or similar)
- Snapshot based, with the default (but per job configurable) retention:
- Last 7 daily snapshots
- Last 5 weekly snapshots
- Last 6 monthly snapshots
- Backup repositories are encrypted (AES-256)
- Backups are periodically verified and pruned.
Add new backup job
Currently, only jobs from flax and levinson can be added.
Access cback-backup.cern.ch
and trigger the following command:
cback backup add
[--repository s3_bucket_path]
[--force]
--instance instance
--group ceph
NAME
SOURCE
Arguments:
NAME: name that identifies the backup
SOURCE: path of the share to backup
Flags:
--instance: informational; the name of the instance where the source data resides.
--group: indicates the group of backups to which the backup belongs. Backups in the same group share common configuration, S3 credentials, etc. Always use ceph.
--force: indicates that the backup will run every time, no matter whether there were changes or not. If not specified, the backup will only trigger if the recursive mtime of the volume path is newer than the last backup snapshot.
--repository: overrides the default repository name generation, which is cbackceph-<NAME>. It has to be a fully qualified S3 URL, e.g.: s3:https://s3-fr-prevessin-1.cern.ch/cephback-my-custom-bucket-name
Example:
cback backup add --instance flax --group ceph --force alfa /cephfs-flax/volumes/_nogroup/xxxxxxxx
This will print a summary of the backup just created, including the backup_id. The backup will still be disabled; to enable it run:
cback backup enable <backup_id>
Please note that once enabled, the first backup will start right away if a backup agent is free; the next one will run 24h after the first finishes, and so on.
Enable prune. This will enable the purging of old backups using the retention policy indicated above.
cback prune enable <backup_id>
Specify the desired backup start time.
If we want more control over when a backup is performed, we can do the following:
cback backup modify <backup_id> --desired-start-time 20:00
Note: this is a desired time, not an exact time; the backup will start when an agent is free after that time. Also note that having many backups starting at the same time could introduce load on the backend, so the recommendation is to use the default scheduling unless specifically requested.
Restore data - TBD
There are many ways to restore data from a backup repository:
- Using cback asynchronous restore jobs
- Fast restore. Ideal for big restores.
- Mounting the backup repository
- Slow restore. Ideal for single files or small sets, or for checking the status of the backups when it is not clear what to look for.
- Using vanilla restic.
Future work will allow users to interact with the backup by themselves.
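For reference, a sketch of inspecting a repository with vanilla restic; the bucket name follows the cbackceph-<NAME> convention described above, and the credential/password placeholders below are hypothetical (they belong to the ceph backup group):
export AWS_ACCESS_KEY_ID=<backup_group_access_key>
export AWS_SECRET_ACCESS_KEY=<backup_group_secret_key>
export RESTIC_PASSWORD=<repository_encryption_password>
restic -r s3:https://s3-fr-prevessin-1.cern.ch/cbackceph-alfa snapshots           # list snapshots
restic -r s3:https://s3-fr-prevessin-1.cern.ch/cbackceph-alfa mount /mnt/restore  # browse files read-only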