Ceph


Ceph is an open source distributed storage system that provides object, block, and file level storage. Its key feature is the ability to replicate data across many different systems, without a single point of failure and allowing for the use of commodity components while maintaining reliable storage.


Introduction

Ceph has 3 main components:

  1. The Client, which exposes a POSIX file system interface to a host or process and performs file I/O by communicating directly with OSDs.
  2. The Cluster of OSDs, which collectively stores all data and metadata.
  3. The Metadata Server (MDS), which manages namespace data (filenames, directories) as well as consistency and coherence.

Concepts

Some main concepts to keep in mind when dealing with Ceph are:

OSD
Object Storage Device
Ceph OSD
Ceph Object Storage Daemon, or ceph-osd, which handles object data, including replication, recovery, and rebalancing.
MDS
Metadata Server, used by the Ceph Filesystem (CephFS) to store filesystem metadata. It is not needed for Ceph Block Devices or object storage.
CRUSH
Controlled Replication Under Scalable Hashing. A pseudo-random data distribution function that efficiently maps each Placement Group (PG) to an ordered list of OSDs upon which to store object replicas.
OSD Cluster Map
A compact, hierarchical description of the devices comprising the storage cluster.
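
Once a cluster is running, most of these pieces can be inspected directly. A minimal sketch using standard Ceph tooling (output omitted; it varies per cluster):

## Dump the OSD cluster map
$ ceph osd dump

## Extract and decompile the CRUSH map for inspection
$ ceph osd getcrushmap -o crushmap.bin
$ crushtool -d crushmap.bin -o crushmap.txt

## Summarize placement group state
$ ceph pg stat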

Client

The Ceph Client codebase runs entirely in userspace. Applications can make use of Ceph by linking to a client library or by accessing files through FUSE.

Ceph uses a range of striping strategies to map file data onto a series of predictably named objects. Object names encode metadata such as the file's inode number and stripe number. Object replicas are assigned to OSDs using CRUSH.

When accessing a file, the MDS translates the filename into an inode, from which the client can derive the predictably named objects and retrieve them from the appropriate OSDs via CRUSH, without requiring any per-object lookup.
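
As a rough sketch of the FUSE path, assuming an MDS and a CephFS filesystem already exist (the quick installation below only sets up block storage) and that the admin keyring is in /etc/ceph:

## Mount CephFS through FUSE
# yum install -y ceph-fuse
# mkdir -p /mnt/cephfs
# ceph-fuse -m mon1:6789 /mnt/cephfs

## Or use the in-kernel client instead (the secret file path here is just an example)
# mount -t ceph mon1:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret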

Cluster of OSDs

At a high level, a cluster of OSDs provides a single logical object store for clients and metadata servers. The cluster transparently handles data migration, replication, failure detection, and failure recovery. The underlying technology is RADOS (Reliable Autonomic Distributed Object Store), which is designed to scale linearly. Each Ceph OSD manages its local object storage with EBOFS (Extent and B-tree based Object File System).

OSDs can be grouped into failure domains (such as physical location or hardware type) through the CRUSH hierarchy so that replicas are spread across these groups. Objects themselves are aggregated into Placement Groups (PGs), each of which is mapped by CRUSH to a set of OSDs spanning different failure domains.

The CRUSH algorithm determines which OSDs an object should be placed on, distributing objects so that the placement and replication requirements are satisfied. To locate any object, CRUSH requires only the placement group and an OSD cluster map.

Data is replicated in terms of PGs, each of which is mapped to an ordered list of n OSDs (for n-way replication). Clients send all writes to the primary (first) OSD in an object's PG. The primary OSD assigns a new version number to the object and forwards the write to any additional replica OSDs.
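
On a running cluster you can see this mapping directly. A small sketch, using a hypothetical object called myobject in the default rbd pool and an example PG id:

## Which PG does the object hash to, and which OSDs (primary first) hold it?
$ ceph osd map rbd myobject

## Which OSDs serve a given placement group?
$ ceph pg map 0.3f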


Metadata Server

The Metadata Server (MDS) is diskless and serves as an index to the OSD cluster to facilitate read and write. All metadata as well as data are stored in the OSD cluster.

Typically there would be around 5 MDSs in a 400-node OSD deployment. This may look like overkill for what is essentially an indexing service in front of the OSD cluster, but it is required to achieve very high scalability. Effective metadata management is critical to overall system performance because metadata operations make up as much as half of typical file system workloads.

Ceph delegates MDS responsibilities using dynamic subtree partitioning, allowing the metadata workload to be rebalanced across servers on the fly.
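
A small sketch of how this shows up in the tooling, assuming a CephFS filesystem named cephfs already exists (none is created in the quick installation below):

## Show MDS daemons and which are active vs. standby
$ ceph mds stat

## Allow two active MDS daemons so the namespace is partitioned across them
## (older releases may also require the allow_multimds flag to be set first)
$ ceph fs set cephfs max_mds 2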

Quick Installation

To try Ceph out, you will need to deploy five nodes: admin, mon1, osd1, osd2, and osd3.

On a clean installation of CentOS on all nodes, run as root:

## Dependencies (time synchronization via NTP is also recommended)
# yum install -y open-vm-tools

## Create a cephuser account
# useradd -d /home/cephuser -m cephuser
# echo -e "ceph\nceph" | passwd cephuser

## With sudoers access to everything
# echo "cephuser ALL = (root) NOPASSWD:ALL" | sudo tee /etc/sudoers.d/cephuser
# chmod 0440 /etc/sudoers.d/cephuser
# sed -i "s/Defaults requiretty/#Defaults requiretty/g" /etc/sudoers

## Generate a new SSH key on every node for cephuser 
# sudo -u cephuser ssh-keygen -t rsa -N "" -f /home/cephuser/.ssh/id_rsa
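
The nodes also need to resolve each other's hostnames. If DNS is not available, a sketch of /etc/hosts entries to add on every node (the addresses are examples; mon1's matches the cluster used later on this page):

# cat <<EOF >> /etc/hosts
192.168.1.195   admin
192.168.1.196   mon1
192.168.1.197   osd1
192.168.1.198   osd2
192.168.1.199   osd3
EOF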

After each node has been set up, log in to each node as cephuser and run:

## Add every node's host key to our known_hosts
$ ssh-keyscan admin mon1 osd1 osd2 osd3 >> ~/.ssh/known_hosts

## Copy our public key into each node's authorized_keys
$ for i in admin mon1 osd1 osd2 osd3 ; do ssh-copy-id $i ; done

## Set ssh config for cephuser
$ cat <<EOF > /home/cephuser/.ssh/config
Host admin
        Hostname admin
        User cephuser
 
Host mon1
        Hostname mon1
        User cephuser
 
Host osd1
        Hostname osd1
        User cephuser
 
Host osd2
        Hostname osd2
        User cephuser
 
Host osd3
        Hostname osd3
        User cephuser
 
Host client
        Hostname client
        User cephuser
EOF
$ chmod 644 ~/.ssh/config
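
To confirm that passwordless SSH works from the admin node (ceph-deploy relies on it):

## Each command should print the remote hostname without asking for a password
$ for i in admin mon1 osd1 osd2 osd3 ; do ssh $i hostname ; done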

On the admin node:

root@admin# systemctl start firewalld
root@admin# systemctl enable firewalld
root@admin# firewall-cmd --zone=public --add-port=80/tcp --permanent
root@admin# firewall-cmd --zone=public --add-port=2003/tcp --permanent
root@admin# firewall-cmd --zone=public --add-port=4505-4506/tcp --permanent
root@admin# firewall-cmd --reload

On mon1:

root@mon1# systemctl start firewalld
root@mon1# systemctl enable firewalld
root@mon1# firewall-cmd --zone=public --add-port=6789/tcp --permanent
root@mon1# firewall-cmd --reload

On the OSD nodes:

root@osdN# systemctl start firewalld
root@osdN# systemctl enable firewalld
root@osdN# firewall-cmd --zone=public --add-port=6800-7300/tcp --permanent
root@osdN# firewall-cmd --reload
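
You can confirm the rules took effect on each node with:

# firewall-cmd --zone=public --list-ports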


On the admin node, install ceph-deploy. Get the latest package from https://docs.ceph.com/docs/master/install/get-packages/.

# rpm -Uhv http://download.ceph.com/rpm-jewel/el7/noarch/ceph-release-1-1.el7.noarch.rpm
# yum install -y ceph-deploy

## Make a new cluster directory
cephuser@admin:~# mkdir cluster
cephuser@admin:~# cd cluster

Run the ceph-deploy new command. This will create a Ceph cluster configuration file named ceph.conf in the cluster directory.

cephuser@admin:~/cluster# ceph-deploy new mon1
[ceph_deploy.conf][DEBUG ] found configuration file at: /home/cephuser/.cephdeploy.conf
[ceph_deploy.cli][INFO  ] Invoked (1.5.39): /bin/ceph-deploy new mon1
[ceph_deploy.cli][INFO  ] ceph-deploy options:
[ceph_deploy.cli][INFO  ]  username                      : None
[ceph_deploy.cli][INFO  ]  func                          : <function new at 0x7f08afaea668>
[ceph_deploy.cli][INFO  ]  verbose                       : False
[ceph_deploy.cli][INFO  ]  overwrite_conf                : False
[ceph_deploy.cli][INFO  ]  quiet                         : False
[ceph_deploy.cli][INFO  ]  cd_conf                       : <ceph_deploy.conf.cephdeploy.Conf instance at 0x7f08af2685f0>
[ceph_deploy.cli][INFO  ]  cluster                       : ceph
[ceph_deploy.cli][INFO  ]  ssh_copykey                   : True
[ceph_deploy.cli][INFO  ]  mon                           : ['mon1']
[ceph_deploy.cli][INFO  ]  public_network                : None
[ceph_deploy.cli][INFO  ]  ceph_conf                     : None
[ceph_deploy.cli][INFO  ]  cluster_network               : None
[ceph_deploy.cli][INFO  ]  default_release               : False
[ceph_deploy.cli][INFO  ]  fsid                          : None
[ceph_deploy.new][DEBUG ] Creating new cluster named ceph
[ceph_deploy.new][INFO  ] making sure passwordless SSH succeeds
[mon1][DEBUG ] connected to host: admin
[mon1][INFO  ] Running command: ssh -CT -o BatchMode=yes mon1
[mon1][DEBUG ] connection detected need for sudo
[mon1][DEBUG ] connected to host: mon1
[mon1][DEBUG ] detect platform information from remote host
[mon1][DEBUG ] detect machine type
[mon1][DEBUG ] find the location of an executable
[mon1][INFO  ] Running command: sudo /usr/sbin/ip link show
[mon1][INFO  ] Running command: sudo /usr/sbin/ip addr show
[mon1][DEBUG ] IP addresses found: [u'192.168.1.196']
[ceph_deploy.new][DEBUG ] Resolving host mon1
[ceph_deploy.new][DEBUG ] Monitor mon1 at 192.168.1.196
[ceph_deploy.new][DEBUG ] Monitor initial members are ['mon1']
[ceph_deploy.new][DEBUG ] Monitor addrs are ['192.168.1.196']
[ceph_deploy.new][DEBUG ] Creating a random mon key...
[ceph_deploy.new][DEBUG ] Writing monitor keyring to ceph.mon.keyring...
[ceph_deploy.new][DEBUG ] Writing initial config to ceph.conf...

Edit ceph.conf and add the following to the [global] block to define the network the cluster runs in.

# Your network address
public network = 192.168.1.1/24
osd pool default size = 2
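
After the edit, the [global] block should look roughly like the following. The fsid and monitor address are whatever ceph-deploy generated (the values below match the cluster shown later on this page):

[global]
fsid = 2b45a14a-4ac9-4fac-ae88-9b117d0147dd
mon_initial_members = mon1
mon_host = 192.168.1.196
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx

# Your network address
public network = 192.168.1.1/24
osd pool default size = 2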

Install Ceph:

$ ceph-deploy install admin mon1 osd1 osd2 osd3

Install ceph-mon on mon1 and gather keys. If you don't do this, you will get an error like [ceph_deploy][ERROR ] RuntimeError: bootstrap-osd keyring not found; run 'gatherkeys'.

$ ceph-deploy mon create-initial
$ ceph-deploy gatherkeys mon1
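
If this succeeds, the cluster directory should contain something like the following keyrings alongside ceph.conf:

$ ls ~/cluster
ceph.bootstrap-mds.keyring  ceph.bootstrap-rgw.keyring   ceph.conf
ceph.bootstrap-osd.keyring  ceph.client.admin.keyring    ceph-deploy-ceph.log
ceph.mon.keyring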

Find all disks on the OSD nodes:

$ ceph-deploy disk list osd1 osd2 osd3
...
[osd3][INFO  ] Running command: sudo /usr/sbin/ceph-disk list
[osd3][DEBUG ] /dev/dm-0 other, xfs, mounted on /
[osd3][DEBUG ] /dev/dm-1 swap, swap
[osd3][DEBUG ] /dev/sda :
[osd3][DEBUG ]  /dev/sda2 other, LVM2_member
[osd3][DEBUG ]  /dev/sda1 other, xfs, mounted on /boot
[osd3][DEBUG ] /dev/sdb other, unknown
[osd3][DEBUG ] /dev/sr0 other, iso9660

After determining all the disks that are available on each node, clear and prepare them for use.

## Zap disks to clear any partitions
$ ceph-deploy disk zap osd1:/dev/sdb osd2:/dev/sdb osd3:/dev/sdb

## Prepare the disks for OSD use
$ ceph-deploy osd prepare osd1:/dev/sdb osd2:/dev/sdb osd3:/dev/sdb
...
[osd3][INFO  ] checking OSD status...
[osd3][DEBUG ] find the location of an executable
[osd3][INFO  ] Running command: sudo /bin/ceph --cluster=ceph osd stat --format=json
[ceph_deploy.osd][DEBUG ] Host osd3 is now ready for osd use.

## If all nodes are ready, activate them.
$ ceph-deploy osd activate osd1:/dev/sdb1 osd2:/dev/sdb1 osd3:/dev/sdb1

At this point, all OSD nodes should have a ceph journal and data partition created. Verify this with:

$ ceph-deploy disk list osd1 osd2 osd3
...
[osd3][INFO  ] Running command: sudo /usr/sbin/ceph-disk list
[osd3][DEBUG ] /dev/dm-0 other, xfs, mounted on /
[osd3][DEBUG ] /dev/dm-1 swap, swap
[osd3][DEBUG ] /dev/sda :
[osd3][DEBUG ]  /dev/sda2 other, LVM2_member
[osd3][DEBUG ]  /dev/sda1 other, xfs, mounted on /boot
[osd3][DEBUG ] /dev/sdb :
[osd3][DEBUG ]  /dev/sdb2 ceph journal, for /dev/sdb1
[osd3][DEBUG ]  /dev/sdb1 ceph data, active, cluster ceph, osd.2, journal /dev/sdb2
[osd3][DEBUG ] /dev/sr0 other, iso9660

Deploy the management keys across all nodes:

$ ceph-deploy admin admin mon1 osd1 osd2 osd3


On the monitor mon1 node:

[cephuser@mon1 ~]$ sudo chmod 644 /etc/ceph/ceph.client.admin.keyring
[cephuser@mon1 ~]$ ceph health
HEALTH_OK

[cephuser@mon1 ~]$ ceph -s
    cluster 2b45a14a-4ac9-4fac-ae88-9b117d0147dd
     health HEALTH_OK
     monmap e1: 1 mons at {mon1=192.168.1.196:6789/0}
            election epoch 3, quorum 0 mon1
     osdmap e15: 3 osds: 3 up, 3 in
            flags sortbitwise,require_jewel_osds
      pgmap v28: 64 pgs, 1 pools, 0 bytes data, 0 objects
            322 MB used, 61084 MB / 61406 MB avail
                  64 active+clean
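
A couple of other useful checks at this point:

## OSDs laid out in the CRUSH hierarchy, with their up/in state
[cephuser@mon1 ~]$ ceph osd tree

## Overall and per-pool space usage
[cephuser@mon1 ~]$ ceph df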


Ceph Block Volume

Create a new RBD (RADOS Block Device) on the cluster.

On the admin machine:

[root@admin ~]# rbd create storage --size 1024
[root@admin ~]# rbd ls
storage
[root@admin ~]# rbd --image storage info
rbd image 'storage':
        size 1024 MB in 256 objects
        order 22 (4096 kB objects)
        block_name_prefix: rbd_data.10226b8b4567
        format: 2
        features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
        flags:

Retrieve the client admin key from /etc/ceph/ceph.client.admin.keyring on the mon1 node.

On a client:

## Install the rbd utilities
# yum -y install ceph-common

## Load the RBD kernel module (supported since kernel 2.6.37)
# modprobe rbd

## Copy /etc/ceph/ceph.conf and /etc/ceph/ceph.client.admin.keyring over from mon1
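## For example (assumes root SSH access to mon1):
# scp root@mon1:/etc/ceph/ceph.conf /etc/ceph/
# scp root@mon1:/etc/ceph/ceph.client.admin.keyring /etc/ceph/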

Then map the volume.

# rbd map storage
rbd: sysfs write failed
RBD image feature set mismatch. You can disable features unsupported by the kernel with "rbd feature disable storage object-map fast-diff".
In some cases useful info is found in syslog - try "dmesg | tail".
rbd: map failed: (6) No such device or address

# dmesg -T | tail
[Tue Sep 10 15:41:05 2019] libceph: mon1 192.168.1.196:6789 session established
[Tue Sep 10 15:41:05 2019] libceph: client4134 fsid 2b45a14a-4ac9-4fac-ae88-9b117d0147dd
[Tue Sep 10 15:41:05 2019] rbd: image storage: image uses unsupported features: 0x18

# rbd feature disable storage object-map fast-diff
# rbd map storage
/dev/rbd0

# fdisk -l /dev/rbd0
Disk /dev/rbd0: 1 GiB, 1073741824 bytes, 2097152 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 65536 bytes / 65536 bytes

From there, you can use /dev/rbd0 as any other block device.
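
For example, to put a filesystem on the device and mount it (mkfs.xfs is just one choice):

# mkfs.xfs /dev/rbd0
# mkdir -p /mnt/storage
# mount /dev/rbd0 /mnt/storage
# df -h /mnt/storage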
