title: [WIP] Migrating Ceph Nautilus into Kubernetes + Rook
|
|
---
|
|
pub_date: 2022-08-27
|
|
---
|
|
author: ungleich storage team
|
|
---
|
|
twitter_handle: ungleich
|
|
---
|
|
_hidden: no
|
|
---
|
|
_discoverable: yes
|
|
---
|
|
abstract:
|
|
How we move our Ceph clusters into kubernetes
|
|
---
|
|
body:
|
|
|
|
## Introduction
|
|
|
|
At ungleich we are running multiple Ceph clusters. Some of them are
|
|
running Ceph Nautilus (14.x) based on
|
|
[Devuan](https://www.devuan.org/). Our newer Ceph Pacific (16.x)
|
|
clusters are running based on [Rook](https://rook.io/) on
|
|
[Kubernetes](https://kubernetes.io/) on top of
|
|
[Alpine Linux](https://alpinelinux.org/).
|
|
|
|
In this blog article we will describe how to migrate
|
|
Ceph/Native/Devuan to Ceph/k8s+rook/Alpine Linux.
|
|
|
|
## Work in Progress [WIP]
|
|
|
|
This blog article is a work in progress. The migration planning has
started, however the migration itself has not been finished yet. This
article documents the different paths we take during the migration.
|
|
|
|
## The Plan
|
|
|
|
To continue operating the cluster during the migration, the following
|
|
steps are planned:
|
|
|
|
* Set up a k8s cluster that can potentially communicate with the
existing ceph cluster
* Use the [disaster
recovery](https://rook.io/docs/rook/v1.9/Troubleshooting/disaster-recovery/)
guidelines from rook to modify the rook configuration to use the
previous fsid
* Spin up ceph monitors and ceph managers in rook
* Retire the existing monitors
* Shut down a ceph OSD node, remove its OS disk, boot it with Alpine
Linux
* Join the node into the k8s cluster
* Have rook pick up the existing disks and start the OSDs
* Repeat if successful
* Migrate to ceph pacific
|
|
|
|
### Original cluster
|
|
|
|
The target ceph cluster we want to migrate lives in the 2a0a:e5c0::/64
|
|
network. Ceph is using:
|
|
|
|
```
|
|
public network = 2a0a:e5c0:0:0::/64
|
|
cluster network = 2a0a:e5c0:0:0::/64
|
|
```
|
|
|
|
### Kubernetes cluster networking inside the ceph network
|
|
|
|
To be able to communicate with the existing OSDs, we will be using
subnets of 2a0a:e5c0::/64 for kubernetes. As these subnets
are part of the on-link network 2a0a:e5c0::/64, we will use BGP
routing on the existing ceph nodes to create more specific routes into
the kubernetes cluster.
|
|
|
|
As we plan to use either [cilium](https://cilium.io/) or
|
|
[calico](https://www.tigera.io/project-calico/) as the CNI, we can
|
|
configure kubernetes to directly BGP peer with the existing Ceph
|
|
nodes.
|
|
|
|
## The setup
|
|
|
|
### Kubernetes Bootstrap
|
|
|
|
As usual we bootstrap 3 control plane nodes using kubeadm. The proxy
for the API resides in a different kubernetes cluster.
|
|
|
|
We run
|
|
|
|
```
|
|
kubeadm init --config kubeadm.yaml
|
|
```
|
|
|
|
on the first node and then join the other two control plane nodes. As
usual, the workers are joined last.
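
For completeness, a sketch of how the join usually looks with kubeadm;
the endpoint, token and certificate key below are placeholders, not
values from our cluster:

```
# on the first control plane node: re-upload the certs and print a join command
kubeadm init phase upload-certs --upload-certs
kubeadm token create --print-join-command

# on the other control plane nodes: the printed command plus --control-plane
kubeadm join <apiserver-endpoint>:6443 --token <token> \
    --discovery-token-ca-cert-hash sha256:<hash> \
    --control-plane --certificate-key <key>

# on the workers: the same join command without --control-plane
```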
|
|
|
|
### k8s Networking / CNI
|
|
|
|
For this setup we are using calico as described in the
|
|
[ungleich kubernetes
|
|
manual](https://redmine.ungleich.ch/projects/open-infrastructure/wiki/The_ungleich_kubernetes_infrastructure#section-23).
|
|
|
|
```
|
|
VERSION=v3.23.3
|
|
helm repo add projectcalico https://docs.projectcalico.org/charts
|
|
helm upgrade --install --namespace tigera calico projectcalico/tigera-operator --version $VERSION --create-namespace
|
|
```
|
|
|
|
### BGP Networking on the old nodes
|
|
|
|
To be able to import the BGP routes from Kubernetes, all old / native
|
|
hosts will run bird. The installation and configuration is as follows:
|
|
|
|
```
|
|
apt-get update
|
|
apt-get install -y bird2
|
|
|
|
router_id=$(hostname | sed 's/server//')
|
|
|
|
cat > /etc/bird/bird.conf <<EOF
|
|
|
|
router id $router_id;
|
|
|
|
log syslog all;
|
|
protocol device {
|
|
}
|
|
# We are only interested in IPv6, skip another section for IPv4
|
|
protocol kernel {
|
|
ipv6 { export all; };
|
|
}
|
|
protocol bgp k8s {
|
|
local as 65530;
|
|
neighbor range 2a0a:e5c0::/64 as 65533;
|
|
dynamic name "k8s_"; direct;
|
|
|
|
ipv6 {
|
|
import filter { if net.len > 64 then accept; else reject; };
|
|
export none;
|
|
};
|
|
}
|
|
EOF
|
|
/etc/init.d/bird restart
|
|
|
|
```
|
|
|
|
The router id must be adjusted for every host. As all hosts have a
unique number, we use that number as the router id.
The bird configuration uses dynamic peers so that any k8s
node in the network can peer with the old servers.
|
|
|
|
We also use a filter to avoid receiving /64 routes, as they would
overlap with the on-link route.
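
Once the kubernetes nodes start peering (next section), the effect of
the dynamic peers and the filter can be checked with birdc; a small
sketch (peer names like k8s_1 are assigned dynamically by bird):

```
# list the dynamically created BGP sessions
birdc show protocols | grep k8s_

# show the more specific routes learned from one of the k8s peers
birdc show route protocol k8s_1
```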
|
|
|
|
### BGP networking in Kubernetes
|
|
|
|
Calico supports BGP peering and we use a rather standard calico
|
|
configuration:
|
|
|
|
```
|
|
apiVersion: projectcalico.org/v3
|
|
kind: BGPConfiguration
|
|
metadata:
|
|
name: default
|
|
spec:
|
|
logSeverityScreen: Info
|
|
nodeToNodeMeshEnabled: true
|
|
asNumber: 65533
|
|
serviceClusterIPs:
|
|
- cidr: 2a0a:e5c0:0:aaaa::/108
|
|
serviceExternalIPs:
|
|
- cidr: 2a0a:e5c0:0:aaaa::/108
|
|
```
|
|
|
|
Plus for each server and router we create a BGPPeer:
|
|
|
|
```
|
|
apiVersion: projectcalico.org/v3
|
|
kind: BGPPeer
|
|
metadata:
|
|
name: serverXX
|
|
spec:
|
|
peerIP: 2a0a:e5c0::XX
|
|
asNumber: 65530
|
|
keepOriginalNextHop: true
|
|
```
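
Since there is one BGPPeer per old server, the manifests can also be
generated with a small shell loop instead of writing them by hand; a
sketch, with the server numbers below being placeholders:

```
for XX in 11 12 13; do
cat <<EOF
---
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: server${XX}
spec:
  peerIP: 2a0a:e5c0::${XX}
  asNumber: 65530
  keepOriginalNextHop: true
EOF
done > calico-bgp-peers.yaml

# then apply the generated file with calicoctl as shown below
```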
|
|
|
|
We apply the whole configuration using calicoctl:
|
|
|
|
```
|
|
./calicoctl create -f - < ~/vcs/k8s-config/bootstrap/p5-cow/calico-bgp.yaml
|
|
```
|
|
|
|
And a few seconds later we can observe the established BGP sessions
(and thus the routes) on the old / native hosts:
|
|
|
|
```
|
|
bird> show protocols
|
|
Name Proto Table State Since Info
|
|
device1 Device --- up 23:09:01.393
|
|
kernel1 Kernel master6 up 23:09:01.393
|
|
k8s BGP --- start 23:09:01.393 Passive
|
|
k8s_1 BGP --- up 23:33:01.215 Established
|
|
k8s_2 BGP --- up 23:33:01.215 Established
|
|
k8s_3 BGP --- up 23:33:01.420 Established
|
|
k8s_4 BGP --- up 23:33:01.215 Established
|
|
k8s_5 BGP --- up 23:33:01.215 Established
|
|
|
|
```
|
|
|
|
### Testing networking
|
|
|
|
To verify that the new cluster is working properly, we can deploy a
|
|
tiny test deployment and see if it is globally reachable:
|
|
|
|
```
|
|
apiVersion: apps/v1
|
|
kind: Deployment
|
|
metadata:
|
|
name: nginx-deployment
|
|
spec:
|
|
selector:
|
|
matchLabels:
|
|
app: nginx
|
|
replicas: 2
|
|
template:
|
|
metadata:
|
|
labels:
|
|
app: nginx
|
|
spec:
|
|
containers:
|
|
- name: nginx
|
|
image: nginx:1.20.0-alpine
|
|
ports:
|
|
- containerPort: 80
|
|
```
|
|
|
|
And the corresponding service:
|
|
|
|
```
|
|
apiVersion: v1
|
|
kind: Service
|
|
metadata:
|
|
name: nginx-service
|
|
spec:
|
|
selector:
|
|
app: nginx
|
|
ports:
|
|
- protocol: TCP
|
|
port: 80
|
|
```
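
The address used for the curl test further below is simply the
ClusterIP that kubernetes assigned to this service from the
2a0a:e5c0:0:aaaa::/108 range; it can be looked up with kubectl:

```
# show the service including its ClusterIP
kubectl get svc nginx-service

# or print only the assigned IP
kubectl get svc nginx-service -o jsonpath='{.spec.clusterIP}'
```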
|
|
|
|
Using curl to access a sample service from the outside shows that
|
|
networking is working:
|
|
|
|
```
|
|
% curl -v http://[2a0a:e5c0:0:aaaa::e3c9]
|
|
* Trying 2a0a:e5c0:0:aaaa::e3c9:80...
|
|
* Connected to 2a0a:e5c0:0:aaaa::e3c9 (2a0a:e5c0:0:aaaa::e3c9) port 80 (#0)
|
|
> GET / HTTP/1.1
|
|
> Host: [2a0a:e5c0:0:aaaa::e3c9]
|
|
> User-Agent: curl/7.84.0
|
|
> Accept: */*
|
|
>
|
|
* Mark bundle as not supporting multiuse
|
|
< HTTP/1.1 200 OK
|
|
< Server: nginx/1.20.0
|
|
< Date: Sat, 27 Aug 2022 22:35:49 GMT
|
|
< Content-Type: text/html
|
|
< Content-Length: 612
|
|
< Last-Modified: Tue, 20 Apr 2021 16:11:05 GMT
|
|
< Connection: keep-alive
|
|
< ETag: "607efd19-264"
|
|
< Accept-Ranges: bytes
|
|
<
|
|
<!DOCTYPE html>
|
|
<html>
|
|
<head>
|
|
<title>Welcome to nginx!</title>
|
|
<style>
|
|
body {
|
|
width: 35em;
|
|
margin: 0 auto;
|
|
font-family: Tahoma, Verdana, Arial, sans-serif;
|
|
}
|
|
</style>
|
|
</head>
|
|
<body>
|
|
<h1>Welcome to nginx!</h1>
|
|
<p>If you see this page, the nginx web server is successfully installed and
|
|
working. Further configuration is required.</p>
|
|
|
|
<p>For online documentation and support please refer to
|
|
<a href="http://nginx.org/">nginx.org</a>.<br/>
|
|
Commercial support is available at
|
|
<a href="http://nginx.com/">nginx.com</a>.</p>
|
|
|
|
<p><em>Thank you for using nginx.</em></p>
|
|
</body>
|
|
</html>
|
|
* Connection #0 to host 2a0a:e5c0:0:aaaa::e3c9 left intact
|
|
```
|
|
|
|
So far we have found one issue:

* Sometimes the old/native servers can reach the service, sometimes
they get a timeout

Old calico issues on github mention that overlapping
pod/node CIDR networks might be a problem. Additionally, we cannot use
kubeadm to initialise the pod subnet as a proper subnet of the node
subnet:
|
|
|
|
```
|
|
[00:15] server57.place5:~# kubeadm init --service-cidr 2a0a:e5c0:0:cccc::/108 --pod-network-cidr 2a0a:e5c0::/100
|
|
I0829 00:16:38.659341 19400 version.go:255] remote version is much newer: v1.25.0; falling back to: stable-1.24
|
|
podSubnet: Invalid value: "2a0a:e5c0::/100": the size of pod subnet with mask 100 is smaller than the size of node subnet with mask 64
|
|
To see the stack trace of this error execute with --v=5 or higher
|
|
[00:16] server57.place5:~#
|
|
```
|
|
|
|
### Networking 2022-09-03
|
|
|
|
* Instead of trying to merge the cluster networks, we will use
|
|
separate ranges
|
|
* According to the [ceph users mailing list
|
|
discussion](https://www.spinics.net/lists/ceph-users/msg73421.html)
|
|
it is actually not necessary for mons/osds to be in the same
|
|
network. In fact, we might be able to remove these settings
|
|
completely.
|
|
|
|
So today we start with
|
|
|
|
* podSubnet: 2a0a:e5c0:0:14::/64
|
|
* serviceSubnet: 2a0a:e5c0:0:15::/108
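
For reference, a minimal sketch of what the corresponding kubeadm.yaml
can look like with these two ranges (only the networking section is
shown, everything else stays at its defaults):

```
cat > kubeadm.yaml <<EOF
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
networking:
  podSubnet: 2a0a:e5c0:0:14::/64
  serviceSubnet: 2a0a:e5c0:0:15::/108
EOF

kubeadm init --config kubeadm.yaml
```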
|
|
|
|
Using BGP and calico, the kubernetes cluster is set up "as usual" (in
ungleich terms).
|
|
|
|
### Ceph.conf change
|
|
|
|
Originally our ceph.conf contained:
|
|
|
|
```
|
|
public network = 2a0a:e5c0:0:0::/64
|
|
cluster network = 2a0a:e5c0:0:0::/64
|
|
```
|
|
|
|
As of today these settings are removed and all daemons have been
restarted, allowing the native cluster to talk to the kubernetes cluster.
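
On the native Devuan nodes this boils down to something like the
following sketch; the init script invocation assumes the classic
sysvinit ceph script that these hosts use:

```
# drop the public/cluster network settings from ceph.conf
# (the keys start at the beginning of their lines in our ceph.conf)
sed -i '/^public network/d; /^cluster network/d' /etc/ceph/ceph.conf

# restart the ceph daemons on this host, one type at a time
/etc/init.d/ceph restart mon
/etc/init.d/ceph restart osd
```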
|
|
|
|
### Setting up rook
|
|
|
|
Usually we deploy rook via argocd. However, as we want to be able to
intervene manually easily, we will first bootstrap rook via helm
directly and turn off various services:
|
|
|
|
```
|
|
helm repo add rook https://charts.rook.io/release
|
|
helm repo update
|
|
```
|
|
|
|
We will use rook 1.8, assuming it is the last version to support Ceph
Nautilus, which is our current ceph version. The latest 1.8 version is
1.8.10 at the moment. (As we find out further below, rook 1.8 actually
requires Octopus, so we will have to downgrade to rook 1.7.)
|
|
|
|
```
|
|
helm upgrade --install --namespace rook-ceph --create-namespace --version v1.8.10 rook-ceph rook/rook-ceph
|
|
```
|
|
|
|
### Joining the 2 clusters, step 1: monitors and managers
|
|
|
|
In the first step we want to add rook based monitors and managers
|
|
and replace the native ones. For rook to be able to talk to our
|
|
existing cluster, it needs to know
|
|
|
|
* the current monitors/managers ("the monmap")
|
|
* the right keys to talk to the existing cluster
|
|
* the fsid
|
|
|
|
As we are using v1.8, we will follow
[the guidelines for disaster recovery of rook
1.8](https://www.rook.io/docs/rook/v1.8/ceph-disaster-recovery.html).
|
|
|
|
Later we will need to create all the configurations so that rook knows
|
|
about the different pools.
|
|
|
|
### Rook: CephCluster
|
|
|
|
Rook has a configuration of type `CephCluster` that typically looks
|
|
something like this:
|
|
|
|
```
|
|
apiVersion: ceph.rook.io/v1
|
|
kind: CephCluster
|
|
metadata:
|
|
name: rook-ceph
|
|
namespace: rook-ceph
|
|
spec:
|
|
cephVersion:
|
|
# see the "Cluster Settings" section below for more details on which image of ceph to run
|
|
image: quay.io/ceph/ceph:{{ .Chart.AppVersion }}
|
|
dataDirHostPath: /var/lib/rook
|
|
mon:
|
|
count: 5
|
|
allowMultiplePerNode: false
|
|
storage:
|
|
useAllNodes: true
|
|
useAllDevices: true
|
|
onlyApplyOSDPlacement: false
|
|
mgr:
|
|
count: 1
|
|
modules:
|
|
- name: pg_autoscaler
|
|
enabled: true
|
|
network:
|
|
ipFamily: "IPv6"
|
|
dualStack: false
|
|
crashCollector:
|
|
disable: false
|
|
# Uncomment daysToRetain to prune ceph crash entries older than the
|
|
# specified number of days.
|
|
daysToRetain: 30
|
|
```
|
|
|
|
For migrating, we don't want rook in the first stage to create any
|
|
OSDs. So we will replace `useAllNodes: true` with `useAllNodes: false`
|
|
and `useAllDevices: true` also with `useAllDevices: false`.
|
|
|
|
|
|
### Extracting a monmap
|
|
|
|
To get access to the existing monmap, we can export it from the native
|
|
cluster using `ceph-mon -i {mon-id} --extract-monmap {map-path}`.
|
|
More details can be found on the [documentation for adding and
|
|
removing ceph
|
|
monitors](https://docs.ceph.com/en/latest/rados/operations/add-or-rm-mons/).
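
A sketch of how this looks on one of the native monitors; the mon id
red3 and the paths are examples, and the monitor has to be stopped
while its store is read:

```
# stop the local monitor (sysvinit style on these hosts)
/etc/init.d/ceph stop mon.red3

# extract the monmap from the local mon store
ceph-mon -i red3 --extract-monmap /tmp/monmap

# inspect it: the fsid and monitor list should match the native cluster
monmaptool --print /tmp/monmap

# start the monitor again
/etc/init.d/ceph start mon.red3
```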
|
|
|
|
|
|
|
|
### Rook and Ceph pools
|
|
|
|
Rook uses `CephBlockPool` to describe ceph pools as follows:
|
|
|
|
```
|
|
apiVersion: ceph.rook.io/v1
|
|
kind: CephBlockPool
|
|
metadata:
|
|
name: hdd
|
|
namespace: rook-ceph
|
|
spec:
|
|
failureDomain: host
|
|
replicated:
|
|
size: 3
|
|
deviceClass: hdd
|
|
```
|
|
|
|
In this particular cluster we have 2 pools:
|
|
|
|
- one (ssd based, device class = ssd)
|
|
- hdd (hdd based, device class = hdd-big)
|
|
|
|
The device class "hdd-big" is specific to this cluster as it used to
|
|
contain 2.5" and 3.5" HDDs in different pools.
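
Before writing the CephBlockPool objects, the pool names and device
classes can be double checked on the native cluster:

```
# list the device classes known to the crush map
ceph osd crush class ls

# list the pools together with their size and crush rule
ceph osd pool ls detail

# inspect the crush rules (the device class shows up as e.g. default~hdd-big)
ceph osd crush rule dump
```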
|
|
|
|
|
|
### [old] Analysing the ceph cluster configuration
|
|
|
|
Taking the view from the old cluster, the following items are
|
|
important for adding new services/nodes:
|
|
|
|
* We have a specific fsid that needs to be known
|
|
* The expectation would be to find that fsid in a configmap/secret in rook
|
|
* We have a list of running monitors
|
|
* This is part of the monmap and ceph.conf
|
|
* ceph.conf is used for finding the initial contact point
|
|
* Afterwards the information is provided by the monitors
|
|
* For rook it would be expected to have a configmap/secret listing
|
|
the current monitors
|
|
* The native clusters have a "ceph.client.admin.keyring" deployed which
|
|
allows adding and removing resources.
|
|
* Rook probably has a secret for keyrings
|
|
* Maybe multiple depending on how services are organised
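
A sketch of how the items above can be collected on one of the native
nodes:

```
# the fsid of the cluster
ceph fsid

# the current monitors ("the monmap" in human readable form)
ceph mon dump

# the admin keyring that rook's secrets need to contain
cat /etc/ceph/ceph.client.admin.keyring

# the mon. key
ceph auth get mon.
```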
|
|
|
|
### Analysing the rook configurations
|
|
|
|
Taking the opposite view, we can also check out a running rook cluster
and the rook disaster recovery documentation to identify what to
modify.
|
|
|
|
Let's have a look at the secrets first:
|
|
|
|
```
|
|
cluster-peer-token-rook-ceph kubernetes.io/rook 2 320d
|
|
default-token-xm9xs kubernetes.io/service-account-token 3 320d
|
|
rook-ceph-admin-keyring kubernetes.io/rook 1 320d
|
|
rook-ceph-admission-controller kubernetes.io/tls 3 29d
|
|
rook-ceph-cmd-reporter-token-5mh88 kubernetes.io/service-account-token 3 320d
|
|
rook-ceph-config kubernetes.io/rook 2 320d
|
|
rook-ceph-crash-collector-keyring kubernetes.io/rook 1 320d
|
|
rook-ceph-mgr-a-keyring kubernetes.io/rook 1 320d
|
|
rook-ceph-mgr-b-keyring kubernetes.io/rook 1 320d
|
|
rook-ceph-mgr-token-ktt2m kubernetes.io/service-account-token 3 320d
|
|
rook-ceph-mon kubernetes.io/rook 4 320d
|
|
rook-ceph-mons-keyring kubernetes.io/rook 1 320d
|
|
rook-ceph-osd-token-8m6lb kubernetes.io/service-account-token 3 320d
|
|
rook-ceph-purge-osd-token-hznnk kubernetes.io/service-account-token 3 320d
|
|
rook-ceph-rgw-token-wlzbc kubernetes.io/service-account-token 3 134d
|
|
rook-ceph-system-token-lxclf kubernetes.io/service-account-token 3 320d
|
|
rook-csi-cephfs-node kubernetes.io/rook 2 320d
|
|
rook-csi-cephfs-plugin-sa-token-hkq2g kubernetes.io/service-account-token 3 320d
|
|
rook-csi-cephfs-provisioner kubernetes.io/rook 2 320d
|
|
rook-csi-cephfs-provisioner-sa-token-tb78d kubernetes.io/service-account-token 3 320d
|
|
rook-csi-rbd-node kubernetes.io/rook 2 320d
|
|
rook-csi-rbd-plugin-sa-token-dhhq6 kubernetes.io/service-account-token 3 320d
|
|
rook-csi-rbd-provisioner kubernetes.io/rook 2 320d
|
|
rook-csi-rbd-provisioner-sa-token-lhr4l kubernetes.io/service-account-token 3 320d
|
|
```
|
|
|
|
TBC
|
|
|
|
### Creating additional resources after the cluster is bootstrapped
|
|
|
|
To let rook know what should be there, we already create the two
|
|
`CephBlockPool` instances that match the existing pools:
|
|
|
|
```
apiVersion: ceph.rook.io/v1
|
|
kind: CephBlockPool
|
|
metadata:
|
|
name: one
|
|
namespace: rook-ceph
|
|
spec:
|
|
failureDomain: host
|
|
replicated:
|
|
size: 3
|
|
deviceClass: ssd
|
|
```
|
|
|
|
And for the hdd based pool:
|
|
|
|
```
|
|
apiVersion: ceph.rook.io/v1
|
|
kind: CephBlockPool
|
|
metadata:
|
|
name: hdd
|
|
namespace: rook-ceph
|
|
spec:
|
|
failureDomain: host
|
|
replicated:
|
|
size: 3
|
|
deviceClass: hdd-big
|
|
```
|
|
|
|
Saving both of these in ceph-blockpools.yaml and applying it:
|
|
|
|
```
|
|
kubectl -n rook-ceph apply -f ceph-blockpools.yaml
|
|
```
|
|
|
|
### Configuring ceph after the operator deployment
|
|
|
|
As soon as the operator and the crds have been deployed, we deploy the
|
|
following
|
|
[CephCluster](https://rook.io/docs/rook/v1.8/ceph-cluster-crd.html)
|
|
configuration:
|
|
|
|
```
|
|
apiVersion: ceph.rook.io/v1
|
|
kind: CephCluster
|
|
metadata:
|
|
name: rook-ceph
|
|
namespace: rook-ceph
|
|
spec:
|
|
cephVersion:
|
|
image: quay.io/ceph/ceph:v14.2.21
|
|
dataDirHostPath: /var/lib/rook
|
|
mon:
|
|
count: 5
|
|
allowMultiplePerNode: false
|
|
storage:
|
|
useAllNodes: false
|
|
useAllDevices: false
|
|
onlyApplyOSDPlacement: false
|
|
mgr:
|
|
count: 1
|
|
modules:
|
|
- name: pg_autoscaler
|
|
enabled: true
|
|
network:
|
|
ipFamily: "IPv6"
|
|
dualStack: false
|
|
crashCollector:
|
|
disable: false
|
|
# Uncomment daysToRetain to prune ceph crash entries older than the
|
|
# specified number of days.
|
|
daysToRetain: 30
|
|
```
|
|
|
|
We wait for the cluster to initialise and stabilise before applying
|
|
changes. It is important to note that we use the ceph image version
v14.2.21, which is the same version as the native cluster.
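
The exact version of the native cluster can be double checked with
ceph itself, so that the image tag above really matches:

```
# on one of the native nodes: versions of all running daemons
ceph versions

# or only the locally installed binary
ceph --version
```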
|
|
|
|
|
|
|
|
### rook v1.8 is incompatible with ceph nautilus
|
|
|
|
After deploying the rook operator, the following error message is
|
|
printed in its logs:
|
|
|
|
```
|
|
2022-09-03 15:14:03.543925 E | ceph-cluster-controller: failed to reconcile CephCluster "rook-ceph/rook-ceph". failed to reconcile cluster "rook-ceph": failed to configure local ceph cluster: failed the ceph version check: the version does not meet the minimum version "15.2.0-0 octopus"
|
|
```
|
|
|
|
So we need to downgrade to rook v1.7. Using `helm search repo
|
|
rook/rook-ceph --versions` we identify the latest usable version
|
|
should be `v1.7.11`.
|
|
|
|
We start the downgrade process using
|
|
|
|
```
|
|
helm upgrade --install --namespace rook-ceph --create-namespace --version v1.7.11 rook-ceph rook/rook-ceph
|
|
```
|
|
|
|
After downgrading, the operator starts the canary monitors and
continues to bootstrap the cluster.
|
|
|
|
### The ceph-toolbox
|
|
|
|
To be able to view the current cluster status, we also deploy the
|
|
ceph-toolbox for interacting with rook:
|
|
|
|
```
|
|
apiVersion: apps/v1
|
|
kind: Deployment
|
|
metadata:
|
|
name: rook-ceph-tools
|
|
namespace: rook-ceph # namespace:cluster
|
|
labels:
|
|
app: rook-ceph-tools
|
|
spec:
|
|
replicas: 1
|
|
selector:
|
|
matchLabels:
|
|
app: rook-ceph-tools
|
|
template:
|
|
metadata:
|
|
labels:
|
|
app: rook-ceph-tools
|
|
spec:
|
|
dnsPolicy: ClusterFirstWithHostNet
|
|
containers:
|
|
- name: rook-ceph-tools
|
|
image: rook/ceph:v1.7.11
|
|
command: ["/bin/bash"]
|
|
args: ["-m", "-c", "/usr/local/bin/toolbox.sh"]
|
|
imagePullPolicy: IfNotPresent
|
|
tty: true
|
|
securityContext:
|
|
runAsNonRoot: true
|
|
runAsUser: 2016
|
|
runAsGroup: 2016
|
|
env:
|
|
- name: ROOK_CEPH_USERNAME
|
|
valueFrom:
|
|
secretKeyRef:
|
|
name: rook-ceph-mon
|
|
key: ceph-username
|
|
- name: ROOK_CEPH_SECRET
|
|
valueFrom:
|
|
secretKeyRef:
|
|
name: rook-ceph-mon
|
|
key: ceph-secret
|
|
volumeMounts:
|
|
- mountPath: /etc/ceph
|
|
name: ceph-config
|
|
- name: mon-endpoint-volume
|
|
mountPath: /etc/rook
|
|
volumes:
|
|
- name: mon-endpoint-volume
|
|
configMap:
|
|
name: rook-ceph-mon-endpoints
|
|
items:
|
|
- key: data
|
|
path: mon-endpoints
|
|
- name: ceph-config
|
|
emptyDir: {}
|
|
tolerations:
|
|
- key: "node.kubernetes.io/unreachable"
|
|
operator: "Exists"
|
|
effect: "NoExecute"
|
|
tolerationSeconds: 5
|
|
```
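
Once the toolbox pod is running, ceph commands can be run through it,
for example:

```
# get a shell in the toolbox
kubectl -n rook-ceph exec -ti deploy/rook-ceph-tools -- bash

# or run single commands directly
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph -s
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd tree
```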
|
|
|
|
### Checking the deployments
|
|
|
|
After the rook-operator finished deploying, the following deployments
|
|
are visible in kubernetes:
|
|
|
|
```
|
|
[17:25] blind:~% kubectl -n rook-ceph get deployment
|
|
NAME READY UP-TO-DATE AVAILABLE AGE
|
|
csi-cephfsplugin-provisioner 2/2 2 2 21m
|
|
csi-rbdplugin-provisioner 2/2 2 2 21m
|
|
rook-ceph-crashcollector-server48 1/1 1 1 2m3s
|
|
rook-ceph-crashcollector-server52 1/1 1 1 2m24s
|
|
rook-ceph-crashcollector-server53 1/1 1 1 2m2s
|
|
rook-ceph-crashcollector-server56 1/1 1 1 2m17s
|
|
rook-ceph-crashcollector-server57 1/1 1 1 2m1s
|
|
rook-ceph-mgr-a 1/1 1 1 2m3s
|
|
rook-ceph-mon-a 1/1 1 1 10m
|
|
rook-ceph-mon-b 1/1 1 1 8m3s
|
|
rook-ceph-mon-c 1/1 1 1 5m55s
|
|
rook-ceph-mon-d 1/1 1 1 5m33s
|
|
rook-ceph-mon-e 1/1 1 1 4m32s
|
|
rook-ceph-operator 1/1 1 1 102m
|
|
rook-ceph-tools 1/1 1 1 17m
|
|
```
|
|
|
|
Relevant for us are the mgr, mon and operator deployments. To stop the
cluster, we will shut down the deployments in the following order:

* rook-ceph-operator: first, to prevent it from recreating the other
deployments
|
|
|
|
### Data / configuration comparison
|
|
|
|
Logging into a host that is running mon-a, we find the following data
|
|
in it:
|
|
|
|
```
|
|
[17:36] server56.place5:/var/lib/rook# find
|
|
.
|
|
./mon-a
|
|
./mon-a/data
|
|
./mon-a/data/keyring
|
|
./mon-a/data/min_mon_release
|
|
./mon-a/data/store.db
|
|
./mon-a/data/store.db/LOCK
|
|
./mon-a/data/store.db/000006.log
|
|
./mon-a/data/store.db/000004.sst
|
|
./mon-a/data/store.db/CURRENT
|
|
./mon-a/data/store.db/MANIFEST-000005
|
|
./mon-a/data/store.db/OPTIONS-000008
|
|
./mon-a/data/store.db/OPTIONS-000005
|
|
./mon-a/data/store.db/IDENTITY
|
|
./mon-a/data/kv_backend
|
|
./rook-ceph
|
|
./rook-ceph/crash
|
|
./rook-ceph/crash/posted
|
|
./rook-ceph/log
|
|
```
|
|
|
|
Which is pretty similar to the native nodes:
|
|
|
|
```
|
|
[17:37:50] red3.place5:/var/lib/ceph/mon/ceph-red3# find
|
|
.
|
|
./sysvinit
|
|
./keyring
|
|
./min_mon_release
|
|
./kv_backend
|
|
./store.db
|
|
./store.db/1959645.sst
|
|
./store.db/1959800.sst
|
|
./store.db/OPTIONS-3617174
|
|
./store.db/2056973.sst
|
|
./store.db/3617348.sst
|
|
./store.db/OPTIONS-3599785
|
|
./store.db/MANIFEST-3617171
|
|
./store.db/1959695.sst
|
|
./store.db/CURRENT
|
|
./store.db/LOCK
|
|
./store.db/2524598.sst
|
|
./store.db/IDENTITY
|
|
./store.db/1959580.sst
|
|
./store.db/2514570.sst
|
|
./store.db/1959831.sst
|
|
./store.db/3617346.log
|
|
./store.db/2511347.sst
|
|
```
|
|
|
|
### Checking how monitors are created on native ceph
|
|
|
|
To prepare for the migration we take a step back and verify how
monitors are created in the native cluster. The script used for
monitor creation can be found on
|
|
[code.ungleich.ch](https://code.ungleich.ch/ungleich-public/ungleich-tools/src/branch/master/ceph/ceph-mon-create-start)
|
|
and contains the following logic:
|
|
|
|
* get "mon." key
|
|
* get the monmap
|
|
* Run ceph-mon --mkfs using the monmap and keyring
|
|
* Start it
|
|
|
|
In theory we could re-use these steps on a rook-deployed monitor to
join it into our existing cluster.
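
Condensed into commands, the logic of that script looks roughly like
this (a sketch, not the script itself):

```
ID=a    # the id of the monitor to (re)create; example value

# fetch the mon. key and the current monmap from the existing cluster
ceph auth get mon. -o /tmp/mon-key
ceph mon getmap -o /tmp/monmap

# create a fresh mon store from the monmap and keyring
ceph-mon -i "$ID" --mkfs --monmap /tmp/monmap --keyring /tmp/mon-key

# start the new monitor (the original script does this via its init system)
ceph-mon -i "$ID"
```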
|
|
|
|
### Checking the toolbox and monitor pods for migration
|
|
|
|
When the ceph-toolbox is deployed, we get a ceph.conf and a keyring in
/etc/ceph. The keyring is actually the admin keyring and allows us to
make modifications to the ceph cluster. The ceph.conf points to the
monitors and does not contain an fsid.
|
|
|
|
The ceph-toolbox gets this information via one configmap
("rook-ceph-mon-endpoints") and one secret ("rook-ceph-mon").
|
|
|
|
The monitor pods on the other hand have an empty ceph.conf and no
|
|
admin keyring deployed.
|
|
|
|
### Try 1: recreating a monitor inside the existing cluster
|
|
|
|
Let's try to reuse an existing monitor and join it into the existing
|
|
cluster. For this we will first shut down the rook-operator, to
prevent it from interfering with our migration. Then we
modify the relevant configmaps and secrets and import the settings
from the native cluster.
|
|
|
|
Lastly we will patch one of the monitor pods, inject the monmap from
|
|
the native cluster and then restart it.
|
|
|
|
Let's give it a try. First we shutdown the rook-ceph-operator:
|
|
|
|
```
|
|
% kubectl -n rook-ceph scale --replicas=0 deploy/rook-ceph-operator
|
|
deployment.apps/rook-ceph-operator scaled
|
|
```
|
|
|
|
Then we patch the mon deployments to not run a monitor, but only
|
|
sleep:
|
|
|
|
```
|
|
for mon in a b c d e; do
|
|
kubectl -n rook-ceph patch deployment rook-ceph-mon-${mon} -p \
|
|
'{"spec": {"template": {"spec": {"containers": [{"name": "mon", "command": ["sleep", "infinity"], "args": []}]}}}}';
|
|
|
|
kubectl -n rook-ceph patch deployment rook-ceph-mon-$mon --type='json' -p '[{"op":"remove", "path":"/spec/template/spec/containers/0/livenessProbe"}]'
|
|
done
|
|
```
|
|
|
|
Now the pod is restarted and when we exec into it, we see that
no monitor is running in it:
|
|
|
|
```
|
|
% kubectl -n rook-ceph exec -ti rook-ceph-mon-a-c9f8f554b-2fkhm -- sh
|
|
Defaulted container "mon" out of: mon, chown-container-data-dir (init), init-mon-fs (init)
|
|
sh-4.2# ps aux
|
|
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
|
|
root 1 0.0 0.0 4384 664 ? Ss 19:44 0:00 sleep infinity
|
|
root 7 0.0 0.0 11844 2844 pts/0 Ss 19:44 0:00 sh
|
|
root 13 0.0 0.0 51752 3384 pts/0 R+ 19:44 0:00 ps aux
|
|
sh-4.2#
|
|
```
|
|
|
|
Now for this pod to work with our existing cluster, we want to import
|
|
the monmap and join the monitor to the native cluster. As with any
|
|
mon, the data is stored below `/var/lib/ceph/mon/ceph-a/`.
|
|
|
|
Before importing the monmap, let's have a look at the different rook
|
|
configurations that influence the ceph components
|
|
|
|
### Looking at the ConfigMap in detail: rook-ceph-mon-endpoints
|
|
|
|
As the name says, it contains the list of monitor endpoints:
|
|
|
|
```
|
|
kubectl -n rook-ceph edit configmap rook-ceph-mon-endpoints
|
|
...
|
|
|
|
csi-cluster-config-json: '[{"clusterID":"rook-ceph","monitors":["[2a0a:e5c0:0:15::fc2]:6789"...
|
|
data: b=[2a0a:e5c0:0:15::9cd9]:6789,....
|
|
mapping: '{"node":{"a":{"Name":"server56","Hostname":"server56","Address":"2a0a:e5c0::...
|
|
```
|
|
|
|
As eventually we want the cluster and csi to use the in-cluster
|
|
monitors, we don't need to modify it right away.
|
|
|
|
### Looking at Secrets in detail: rook-ceph-admin-keyring
|
|
|
|
The first interesting secret is **rook-ceph-admin-keyring**, which
contains the admin keyring (the one generated by rook, of course), so
we can edit this secret and replace it with the client.admin keyring
from our native cluster.
|
|
|
|
We encode the original admin keyring using:
|
|
|
|
```
|
|
cat ceph.client.admin.keyring | base64 -w 0; echo ""
|
|
```
|
|
|
|
And then we update the secret:
|
|
|
|
```
|
|
kubectl -n rook-ceph edit secret rook-ceph-admin-keyring
|
|
```
|
|
|
|
[done]
|
|
|
|
### Looking at Secrets in detail: rook-ceph-config
|
|
|
|
This secret contains two keys, **mon_host** and
**mon_initial_members**. The **mon_host** key is a list of monitor
addresses, while **mon_initial_members** only contains the monitor
names, a, b, c, d and e.

The environment variable **ROOK_CEPH_MON_HOST** in the monitor
deployments is set to the **mon_host** key of that secret, so monitors
will read from it.
|
|
|
|
|
|
|
|
### Looking at Secrets in detail: rook-ceph-mon
|
|
|
|
This secret contains the following interesting keys:
|
|
|
|
* ceph-secret: the admin key (just the base64 key, without the section
around it) [done]
|
|
* ceph-username: "client.admin"
|
|
* fsid: the ceph cluster fsid
|
|
* mon-secret: The key of the [mon.] section
|
|
|
|
It's important to use `echo -n` when inserting
the keys or fsids.
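
For example, to get the base64 value of the fsid without a trailing
newline sneaking in:

```
# example: base64-encode the fsid without a trailing newline
echo -n "$(ceph fsid)" | base64 -w0; echo ""
```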
|
|
|
|
[done]
|
|
|
|
### Looking at Secrets in detail: rook-ceph-mons-keyring
|
|
|
|
Contains the key "keyring" containing the [mon.] and [client.admin]
|
|
sections:
|
|
|
|
|
|
```
|
|
[mon.]
|
|
key = ...
|
|
|
|
[client.admin]
|
|
key = ...
|
|
caps mds = "allow"
|
|
caps mgr = "allow *"
|
|
caps mon = "allow *"
|
|
caps osd = "allow *"
|
|
```
|
|
|
|
We encode this keyring using `base64 -w0 < ~/mon-and-client` and paste
the result into the secret.
|
|
|
|
[done]
|
|
|
|
### Importing the monmap
|
|
|
|
Getting the current monmap from the native cluster:
|
|
|
|
```
|
|
ceph mon getmap -o monmap-20220903
|
|
|
|
scp root@old-monitor:monmap-20220903 .
|
|
```
|
|
|
|
Adding it into the mon pod:
|
|
|
|
```
|
|
kubectl cp monmap-20220903 rook-ceph/rook-ceph-mon-a-6c46d4694-kxm5h:/tmp
|
|
```
|
|
|
|
Moving the old mon db away:
|
|
|
|
```
|
|
cd /var/lib/ceph/mon/ceph-a
|
|
mkdir _old
|
|
mv [a-z]* _old/
|
|
```
|
|
|
|
Recreating the mon fails, as the volume is mounted directly onto it:
|
|
|
|
```
|
|
% ceph-mon -i a --mkfs --monmap /tmp/monmap-20220903 --keyring /tmp/mon-key
|
|
2022-09-03 21:44:48.268 7f1a738f51c0 -1 '/var/lib/ceph/mon/ceph-a' already exists and is not empty: monitor may already exist
|
|
|
|
% mount | grep ceph-a
|
|
/dev/sda1 on /var/lib/ceph/mon/ceph-a type ext4 (rw,relatime)
|
|
|
|
```
|
|
|
|
We can work around this by creating each monitor's store on a pod with
a different name. So we create mons b to e on the mon-a pod and mon a
on any other pod.
|
|
|
|
On rook-ceph-mon-a:
|
|
|
|
```
|
|
for mon in b c d e;
|
|
do ceph-mon -i $mon --mkfs --monmap /tmp/monmap-20220903 --keyring /tmp/mon-key;
|
|
done
|
|
```
|
|
|
|
On rook-ceph-mon-b:
|
|
|
|
```
|
|
mon=a
|
|
ceph-mon -i $mon --mkfs --monmap /tmp/monmap-20220903 --keyring /tmp/mon-key
|
|
```
|
|
|
|
Then we export the newly created mon dbs:
|
|
|
|
```
|
|
for mon in b c d e;
|
|
do kubectl cp rook-ceph/rook-ceph-mon-a-6c46d4694-kxm5h:/var/lib/ceph/mon/ceph-$mon ceph-$mon;
|
|
done
|
|
```
|
|
|
|
```
|
|
for mon in a;
|
|
do kubectl cp rook-ceph/rook-ceph-mon-b-57d888dd9f-w8jkh:/var/lib/ceph/mon/ceph-$mon ceph-$mon;
|
|
done
|
|
```
|
|
|
|
And finally we test it by importing the mondb to mon-a:
|
|
|
|
```
|
|
kubectl cp ceph-a rook-ceph/rook-ceph-mon-a-6c46d4694-kxm5h:/var/lib/ceph/mon/
|
|
```
|
|
|
|
And the other mons:
|
|
|
|
```
|
|
kubectl cp ceph-b rook-ceph/rook-ceph-mon-b-57d888dd9f-w8jkh:/var/lib/ceph/mon/
|
|
|
|
```
|
|
|
|
### Re-enabling the rook-operator
|
|
|
|
As the deployment was only scaled down to 0 earlier, we simply scale
it back up:
|
|
|
|
```
|
|
kubectl -n rook-ceph scale --replicas=1 deploy/rook-ceph-operator
|
|
```
|
|
|
|
The operator sees the monitors as running (even though they are only
running a shell):
|
|
|
|
```
|
|
2022-09-03 22:29:26.725915 I | op-mon: mons running: [d e a b c]
|
|
|
|
```
|
|
|
|
Triggering recreation:
|
|
|
|
```
|
|
% kubectl -n rook-ceph delete deployment rook-ceph-mon-a
|
|
deployment.apps "rook-ceph-mon-a" deleted
|
|
```
|
|
|
|
Connected successfully to the cluster:
|
|
|
|
```
|
|
|
|
services:
|
|
mon: 6 daemons, quorum red1,red2,red3,server4,server3,a (age 8s)
|
|
mgr: red3(active, since 8h), standbys: red2, red1, server4
|
|
osd: 46 osds: 46 up, 46 in
|
|
|
|
```
|
|
|
|
A bit later:
|
|
|
|
```
|
|
mon: 8 daemons, quorum (age 2w), out of quorum: red1, red2, red3, server4, server3, a, c,
|
|
d
|
|
mgr: red3(active, since 8h), standbys: red2, red1, server4
|
|
osd: 46 osds: 46 up, 46 in
|
|
|
|
```
|
|
|
|
And a little bit later also the mgr joined the cluster:
|
|
|
|
```
|
|
services:
|
|
mon: 8 daemons, quorum red2,red3,server4,server3,a,c,d,e (age 46s)
|
|
mgr: red3(active, since 9h), standbys: red1, server4, a, red2
|
|
osd: 46 osds: 46 up, 46 in
|
|
|
|
```
|
|
|
|
And a few minutes later all mons joined successfully:
|
|
|
|
```
|
|
mon: 8 daemons, quorum red3,server4,server3,a,c,d,e,b (age 31s)
|
|
mgr: red3(active, since 105s), standbys: red1, server4, a, red2
|
|
osd: 46 osds: 46 up, 46 in
|
|
|
|
```
|
|
|
|
We also need to ensure the toolbox is being updated/recreated:
|
|
|
|
```
|
|
kubectl -n rook-ceph delete pods rook-ceph-tools-5cf88dd58f-fwwlc
|
|
```
|
|
|
|
|
|
### Original monitors vanish
|
|
|
|
We did not add BGP peering on the routers, so ceph cannot be reached
through the routers anymore. It also seems that rook removed the
original monitors from the cluster.

Updating the ceph.conf for the native nodes:
|
|
|
|
```
|
|
mon host = rook-ceph-mon-a.rook-ceph.svc..,
|
|
```
|
|
|
|
### Post monitor migration issue 1: OSDs start crashing
|
|
|
|
A day after the monitor migration some OSDs started to crash. Checking
the debug log, we found the following error:
|
|
|
|
```
|
|
2022-09-05 10:24:02.881 7fe005ce7700 -1 Processor -- bind unable to bind to v2:[2a0a:e5c0::225:b3ff:fe20:3554]:7300/3712937 on any port in range 6800-7300: (99) Cannot assign requested address
|
|
2022-09-05 10:24:02.881 7fe005ce7700 -1 Processor -- bind was unable to bind. Trying again in 5 seconds
|
|
2022-09-05 10:24:07.897 7fe005ce7700 -1 Processor -- bind unable to bind to v2:[2a0a:e5c0::225:b3ff:fe20:3554]:7300/3712937 on any port in range 6800-7300: (99) Cannot assign requested address
|
|
2022-09-05 10:24:07.897 7fe005ce7700 -1 Processor -- bind was unable to bind after 3 attempts: (99) Cannot assign requested address
|
|
2022-09-05 10:24:07.897 7fe0127b1700 -1 received signal: Interrupt from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
|
|
2022-09-05 10:24:07.897 7fe0127b1700 -1 osd.49 100709 *** Got signal Interrupt ***
|
|
2022-09-05 10:24:07.897 7fe0127b1700 -1 osd.49 100709 *** Immediate shutdown (osd_fast_shutdown=true) ***
|
|
```
|
|
|
|
The OSD is trying to bind to an IPv6 address that is **not** present
on the system (see https://tracker.ceph.com/issues/24602):
Calico/CNI does IP rewriting and thus tells the OSD the wrong IPv6
address.
|
|
|
|
Adding
|
|
|
|
```
|
|
public_addr = 2a0a:e5c0::92e2:baff:fe26:642c
|
|
```
|
|
|
|
to the ceph.conf of the affected node fixes the problem. Verifying the
binding after restarting the crashing OSD:
|
|
|
|
```
|
|
[10:35:06] server4.place5:/var/log/ceph# netstat -lnpW | grep 3717792
|
|
tcp6 0 0 2a0a:e5c0::92e2:baff:fe26:642c:6821 :::* LISTEN 3717792/ceph-osd
|
|
tcp6 0 0 :::6822 :::* LISTEN 3717792/ceph-osd
|
|
tcp6 0 0 :::6823 :::* LISTEN 3717792/ceph-osd
|
|
tcp6 0 0 2a0a:e5c0::92e2:baff:fe26:642c:6816 :::* LISTEN 3717792/ceph-osd
|
|
tcp6 0 0 2a0a:e5c0::92e2:baff:fe26:642c:6817 :::* LISTEN 3717792/ceph-osd
|
|
tcp6 0 0 :::6818 :::* LISTEN 3717792/ceph-osd
|
|
tcp6 0 0 :::6819 :::* LISTEN 3717792/ceph-osd
|
|
tcp6 0 0 2a0a:e5c0::92e2:baff:fe26:642c:6820 :::* LISTEN 3717792/ceph-osd
|
|
unix 2 [ ACC ] STREAM LISTENING 16880318 3717792/ceph-osd /var/run/ceph/ceph-osd.49.asok
|
|
```
|
|
|
|
### Post monitor migration issue 2: OSDs fail to parse the new monitor addresses
|
|
|
|
After roughly a week an OSD on the native cluster started to fail on
|
|
restart with the following error:
|
|
|
|
```
|
|
unable to parse addrs
|
|
in 'rook-ceph-mon-a.rook-ceph.svc.p5-cow.k8s.ooo,
|
|
rook-ceph-mon-b.rook-ceph.svc.p5-cow.k8s.ooo,
|
|
rook-ceph-mon-c.rook-ceph.svc.p5-cow.k8s.ooo,
|
|
rook-ceph-mon-d.rook-ceph.svc.p5-cow.k8s.ooo,
|
|
rook-ceph-mon-e.rook-ceph.svc.p5-cow.k8s.ooo'
|
|
```
|
|
|
|
Checking the cluster, it seems rook has replaced mon-a with mon-f:
|
|
|
|
```
|
|
[22:38] blind:~% kubectl -n rook-ceph get svc
|
|
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
|
|
csi-cephfsplugin-metrics ClusterIP 2a0a:e5c0:0:15::f2ac <none> 8080/TCP,8081/TCP 7d5h
|
|
csi-rbdplugin-metrics ClusterIP 2a0a:e5c0:0:15::5fc2 <none> 8080/TCP,8081/TCP 7d5h
|
|
rook-ceph-mgr ClusterIP 2a0a:e5c0:0:15::c31c <none> 9283/TCP 7d5h
|
|
rook-ceph-mon-b ClusterIP 2a0a:e5c0:0:15::9cd9 <none> 6789/TCP,3300/TCP 7d5h
|
|
rook-ceph-mon-c ClusterIP 2a0a:e5c0:0:15::fc2 <none> 6789/TCP,3300/TCP 7d5h
|
|
rook-ceph-mon-d ClusterIP 2a0a:e5c0:0:15::b029 <none> 6789/TCP,3300/TCP 7d5h
|
|
rook-ceph-mon-e ClusterIP 2a0a:e5c0:0:15::8c86 <none> 6789/TCP,3300/TCP 7d5h
|
|
rook-ceph-mon-f ClusterIP 2a0a:e5c0:0:15::2833 <none> 6789/TCP,3300/TCP 3d13h
|
|
```
|
|
|
|
At this moment it is unclear why rook does this, but if the native hosts
had already been migrated, it would probably not have caused an
issue. However, as long as ceph.conf files are deployed with static
references to the monitors, this problem might repeat.
|
|
|
|
|
|
## Changelog
|
|
|
|
### 2022-09-10
|
|
|
|
* Added missing monitor description
|
|
|
|
|
|
### 2022-09-03
|
|
|
|
* Next try starting for migration
|
|
* Looking deeper into configurations
|
|
|
|
### 2022-08-29
|
|
|
|
* Added kubernetes/kubeadm bootstrap issue
|
|
|
|
### 2022-08-27
|
|
|
|
* The initial release of this blog article
|
|
* Added k8s bootstrapping guide
|
|
|
|
## Follow up or questions
|
|
|
|
You can join the discussion in the matrix room `#kubernetes:ungleich.ch`
about this migration. If you don't have a matrix
account, you can join using our chat on https://chat.with.ungleich.ch.
|