title: [WIP] Migrating Ceph Nautilus into Kubernetes + Rook
---
pub_date: 2022-08-27
---
author: ungleich storage team
---
twitter_handle: ungleich
---
_hidden: no
---
_discoverable: yes
---
abstract:
How we move our Ceph clusters into kubernetes
---
body:
## Introduction
At ungleich we are running multiple Ceph clusters. Some of them are
running Ceph Nautilus (14.x) based on
[Devuan](https://www.devuan.org/). Our newer Ceph Pacific (16.x)
clusters are running based on [Rook](https://rook.io/) on
[Kubernetes](https://kubernetes.io/) on top of
[Alpine Linux](https://alpinelinux.org/).
In this blog article we will describe how to migrate
Ceph/Native/Devuan to Ceph/k8s+rook/Alpine Linux.
## Work in Progress [WIP]
This blog article is a work in progress. The migration planning has
started, but the migration has not been finished yet. This article
describes the different paths we take for the migration.
## The Plan
To continue operating the cluster during the migration, the following
steps are planned:
* Set up a k8s cluster that can potentially communicate with the
existing ceph cluster
* Use the [disaster
recovery](https://rook.io/docs/rook/v1.9/Troubleshooting/disaster-recovery/)
guidelines from rook to modify the rook configuration to use the
previous fsid
* Spin up ceph monitors and ceph managers in rook
* Retire the existing monitors
* Shut down a ceph OSD node, remove its OS disk, boot it with Alpine
Linux
* Join the node into the k8s cluster
* Have rook pick up the existing disks and start the OSDs
* Repeat if successful
* Migrate to ceph pacific
### Original cluster
The target ceph cluster we want to migrate lives in the 2a0a:e5c0::/64
network. Ceph is using:
```
public network = 2a0a:e5c0:0:0::/64
cluster network = 2a0a:e5c0:0:0::/64
```
### Kubernetes cluster networking inside the ceph network
To be able to communicate with the existing OSDs, we will be using
subnetworks of 2a0a:e5c0::/64 for kubernetes. As these networks
are part of the link-assigned network 2a0a:e5c0::/64, we will use BGP
routing on the existing ceph nodes to create more specific routes into
the kubernetes cluster.
As we plan to use either [cilium](https://cilium.io/) or
[calico](https://www.tigera.io/project-calico/) as the CNI, we can
configure kubernetes to directly BGP peer with the existing Ceph
nodes.
## The setup
### Kubernetes Bootstrap
As usual we bootstrap 3 control plane nodes using kubeadm. The proxy
for the API resides in a different kubernetes cluster.
We run
```
kubeadm init --config kubeadm.yaml
```
on the first node and join the other two control plane nodes. As
usual, the workers are joined last.
### k8s Networking / CNI
For this setup we are using calico as described in the
[ungleich kubernetes
manual](https://redmine.ungleich.ch/projects/open-infrastructure/wiki/The_ungleich_kubernetes_infrastructure#section-23).
```
VERSION=v3.23.3
helm repo add projectcalico https://docs.projectcalico.org/charts
helm upgrade --install --namespace tigera calico projectcalico/tigera-operator --version $VERSION --create-namespace
```
### BGP Networking on the old nodes
To be able to import the BGP routes from Kubernetes, all old / native
hosts will run bird. The installation and configuration are as follows:
```
apt-get update
apt-get install -y bird2

router_id=$(hostname | sed 's/server//')

cat > /etc/bird/bird.conf <<EOF
router id $router_id;
log syslog all;

protocol device {
}

# We are only interested in IPv6, skip another section for IPv4
protocol kernel {
    ipv6 { export all; };
}

protocol bgp k8s {
    local as 65530;
    neighbor range 2a0a:e5c0::/64 as 65533;
    dynamic name "k8s_";
    direct;

    ipv6 {
        import filter { if net.len > 64 then accept; else reject; };
        export none;
    };
}
EOF

/etc/init.d/bird restart
```
The router id must be adjusted for every host. As all hosts have a
unique number, we use that number as the router id.
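The derivation of the router id can be illustrated in isolation, using
a sample hostname instead of the real `hostname` output:

```shell
# Illustration of the router id derivation above; "server57" is a
# sample hostname, the real script calls `hostname` instead.
hostname=server57
router_id=$(echo "$hostname" | sed 's/server//')
echo "$router_id"
```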
The bird configuration uses dynamic peers so that any k8s
node in the network can peer with the old servers.
We also use an import filter that only accepts routes more specific
than /64, as the /64 route itself would overlap with the on-link route.
### BGP networking in Kubernetes
Calico supports BGP peering and we use a rather standard calico
configuration:
```
apiVersion: projectcalico.org/v3
kind: BGPConfiguration
metadata:
  name: default
spec:
  logSeverityScreen: Info
  nodeToNodeMeshEnabled: true
  asNumber: 65533
  serviceClusterIPs:
  - cidr: 2a0a:e5c0:0:aaaa::/108
  serviceExternalIPs:
  - cidr: 2a0a:e5c0:0:aaaa::/108
```
Plus for each server and router we create a BGPPeer:
```
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: serverXX
spec:
  peerIP: 2a0a:e5c0::XX
  asNumber: 65530
  keepOriginalNextHop: true
```
We apply the whole configuration using calicoctl:
```
./calicoctl create -f - < ~/vcs/k8s-config/bootstrap/p5-cow/calico-bgp.yaml
```
And a few seconds later we can observe the routes on the old / native
hosts:
```
bird> show protocols
Name       Proto      Table      State   Since         Info
device1    Device     ---        up      23:09:01.393
kernel1    Kernel     master6    up      23:09:01.393
k8s        BGP        ---        start   23:09:01.393  Passive
k8s_1      BGP        ---        up      23:33:01.215  Established
k8s_2      BGP        ---        up      23:33:01.215  Established
k8s_3      BGP        ---        up      23:33:01.420  Established
k8s_4      BGP        ---        up      23:33:01.215  Established
k8s_5      BGP        ---        up      23:33:01.215  Established
```
### Testing networking
To verify that the new cluster is working properly, we can deploy a
tiny test deployment and see if it is globally reachable:
```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 2
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.20.0-alpine
        ports:
        - containerPort: 80
```
And the corresponding service:
```
apiVersion: v1
kind: Service
metadata:
  name: nginx-service
spec:
  selector:
    app: nginx
  ports:
  - protocol: TCP
    port: 80
```
Using curl to access a sample service from the outside shows that
networking is working:
```
% curl -v http://[2a0a:e5c0:0:aaaa::e3c9]
* Trying 2a0a:e5c0:0:aaaa::e3c9:80...
* Connected to 2a0a:e5c0:0:aaaa::e3c9 (2a0a:e5c0:0:aaaa::e3c9) port 80 (#0)
> GET / HTTP/1.1
> Host: [2a0a:e5c0:0:aaaa::e3c9]
> User-Agent: curl/7.84.0
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Server: nginx/1.20.0
< Date: Sat, 27 Aug 2022 22:35:49 GMT
< Content-Type: text/html
< Content-Length: 612
< Last-Modified: Tue, 20 Apr 2021 16:11:05 GMT
< Connection: keep-alive
< ETag: "607efd19-264"
< Accept-Ranges: bytes
<
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
<style>
body {
width: 35em;
margin: 0 auto;
font-family: Tahoma, Verdana, Arial, sans-serif;
}
</style>
</head>
<body>
<h1>Welcome to nginx!</h1>
<p>If you see this page, the nginx web server is successfully installed and
working. Further configuration is required.</p>
<p>For online documentation and support please refer to
<a href="http://nginx.org/">nginx.org</a>.<br/>
Commercial support is available at
<a href="http://nginx.com/">nginx.com</a>.</p>
<p><em>Thank you for using nginx.</em></p>
</body>
</html>
* Connection #0 to host 2a0a:e5c0:0:aaaa::e3c9 left intact
```
So far we have found one issue:
* Sometimes the old/native servers can reach the service, sometimes
they get a timeout

Old calico issues on github mention that overlapping pod CIDR
networks might be a problem. Additionally, we cannot use
kubeadm to initialise the pod subnet as a proper subnet of the node
subnet:
```
[00:15] server57.place5:~# kubeadm init --service-cidr 2a0a:e5c0:0:cccc::/108 --pod-network-cidr 2a0a:e5c0::/100
I0829 00:16:38.659341 19400 version.go:255] remote version is much newer: v1.25.0; falling back to: stable-1.24
podSubnet: Invalid value: "2a0a:e5c0::/100": the size of pod subnet with mask 100 is smaller than the size of node subnet with mask 64
To see the stack trace of this error execute with --v=5 or higher
[00:16] server57.place5:~#
```
### Networking 2022-09-03
* Instead of trying to merge the cluster networks, we will use
separate ranges
* According to the [ceph users mailing list
discussion](https://www.spinics.net/lists/ceph-users/msg73421.html)
it is actually not necessary for mons/osds to be in the same
network. In fact, we might be able to remove these settings
completely.
So today we start with
* podSubnet: 2a0a:e5c0:0:14::/64
* serviceSubnet: 2a0a:e5c0:0:15::/108
Using BGP and calico, the kubernetes cluster is set up "as usual" (in
ungleich terms).
### Ceph.conf change
Originally our ceph.conf contained:
```
public network = 2a0a:e5c0:0:0::/64
cluster network = 2a0a:e5c0:0:0::/64
```
As of today they are removed and all daemons are restarted, allowing
the native cluster to speak with the kubernetes cluster.
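The change itself is a two-line removal. A dry run on a sample file
(the fsid below is a placeholder; the real file lives in
/etc/ceph/ceph.conf and every daemon is restarted afterwards) shows
the intended result:

```shell
# Dry run of the ceph.conf edit on a sample copy; the fsid is a
# placeholder value.
cat > /tmp/ceph.conf.sample <<'EOF'
[global]
fsid = 00000000-0000-0000-0000-000000000000
public network = 2a0a:e5c0:0:0::/64
cluster network = 2a0a:e5c0:0:0::/64
EOF

# Drop the two network pins, keep everything else.
sed -i '/^public network/d; /^cluster network/d' /tmp/ceph.conf.sample
cat /tmp/ceph.conf.sample
```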
### Setting up rook
Usually we deploy rook via argocd. However, as we want to be able to
intervene manually easily, we will first bootstrap rook directly via
helm and turn off various services:
```
helm repo add rook https://charts.rook.io/release
helm repo update
```
We will use rook 1.8, as it is the last version to support Ceph
nautilus, which is our current ceph version. The latest 1.8 version is
1.8.10 at the moment.
```
helm upgrade --install --namespace rook-ceph --create-namespace --version v1.8.10 rook-ceph rook/rook-ceph
```
### Joining the 2 clusters, step 1: monitors and managers
In the first step we want to add rook based monitors and managers
and replace the native ones. For rook to be able to talk to our
existing cluster, it needs to know
* the current monitors/managers ("the monmap")
* the right keys to talk to the existing cluster
* the fsid
As we are using v1.8, we will follow
[the guidelines for disaster recovery of rook
1.8](https://www.rook.io/docs/rook/v1.8/ceph-disaster-recovery.html).
Later we will need to create all the configurations so that rook knows
about the different pools.
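On the native side, the three pieces listed above can be collected
with standard ceph tooling, run on a host that has a working admin
keyring (output intentionally not reproduced here):

```shell
# Collect what rook needs to know from the native cluster.
ceph fsid                                 # the cluster fsid
ceph mon dump                             # the current monitors / monmap epoch
cat /etc/ceph/ceph.client.admin.keyring   # the admin key
```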
### Rook: CephCluster
Rook has a configuration of type `CephCluster` that typically looks
something like this:
```
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    # see the "Cluster Settings" section below for more details on which image of ceph to run
    image: quay.io/ceph/ceph:{{ .Chart.AppVersion }}
  dataDirHostPath: /var/lib/rook
  mon:
    count: 5
    allowMultiplePerNode: false
  storage:
    useAllNodes: true
    useAllDevices: true
    onlyApplyOSDPlacement: false
  mgr:
    count: 1
    modules:
    - name: pg_autoscaler
      enabled: true
  network:
    ipFamily: "IPv6"
    dualStack: false
  crashCollector:
    disable: false
    # Uncomment daysToRetain to prune ceph crash entries older than the
    # specified number of days.
    daysToRetain: 30
```
For migrating, we don't want rook in the first stage to create any
OSDs. So we will replace `useAllNodes: true` with `useAllNodes: false`
and `useAllDevices: true` also with `useAllDevices: false`.
### Extracting a monmap
To get access to the existing monmap, we can export it from the native
cluster using `ceph-mon -i {mon-id} --extract-monmap {map-path}`.
More details can be found on the [documentation for adding and
removing ceph
monitors](https://docs.ceph.com/en/latest/rados/operations/add-or-rm-mons/).
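A sketch of the extraction workflow (the mon id "server1" and the path
/tmp/monmap are example values; the monitor must be stopped while its
map is extracted, and how the mon is managed varies per host):

```shell
# Stop one monitor, extract its monmap and inspect it.
/etc/init.d/ceph stop mon.server1        # or however the mon is managed
ceph-mon -i server1 --extract-monmap /tmp/monmap
monmaptool --print /tmp/monmap           # shows fsid + monitor addresses
/etc/init.d/ceph start mon.server1
```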
### Rook and Ceph pools
Rook uses `CephBlockPool` to describe ceph pools as follows:
```
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: hdd
  namespace: rook-ceph
spec:
  failureDomain: host
  replicated:
    size: 3
  deviceClass: hdd
```
In this particular cluster we have 2 pools:
- one ssd based pool (device class = ssd)
- one hdd based pool (device class = hdd-big)

The device class "hdd-big" is specific to this cluster, as it used to
contain 2.5" and 3.5" HDDs in different pools.
### [old] Analysing the ceph cluster configuration
Taking the view from the old cluster, the following items are
important for adding new services/nodes:
* We have a specific fsid that needs to be known
* The expectation would be to find that fsid in a configmap/secret in rook
* We have a list of running monitors
* This is part of the monmap and ceph.conf
* ceph.conf is used for finding the initial contact point
* Afterwards the information is provided by the monitors
* For rook it would be expected to have a configmap/secret listing
the current monitors
* The native clusters have a "ceph.client.admin.keyring" deployed which
allows adding and removing resources.
* Rook probably has a secret for keyrings
* Maybe multiple depending on how services are organised
### Analysing the rook configurations
Taking the opposite view, we can also check out how a running rook
cluster stores this information.
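In practice, the items listed above map to the kubernetes objects
named in rook's disaster recovery docs (verify the names against your
rook version):

```shell
# Where a running rook cluster keeps its cluster identity; the object
# names below are taken from rook's disaster recovery documentation.
kubectl -n rook-ceph get secret rook-ceph-mon -o yaml               # fsid + keys
kubectl -n rook-ceph get configmap rook-ceph-mon-endpoints -o yaml  # mon list
```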
### Configuring ceph after the operator deployment
As soon as the operator and the crds have been deployed, we deploy the
following
[CephCluster](https://rook.io/docs/rook/v1.8/ceph-cluster-crd.html)
configuration:
```
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    image: quay.io/ceph/ceph:v14.2.21
  dataDirHostPath: /var/lib/rook
  mon:
    count: 5
    allowMultiplePerNode: false
  storage:
    useAllNodes: false
    useAllDevices: false
    onlyApplyOSDPlacement: false
  mgr:
    count: 1
    modules:
    - name: pg_autoscaler
      enabled: true
  network:
    ipFamily: "IPv6"
    dualStack: false
  crashCollector:
    disable: false
    # Uncomment daysToRetain to prune ceph crash entries older than the
    # specified number of days.
    daysToRetain: 30
```
We wait for the cluster to initialise and stabilise before applying
changes. It is important to note that we use the ceph image version
v14.2.21, the same version as the native cluster.
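A few read-only checks can confirm that the operator has reconciled
and that the versions line up (these run against the live cluster, so
no output is shown here):

```shell
# Read-only sanity checks after deploying the CephCluster.
kubectl -n rook-ceph get cephcluster     # PHASE should become Ready
kubectl -n rook-ceph get pods            # mon/mgr pods running
ceph versions                            # every daemon should report 14.2.21
```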
## Changelog
### 2022-09-03
* Next try starting for migration
* Looking deeper into configurations
### 2022-08-29
* Added kubernetes/kubeadm bootstrap issue
### 2022-08-27
* The initial release of this blog article
* Added k8s bootstrapping guide
## Follow up or questions
You can join the discussion in the matrix room `#kubernetes:ungleich.ch`
about this migration. If you don't have a matrix
account, you can join using our chat on https://chat.with.ungleich.ch.