title: [WIP] Migrating Ceph Nautilus into Kubernetes + Rook
---
pub_date: 2022-08-27
---
author: ungleich storage team
---
twitter_handle: ungleich
---
_hidden: no
---
_discoverable: yes
---
abstract:
How we move our Ceph clusters into Kubernetes
---
body:

## Introduction

At ungleich we are running multiple Ceph clusters. Some of them are
running Ceph Nautilus (14.x) based on
[Devuan](https://www.devuan.org/). Our newer Ceph Pacific (16.x)
clusters are running based on [Rook](https://rook.io/) on
[Kubernetes](https://kubernetes.io/) on top of
[Alpine Linux](https://alpinelinux.org/).

In this blog article we will describe how to migrate from
Ceph/Native/Devuan to Ceph/k8s+rook/Alpine Linux.

## Work in Progress [WIP]

This blog article is a work in progress. The migration planning has
started, but the migration itself has not been finished yet. This
article will describe the different paths we take for the migration.

## The Plan

To continue operating the cluster during the migration, the following
steps are planned:

* Set up a k8s cluster that can potentially communicate with the
  existing ceph cluster
* Use the [disaster
  recovery](https://rook.io/docs/rook/v1.9/Troubleshooting/disaster-recovery/)
  guidelines from rook to modify the rook configuration to use the
  previous fsid
* Spin up ceph monitors and ceph managers in rook
* Retire the existing monitors
* Shut down a ceph OSD node, remove its OS disk, boot it with Alpine
  Linux
* Join the node into the k8s cluster
* Have rook pick up the existing disks and start the OSDs
* Repeat if successful
* Migrate to ceph pacific

### Original cluster

The target ceph cluster we want to migrate lives in the 2a0a:e5c0::/64
network. Ceph is using:

```
public network = 2a0a:e5c0:0:0::/64
cluster network = 2a0a:e5c0:0:0::/64
```

### Kubernetes cluster networking inside the ceph network

To be able to communicate with the existing OSDs, we will be using
subnetworks of 2a0a:e5c0::/64 for kubernetes. As these networks are
part of the link-assigned network 2a0a:e5c0::/64, we will use BGP
routing on the existing ceph nodes to create more specific routes into
the kubernetes cluster.

As we plan to use either [cilium](https://cilium.io/) or
[calico](https://www.tigera.io/project-calico/) as the CNI, we can
configure kubernetes to directly BGP peer with the existing Ceph
nodes.

## The setup

### Kubernetes Bootstrap

As usual we bootstrap 3 control plane nodes using kubeadm. The proxy
for the API resides in a different kubernetes cluster.

We run

```
kubeadm init --config kubeadm.yaml
```

on the first node and then join the other two control plane nodes. As
usual, the workers are joined last.

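Joining the additional control plane nodes follows the standard
kubeadm pattern. The token, discovery hash and certificate key are
cluster specific; the values below are placeholders, not the ones used
in this setup:

```
# On the first control plane node: print a join command and upload the
# certificates needed by additional control plane nodes.
kubeadm token create --print-join-command
kubeadm init phase upload-certs --upload-certs

# On the other two control plane nodes (placeholders, adjust to your cluster):
kubeadm join [API-ENDPOINT]:6443 --token <token> \
    --discovery-token-ca-cert-hash sha256:<hash> \
    --control-plane --certificate-key <certificate-key>
```
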
### k8s Networking / CNI

For this setup we are using calico as described in the
[ungleich kubernetes
manual](https://redmine.ungleich.ch/projects/open-infrastructure/wiki/The_ungleich_kubernetes_infrastructure#section-23).

```
VERSION=v3.23.3
helm repo add projectcalico https://docs.projectcalico.org/charts
helm upgrade --install --namespace tigera calico projectcalico/tigera-operator --version $VERSION --create-namespace
```

### BGP Networking on the old nodes

To be able to import the BGP routes from Kubernetes, all old / native
hosts will run bird. The installation and configuration is as follows:

```
apt-get update
apt-get install -y bird2

router_id=$(hostname | sed 's/server//')

cat > /etc/bird/bird.conf <<EOF
router id $router_id;

log syslog all;

protocol device {
}

# We are only interested in IPv6, skip another section for IPv4
protocol kernel {
  ipv6 { export all; };
}

protocol bgp k8s {
  local as 65530;
  neighbor range 2a0a:e5c0::/64 as 65533;
  dynamic name "k8s_"; direct;

  ipv6 {
    import filter { if net.len > 64 then accept; else reject; };
    export none;
  };
}
EOF

/etc/init.d/bird restart
```

The router id must be adjusted for every host. As all hosts have a
unique number, we use that number as the router id.

The bird configuration uses dynamic peers so that any k8s node in the
network can peer with the old servers.

We also use an import filter that only accepts routes more specific
than /64, as the /64 route itself would overlap with the on-link
route.

### BGP networking in Kubernetes

Calico supports BGP peering and we use a rather standard calico
configuration:

```
apiVersion: projectcalico.org/v3
kind: BGPConfiguration
metadata:
  name: default
spec:
  logSeverityScreen: Info
  nodeToNodeMeshEnabled: true
  asNumber: 65533
  serviceClusterIPs:
  - cidr: 2a0a:e5c0:0:aaaa::/108
  serviceExternalIPs:
  - cidr: 2a0a:e5c0:0:aaaa::/108
```

Plus for each server and router we create a BGPPeer:

```
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: serverXX
spec:
  peerIP: 2a0a:e5c0::XX
  asNumber: 65530
  keepOriginalNextHop: true
```

We apply the whole configuration using calicoctl:

```
./calicoctl create -f - < ~/vcs/k8s-config/bootstrap/p5-cow/calico-bgp.yaml
```

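To double check that calico accepted the BGP configuration and the
peers, they can be listed again with calicoctl (a quick verification
sketch using the standard calicoctl resource names):

```
./calicoctl get bgpconfiguration default -o yaml
./calicoctl get bgppeers
```
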
And a few seconds later we can observe the routes on the old / native
hosts:

```
bird> show protocols
Name       Proto      Table      State  Since         Info
device1    Device     ---        up     23:09:01.393
kernel1    Kernel     master6    up     23:09:01.393
k8s        BGP        ---        start  23:09:01.393  Passive
k8s_1      BGP        ---        up     23:33:01.215  Established
k8s_2      BGP        ---        up     23:33:01.215  Established
k8s_3      BGP        ---        up     23:33:01.420  Established
k8s_4      BGP        ---        up     23:33:01.215  Established
k8s_5      BGP        ---        up     23:33:01.215  Established
```

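To see the actual more specific routes that were imported from a k8s
peer, bird can also show the routing table per protocol. A quick check
might look like this (the dynamic protocol names depend on the order
in which the peers connected):

```
birdc show protocols
birdc show route protocol k8s_1
```
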
### Testing networking

To verify that the new cluster is working properly, we can deploy a
tiny test deployment and see if it is globally reachable:

```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 2
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.20.0-alpine
        ports:
        - containerPort: 80
```

And the corresponding service:

```
apiVersion: v1
kind: Service
metadata:
  name: nginx-service
spec:
  selector:
    app: nginx
  ports:
  - protocol: TCP
    port: 80
```

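The IPv6 ClusterIP assigned to the service comes out of the
serviceClusterIPs range announced via BGP above and can be looked up
with kubectl:

```
kubectl get service nginx-service
```
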
Using curl to access the service from the outside shows that
networking is working:

```
% curl -v http://[2a0a:e5c0:0:aaaa::e3c9]
*   Trying 2a0a:e5c0:0:aaaa::e3c9:80...
* Connected to 2a0a:e5c0:0:aaaa::e3c9 (2a0a:e5c0:0:aaaa::e3c9) port 80 (#0)
> GET / HTTP/1.1
> Host: [2a0a:e5c0:0:aaaa::e3c9]
> User-Agent: curl/7.84.0
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Server: nginx/1.20.0
< Date: Sat, 27 Aug 2022 22:35:49 GMT
< Content-Type: text/html
< Content-Length: 612
< Last-Modified: Tue, 20 Apr 2021 16:11:05 GMT
< Connection: keep-alive
< ETag: "607efd19-264"
< Accept-Ranges: bytes
<
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
<style>
    body {
        width: 35em;
        margin: 0 auto;
        font-family: Tahoma, Verdana, Arial, sans-serif;
    }
</style>
</head>
<body>
<h1>Welcome to nginx!</h1>
<p>If you see this page, the nginx web server is successfully installed and
working. Further configuration is required.</p>

<p>For online documentation and support please refer to
<a href="http://nginx.org/">nginx.org</a>.<br/>
Commercial support is available at
<a href="http://nginx.com/">nginx.com</a>.</p>

<p><em>Thank you for using nginx.</em></p>
</body>
</html>
* Connection #0 to host 2a0a:e5c0:0:aaaa::e3c9 left intact
```

So far we have found one issue:

* Sometimes the old/native servers can reach the service, sometimes
  they get a timeout

Old calico notes on GitHub mention that overlapping Pod/CIDR networks
might be a problem. Additionally, we cannot use kubeadm to initialise
the pod subnet as a proper subnet of the node subnet:

```
[00:15] server57.place5:~# kubeadm init --service-cidr 2a0a:e5c0:0:cccc::/108 --pod-network-cidr 2a0a:e5c0::/100
I0829 00:16:38.659341 19400 version.go:255] remote version is much newer: v1.25.0; falling back to: stable-1.24
podSubnet: Invalid value: "2a0a:e5c0::/100": the size of pod subnet with mask 100 is smaller than the size of node subnet with mask 64
To see the stack trace of this error execute with --v=5 or higher
[00:16] server57.place5:~#
```

### Networking 2022-09-03

* Instead of trying to merge the cluster networks, we will use
  separate ranges
* According to a [ceph users mailing list
  discussion](https://www.spinics.net/lists/ceph-users/msg73421.html)
  it is actually not necessary for mons/osds to be in the same
  network. In fact, we might be able to remove these settings
  completely.

So today we start with

* podSubnet: 2a0a:e5c0:0:14::/64
* serviceSubnet: 2a0a:e5c0:0:15::/108

Using BGP and calico, the kubernetes cluster is set up "as usual" (in
ungleich terms).

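A minimal sketch of what the corresponding networking section of
kubeadm.yaml could look like, assuming the standard kubeadm v1beta3
ClusterConfiguration (the exact file used in this setup is not shown
here):

```
# kubeadm.yaml (sketch; only the networking-relevant part)
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
networking:
  podSubnet: 2a0a:e5c0:0:14::/64
  serviceSubnet: 2a0a:e5c0:0:15::/108
```
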
### Ceph.conf change

Originally our ceph.conf contained:

```
public network = 2a0a:e5c0:0:0::/64
cluster network = 2a0a:e5c0:0:0::/64
```

As of today these settings are removed and all daemons have been
restarted, allowing the native cluster to speak with the kubernetes
cluster.

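To confirm that a restarted daemon no longer carries the old values,
its effective configuration can be inspected via the admin socket (a
sketch; mon.server1 is a placeholder for one of the existing monitor
IDs):

```
ceph daemon mon.server1 config show | grep -E 'public_network|cluster_network'
```
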
### Setting up rook

Usually we deploy rook via argocd. However, as we want to be able to
easily intervene manually, we will first bootstrap rook via helm
directly and turn off various services:

```
helm repo add rook https://charts.rook.io/release
helm repo update
```

We will use rook 1.8, as it is the last version to support Ceph
nautilus, which is our current ceph version. The latest 1.8 release is
1.8.10 at the moment.

```
helm upgrade --install --namespace rook-ceph --create-namespace --version v1.8.10 rook-ceph rook/rook-ceph
```

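Before continuing it is worth checking that the rook operator pod
actually came up; something along these lines:

```
kubectl -n rook-ceph get pods
```
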
### Joining the 2 clusters, step 1: monitors and managers

In the first step we want to add rook-based monitors and managers and
replace the native ones. For rook to be able to talk to our existing
cluster, it needs to know

* the current monitors/managers ("the monmap")
* the right keys to talk to the existing cluster
* the fsid

As we are using v1.8, we will follow
[the guidelines for disaster recovery of rook
1.8](https://www.rook.io/docs/rook/v1.8/ceph-disaster-recovery.html).

Later we will need to create all the configurations so that rook knows
about the different pools.

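A sketch of how the fsid and the relevant keys listed above can be
read from the native cluster with standard ceph commands (what exactly
has to be injected into rook is described in the disaster recovery
guide linked above):

```
# On one of the native nodes:
ceph fsid                       # the cluster fsid rook has to reuse
ceph auth get client.admin      # admin key
ceph auth get mon.              # key used by the monitors
ceph mon dump                   # overview of the current monitors
```
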
### Rook: CephCluster

Rook has a configuration of type `CephCluster` that typically looks
something like this:

```
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    # see the "Cluster Settings" section below for more details on which image of ceph to run
    image: quay.io/ceph/ceph:{{ .Chart.AppVersion }}
  dataDirHostPath: /var/lib/rook
  mon:
    count: 5
    allowMultiplePerNode: false
  storage:
    useAllNodes: true
    useAllDevices: true
    onlyApplyOSDPlacement: false
  mgr:
    count: 1
    modules:
    - name: pg_autoscaler
      enabled: true
  network:
    ipFamily: "IPv6"
    dualStack: false
  crashCollector:
    disable: false
    # daysToRetain prunes ceph crash entries older than the
    # specified number of days.
    daysToRetain: 30
```

For migrating, we don't want rook to create any OSDs in the first
stage. So we will replace `useAllNodes: true` with `useAllNodes: false`
and `useAllDevices: true` with `useAllDevices: false`.

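The relevant part of the modified CephCluster spec for this first
stage would then look like this (only the storage section shown):

```
  storage:
    useAllNodes: false
    useAllDevices: false
    onlyApplyOSDPlacement: false
```
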
### Extracting a monmap

To get access to the existing monmap, we can export it from the native
cluster using `ceph-mon -i {mon-id} --extract-monmap {map-path}`.
More details can be found in the [documentation for adding and
removing ceph
monitors](https://docs.ceph.com/en/latest/rados/operations/add-or-rm-mons/).

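As a concrete sketch (the mon id and path are placeholders; the
monitor has to be stopped while its store is being read):

```
# on a native monitor host, with the monitor stopped
ceph-mon -i server1 --extract-monmap /tmp/monmap

# inspect the extracted monmap
monmaptool --print /tmp/monmap
```
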
### Rook and Ceph pools

Rook uses `CephBlockPool` to describe ceph pools as follows:

```
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: hdd
  namespace: rook-ceph
spec:
  failureDomain: host
  replicated:
    size: 3
  deviceClass: hdd
```

In this particular cluster we have 2 pools:

- one (ssd based, device class = ssd)
- hdd (hdd based, device class = hdd-big)

The device class "hdd-big" is specific to this cluster, as it used to
contain 2.5" and 3.5" HDDs in different pools.

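Following that pattern, the two pools of this cluster would presumably
be described along these lines (a sketch; the names and device classes
are taken from the list above, the replication size is assumed to
match the example):

```
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: one
  namespace: rook-ceph
spec:
  failureDomain: host
  replicated:
    size: 3
  deviceClass: ssd
---
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: hdd
  namespace: rook-ceph
spec:
  failureDomain: host
  replicated:
    size: 3
  deviceClass: hdd-big
```
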
## Changelog

### 2022-09-03

* Started the next migration attempt

### 2022-08-29

* Added the kubernetes/kubeadm bootstrap issue

### 2022-08-27

* The initial release of this blog article
* Added the k8s bootstrapping guide

## Follow up or questions

You can join the discussion about this migration in the matrix room
`#kubernetes:ungleich.ch`. If you don't have a matrix account, you can
join using our chat on https://chat.with.ungleich.ch.