title: [WIP] Migrating Ceph Nautilus into Kubernetes + Rook
---
pub_date: 2022-08-27
---
author: ungleich storage team
---
twitter_handle: ungleich
---
_hidden: no
---
_discoverable: yes
---
abstract:

How we are migrating our Ceph clusters into Kubernetes
---
body:

## Introduction

At ungleich we run multiple Ceph clusters. Some of them run
Ceph Nautilus (14.x) on [Devuan](https://www.devuan.org/). Our newer
Ceph Pacific (16.x) clusters run on [Rook](https://rook.io/) on
[Kubernetes](https://kubernetes.io/) on top of
[Alpine Linux](https://alpinelinux.org/).

In this blog article we describe how we migrate from
Ceph/Native/Devuan to Ceph/k8s+rook/Alpine Linux.

## Work in Progress [WIP]

This blog article is a work in progress. The migration planning has
started, but the migration itself has not been finished yet. This
article will document the different paths we take during the
migration.

## The Plan

To continue operating the cluster during the migration, the following
steps are planned:

* Set up a k8s cluster that can potentially communicate with the
  existing Ceph cluster
* Use the [disaster
  recovery](https://rook.io/docs/rook/v1.9/Troubleshooting/disaster-recovery/)
  guidelines from Rook to modify the Rook configuration to use the
  previous fsid
* Spin up Ceph monitors and Ceph managers in Rook
* Retire the existing monitors
* Shut down a Ceph OSD node, remove its OS disk and boot it with
  Alpine Linux
* Join the node into the k8s cluster
* Have Rook pick up the existing disks and start the OSDs
* Repeat if successful
* Migrate to Ceph Pacific

### Original cluster

The Ceph cluster we want to migrate lives in the 2a0a:e5c0::/64
network. Ceph is configured with:

```
public network = 2a0a:e5c0:0:0::/64
cluster network = 2a0a:e5c0:0:0::/64
```

### Kubernetes cluster networking inside the ceph network

To be able to communicate with the existing OSDs, we will use
subnets of 2a0a:e5c0::/64 for Kubernetes. As these subnets are part
of the on-link network 2a0a:e5c0::/64, we will use BGP routing on the
existing Ceph nodes to create more specific routes into the
Kubernetes cluster.

As we plan to use either [cilium](https://cilium.io/) or
[calico](https://www.tigera.io/project-calico/) as the CNI, we can
configure Kubernetes to directly BGP peer with the existing Ceph
nodes.

## The setup

### Kubernetes Bootstrap

As usual we bootstrap 3 control plane nodes using kubeadm. The proxy
for the API resides in a different Kubernetes cluster.

We run

```
kubeadm init --config kubeadm.yaml
```

on the first node and then join the other two control plane nodes. As
usual, the workers are joined last.

### k8s Networking / CNI

For this setup we are using calico as described in the
[ungleich kubernetes
manual](https://redmine.ungleich.ch/projects/open-infrastructure/wiki/The_ungleich_kubernetes_infrastructure#section-23).

```
VERSION=v3.23.3
helm repo add projectcalico https://docs.projectcalico.org/charts
helm upgrade --install --namespace tigera calico projectcalico/tigera-operator --version $VERSION --create-namespace
```

### BGP Networking on the old nodes

To be able to import the BGP routes from Kubernetes, all old / native
hosts will run bird. The installation and configuration is as follows:

```
apt-get update
apt-get install -y bird2

router_id=$(hostname | sed 's/server//')

cat > /etc/bird/bird.conf <<EOF
router id $router_id;

log syslog all;

protocol device {
}

# We are only interested in IPv6, skip another section for IPv4
protocol kernel {
    ipv6 { export all; };
}

protocol bgp k8s {
    local as 65530;
    neighbor range 2a0a:e5c0::/64 as 65533;
    dynamic name "k8s_";
    direct;

    ipv6 {
        import filter { if net.len > 64 then accept; else reject; };
        export none;
    };
}
EOF

/etc/init.d/bird restart
```

The router id must be adjusted for every host. As every host has a
unique number, we use that number as the router id.
The bird configuration uses dynamic peers so that any k8s node in the
network can peer with the old servers.

We also use an import filter to reject /64 routes, as they would
overlap with the on-link route.

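The effect of that import filter can be sketched in plain shell. The
`accept_route` function below is a hypothetical helper for
illustration only; it is not part of bird and merely mimics the
`net.len > 64` test:

```shell
# Mimic the bird import filter: accept only routes more specific
# than the on-link /64 (prefix length strictly greater than 64).
# accept_route is a hypothetical helper, not a bird construct.
accept_route() {
    len=${1##*/}        # prefix length is everything after the slash
    [ "$len" -gt 64 ]
}

accept_route 2a0a:e5c0:0:aaaa::/108 && echo "accepted"   # more specific route
accept_route 2a0a:e5c0::/64         || echo "rejected"   # the on-link /64
```

The /64 itself is rejected because it would shadow the route that is
already present on the link.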
### BGP networking in Kubernetes

Calico supports BGP peering and we use a rather standard calico
configuration:

```
apiVersion: projectcalico.org/v3
kind: BGPConfiguration
metadata:
  name: default
spec:
  logSeverityScreen: Info
  nodeToNodeMeshEnabled: true
  asNumber: 65533
  serviceClusterIPs:
  - cidr: 2a0a:e5c0:0:aaaa::/108
  serviceExternalIPs:
  - cidr: 2a0a:e5c0:0:aaaa::/108
```

Additionally, for each server and router we create a BGPPeer:

```
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: serverXX
spec:
  peerIP: 2a0a:e5c0::XX
  asNumber: 65530
  keepOriginalNextHop: true
```

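Writing one BGPPeer manifest per host by hand gets tedious, so the
manifests can also be generated. A small sketch, assuming the old
servers are numbered; the numbers below are placeholders for
illustration, not our actual hosts:

```shell
# Generate one Calico BGPPeer manifest per old server into a single
# file that can later be fed to calicoctl. The server numbers are
# placeholders for illustration only.
out=calico-bgppeers.yaml
: > "$out"
for n in 11 12 13; do
    cat >> "$out" <<EOF
---
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: server$n
spec:
  peerIP: 2a0a:e5c0::$n
  asNumber: 65530
  keepOriginalNextHop: true
EOF
done
```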
We apply the whole configuration using calicoctl:

```
./calicoctl create -f - < ~/vcs/k8s-config/bootstrap/p5-cow/calico-bgp.yaml
```

A few seconds later we can observe the routes on the old / native
hosts:

```
bird> show protocols
Name       Proto      Table      State  Since         Info
device1    Device     ---        up     23:09:01.393
kernel1    Kernel     master6    up     23:09:01.393
k8s        BGP        ---        start  23:09:01.393  Passive
k8s_1      BGP        ---        up     23:33:01.215  Established
k8s_2      BGP        ---        up     23:33:01.215  Established
k8s_3      BGP        ---        up     23:33:01.420  Established
k8s_4      BGP        ---        up     23:33:01.215  Established
k8s_5      BGP        ---        up     23:33:01.215  Established
```

### Testing networking

To verify that the new cluster is working properly, we can deploy a
tiny test deployment and see if it is globally reachable:

```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 2
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.20.0-alpine
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: nginx-service
spec:
  selector:
    app: nginx
  ports:
  - protocol: TCP
    port: 80
```

Using curl to access the service from the outside shows that
networking is working:

```
% curl -v http://[2a0a:e5c0:0:aaaa::e3c9]
*   Trying 2a0a:e5c0:0:aaaa::e3c9:80...
* Connected to 2a0a:e5c0:0:aaaa::e3c9 (2a0a:e5c0:0:aaaa::e3c9) port 80 (#0)
> GET / HTTP/1.1
> Host: [2a0a:e5c0:0:aaaa::e3c9]
> User-Agent: curl/7.84.0
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Server: nginx/1.20.0
< Date: Sat, 27 Aug 2022 22:35:49 GMT
< Content-Type: text/html
< Content-Length: 612
< Last-Modified: Tue, 20 Apr 2021 16:11:05 GMT
< Connection: keep-alive
< ETag: "607efd19-264"
< Accept-Ranges: bytes
<
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
<style>
    body {
        width: 35em;
        margin: 0 auto;
        font-family: Tahoma, Verdana, Arial, sans-serif;
    }
</style>
</head>
<body>
<h1>Welcome to nginx!</h1>
<p>If you see this page, the nginx web server is successfully installed and
working. Further configuration is required.</p>

<p>For online documentation and support please refer to
<a href="http://nginx.org/">nginx.org</a>.<br/>
Commercial support is available at
<a href="http://nginx.com/">nginx.com</a>.</p>

<p><em>Thank you for using nginx.</em></p>
</body>
</html>
* Connection #0 to host 2a0a:e5c0:0:aaaa::e3c9 left intact
```

So far we have found one issue:

* Sometimes the old/native servers can reach the service, sometimes
  they get a timeout

Old calico issues on GitHub mention that overlapping Pod/CIDR
networks might be a problem. Additionally, we cannot use kubeadm to
initialise the pod subnet as a proper subnet of the node subnet:

```
[00:15] server57.place5:~# kubeadm init --service-cidr 2a0a:e5c0:0:cccc::/108 --pod-network-cidr 2a0a:e5c0::/100
I0829 00:16:38.659341   19400 version.go:255] remote version is much newer: v1.25.0; falling back to: stable-1.24
podSubnet: Invalid value: "2a0a:e5c0::/100": the size of pod subnet with mask 100 is smaller than the size of node subnet with mask 64
To see the stack trace of this error execute with --v=5 or higher
[00:16] server57.place5:~#
```

### Networking 2022-09-03

* Instead of trying to merge the cluster networks, we will use
  separate ranges
* According to the [ceph users mailing list
  discussion](https://www.spinics.net/lists/ceph-users/msg73421.html)
  it is actually not necessary for mons/osds to be in the same
  network. In fact, we might be able to remove these settings
  completely.

So today we start with

* podSubnet: 2a0a:e5c0:0:14::/64
* serviceSubnet: 2a0a:e5c0:0:15::/108

Using BGP and calico, the Kubernetes cluster is set up "as usual" (in
ungleich terms).

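For reference, the networking section of a kubeadm.yaml matching
these subnets could look like the minimal sketch below. This is an
assumed ClusterConfiguration fragment, not our actual file, which
contains more settings (for example the control plane endpoint):

```
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
networking:
  podSubnet: 2a0a:e5c0:0:14::/64
  serviceSubnet: 2a0a:e5c0:0:15::/108
```
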
## Changelog

### 2022-09-03

* Next try starting for the migration

### 2022-08-29

* Added kubernetes/kubeadm bootstrap issue

### 2022-08-27

* The initial release of this blog article
* Added k8s bootstrapping guide

## Follow up or questions

You can join the discussion about this migration in the matrix room
`#kubernetes:ungleich.ch`. If you don't have a matrix account, you
can join using our chat on https://chat.with.ungleich.ch.