[[!meta title="Building an IPv6 only kubernetes cluster"]]
## Introduction

For a few weeks I have been working on my pet project to create a
production-ready kubernetes cluster that runs in an IPv6 only
environment. As the complexity and challenges of this project are
rather interesting, I decided to start documenting them in this blog
post.

The
[ungleich-k8s](https://code.ungleich.ch/ungleich-public/ungleich-k8s)
repository contains all snippets and the latest code.
## Objective

The kubernetes cluster should support the following workloads:
* Matrix Chat instances (Synapse+postgres+nginx+element)
* Virtual Machines (via kubevirt)
* Provide storage to internal and external consumers using Ceph
## Components

The following is a list of the components that I am using so far. This
might change along the way, but I wanted to document what I have
selected and why.
### OS: Alpine Linux

The operating system of choice to run the k8s cluster is
[Alpine Linux](https://www.alpinelinux.org/), as it is small, stable
and supports both docker and cri-o.
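
For context, preparing a node on Alpine is mostly a matter of a few
packages and OpenRC services. A rough sketch; the package names are
assumptions based on the Alpine community/testing repositories and may
differ per release:

    # Container runtime and kubernetes tooling (package names may vary
    # between Alpine releases and repositories)
    apk add docker kubeadm kubelet kubectl

    # Start the services now and at boot (OpenRC)
    rc-update add docker default && rc-service docker start
    rc-update add kubelet default && rc-service kubelet start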
### Container management: docker

Originally I started with [cri-o](https://cri-o.io/). However, using
cri-o together with kubevirt and calico results in an overlayfs being
placed on / of the host, which breaks most of the host's functionality
(see below for details).

Docker, while being deprecated as a kubernetes container runtime, at
least allows me to get kubevirt running, generally speaking.
### Networking: IPv6 only, calico

I wanted to go with [cilium](https://cilium.io/) first, because it
goes down the eBPF route from the get go. However, cilium does not yet
offer native and automated BGP peering with the upstream
infrastructure, so managing node / IP network peering becomes a
tedious, manual and error prone task. Cilium is on the way to improve
this, but it is not there yet.

[Calico](https://www.projectcalico.org/) on the other hand still
relies on ip(6)tables and kube-proxy to forward traffic, but has had
proper BGP support for a long time. Calico also aims to add eBPF
support, however at the moment it does not support IPv6 yet (bummer!).
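
For reference, BGP peering in calico is driven by its BGPPeer and
BGPConfiguration resources. The following is a minimal, hedged sketch
of a global peer, not my literal manifests; the peer address and its AS
number are placeholders:

    # Make every node peer with the upstream router. The cluster's own
    # AS number and the announced IPv6 service CIDR live in the
    # BGPConfiguration resource (spec.asNumber, spec.serviceClusterIPs).
    cat <<'EOF' | calicoctl apply -f -
    apiVersion: projectcalico.org/v3
    kind: BGPPeer
    metadata:
      name: upstream-router
    spec:
      peerIP: 2a0a:e5c0:13::1
      asNumber: 65533
    EOF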
### Storage: rook

[Rook](https://rook.io/) seems to be the first choice if you look at
who is providing storage in the k8s world. It looks rather solid, even
though some knobs are not yet clear to me.

Rook, in my opinion, is a direct alternative to running cephadm, which
requires systemd on your hosts. Which, given Alpine Linux, will never
be the case.
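
Deploying rook itself boils down to applying its example manifests and
then the cluster definition. Roughly, assuming the layout of the rook
v1.6 examples (paths change between releases):

    # Operator, CRDs and the toolbox from the rook example manifests
    cd rook/cluster/examples/kubernetes/ceph
    kubectl apply -f crds.yaml -f common.yaml -f operator.yaml
    kubectl apply -f toolbox.yaml

    # The actual ceph cluster definition
    kubectl apply -f cluster.yaml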
### Virtualisation

[Kubevirt](https://kubevirt.io/) seems to provide a good
interface. Mid-term, kubevirt is projected to replace
[OpenNebula](https://opennebula.io/) at
[ungleich](https://ungleich.ch).
## Challenges

### cri-o + calico + kubevirt = broken host

So this is a rather funky one. If you deploy cri-o and calico,
everything works. If you then deploy kubevirt, the **virt-handler**
pod fails to come up with the error message

    Error: path "/var/run/kubevirt" is mounted on "/" but it is not a shared mount.

On the Internet there are two recommendations for fixing this:

* Fix the systemd unit for docker: obviously, as this setup uses
  neither systemd nor docker, this is not applicable...
* Issue **mount --make-shared /**

The second command has a very strange side effect: after issuing it,
the contents of a calico pod are mounted as an overlayfs **on / of the
host**. This covers /proc, and thus things like **ps**, **mount** and
co. fail, and basically the whole system becomes unusable until a
reboot.

This is fully reproducible. I first suspected the tmpfs on / to be the
issue and used some disks instead of booting over the network to check
it, but even a regular ext4 on / causes the exact same problem.
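
The mount propagation state itself can be inspected with findmnt (part
of util-linux), which makes it easy to see what --make-shared actually
changes:

    # Show the propagation flag of / before and after the change
    findmnt -o TARGET,PROPAGATION /
    mount --make-shared /
    findmnt -o TARGET,PROPAGATION /   # now reports "shared"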
### docker + calico + kubevirt = other shared mounts

Now, given that cri-o + calico + kubevirt does not lead to the
expected result, what does the same setup with docker look like? With
docker, the calico node pods fail to come up if /sys is not mounted
shared, and the virt-handler pods fail if /run is not mounted shared.

Issuing the following commands makes both work:

    mount --make-shared /sys
    mount --make-shared /run

Two funky findings: the paths are totally different between docker and
cri-o, even though the mapped hostPaths in the pod descriptions are the
same. And why is /sys not being shared a problem for calico under
docker, but not under cri-o?
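
To make the shared mounts survive a reboot on Alpine, one option is a
small boot script for OpenRC's local service (a sketch, assuming the
local service is enabled via **rc-update add local default**):

    #!/bin/sh
    # /etc/local.d/10-shared-mounts.start -- must be executable; run at
    # boot by OpenRC's "local" service
    mount --make-shared /sys
    mount --make-shared /run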
## Log

### Status 2021-06-07

Today I have updated the ceph cluster definition in rook to

* check hosts for new disks every 10 minutes instead of every 60 minutes
* use IPv6 instead of IPv4 (see the sketch below for the relevant knobs)
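
The disk discovery interval is a setting on the rook operator, while
the IP family is part of the CephCluster resource. Roughly, and hedged
(variable and field names as I understand the rook v1.6 docs, so
double-check them against your version):

    # Re-scan hosts for new devices every 10 minutes (operator setting,
    # default is 60m)
    kubectl -n rook-ceph set env deploy/rook-ceph-operator \
        ROOK_DISCOVER_DEVICES_INTERVAL=10m

    # In cluster.yaml, select IPv6 for the ceph daemons:
    #   spec:
    #     network:
    #       ipFamily: "IPv6"
    kubectl apply -f cluster.yaml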
The successful **ceph -s** output:

    [20:42] server47.place7:~/ungleich-k8s/rook# kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph -s
      cluster:
        id:     049110d9-9368-4750-b3d3-6ca9a80553d7
        health: HEALTH_WARN
                mons are allowing insecure global_id reclaim

      services:
        mon: 3 daemons, quorum a,b,d (age 75m)
        mgr: a(active, since 74m), standbys: b
        osd: 6 osds: 6 up (since 43m), 6 in (since 44m)

      data:
        pools:   2 pools, 33 pgs
        objects: 6 objects, 34 B
        usage:   37 MiB used, 45 GiB / 45 GiB avail
        pgs:     33 active+clean
The result is a working ceph cluster with RBD support. I also applied
the cephfs manifest, however RWX volumes (ReadWriteMany) are not yet
spinning up. It seems that the [helm charts](https://artifacthub.io/)
I am testing often require RWX instead of RWO (ReadWriteOnce) access.
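
For testing RWX, a claim along these lines should bind against the
cephfs storage class once it works (a hedged sketch; the storage class
name rook-cephfs comes from the rook examples and may differ here):

    cat <<'EOF' | kubectl apply -f -
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: rwx-test
    spec:
      accessModes:
        - ReadWriteMany
      resources:
        requests:
          storage: 1Gi
      storageClassName: rook-cephfs
    EOF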
Also the ceph dashboard does not come up, even though it is
configured:

    [20:44] server47.place7:~# kubectl -n rook-ceph get svc
    NAME                       TYPE        CLUSTER-IP              EXTERNAL-IP   PORT(S)             AGE
    csi-cephfsplugin-metrics   ClusterIP   2a0a:e5c0:13:e2::760b   <none>        8080/TCP,8081/TCP   82m
    csi-rbdplugin-metrics      ClusterIP   2a0a:e5c0:13:e2::482d   <none>        8080/TCP,8081/TCP   82m
    rook-ceph-mgr              ClusterIP   2a0a:e5c0:13:e2::6ab9   <none>        9283/TCP            77m
    rook-ceph-mgr-dashboard    ClusterIP   2a0a:e5c0:13:e2::5a14   <none>        7000/TCP            77m
    rook-ceph-mon-a            ClusterIP   2a0a:e5c0:13:e2::c39e   <none>        6789/TCP,3300/TCP   83m
    rook-ceph-mon-b            ClusterIP   2a0a:e5c0:13:e2::732a   <none>        6789/TCP,3300/TCP   81m
    rook-ceph-mon-d            ClusterIP   2a0a:e5c0:13:e2::c658   <none>        6789/TCP,3300/TCP   76m
    [20:44] server47.place7:~# curl http://[2a0a:e5c0:13:e2::5a14]:7000
    curl: (7) Failed to connect to 2a0a:e5c0:13:e2::5a14 port 7000: Connection refused
    [20:45] server47.place7:~#
The ceph mgr is perfectly reachable though:

    [20:45] server47.place7:~# curl -s http://[2a0a:e5c0:13:e2::6ab9]:9283/metrics | head
    # HELP ceph_health_status Cluster health status
    # TYPE ceph_health_status untyped
    ceph_health_status 1.0
    # HELP ceph_mon_quorum_status Monitors in quorum
    # TYPE ceph_mon_quorum_status gauge
    ceph_mon_quorum_status{ceph_daemon="mon.a"} 1.0
    ceph_mon_quorum_status{ceph_daemon="mon.b"} 1.0
    ceph_mon_quorum_status{ceph_daemon="mon.d"} 1.0
    # HELP ceph_fs_metadata FS Metadata
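
To narrow the dashboard issue down, the mgr can be asked which modules
it has enabled and which endpoints it exposes; these are standard ceph
commands run through the rook toolbox:

    # List enabled mgr modules and the endpoints they serve
    kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph mgr module ls
    kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph mgr services

    # If the dashboard module turns out to be disabled, enable it by hand
    kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph mgr module enable dashboard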
### Status 2021-06-06

Today is the first day of publishing these findings, so this blog
article still lacks quite a bit of information. If you are curious and
want to know more than what is published yet, you can find me on
Matrix in the **#hacking:ungleich.ch** room.
#### What works so far

* Spawning IPv6 only pods works
* Spawning IPv6 only services works
* BGP peering and ECMP routes with the upstream infrastructure work

Here's the output of the upstream bird process for the routes from k8s:

    bird> show route
    Table master6:
    2a0a:e5c0:13:e2::/108    unicast [place7-server1 23:45:21.589] * (100) [AS65534i]
            via 2a0a:e5c0:13:0:225:b3ff:fe20:3554 on eth0
                             unicast [place7-server3 2021-06-05] (100) [AS65534i]
            via 2a0a:e5c0:13:0:224:81ff:fee0:db7a on eth0
                             unicast [place7-server4 2021-06-05] (100) [AS65534i]
            via 2a0a:e5c0:13:0:225:b3ff:fe20:3564 on eth0
                             unicast [place7-server2 2021-06-05] (100) [AS65534i]
            via 2a0a:e5c0:13:0:225:b3ff:fe20:38cc on eth0
    2a0a:e5c0:13:e1:176b:eaa6:6d47:1c40/122 unicast [place7-server1 23:45:21.589] * (100) [AS65534i]
            via 2a0a:e5c0:13:0:225:b3ff:fe20:3554 on eth0
                             unicast [place7-server4 23:45:21.591] (100) [AS65534i]
            via 2a0a:e5c0:13:0:225:b3ff:fe20:3564 on eth0
                             unicast [place7-server3 23:45:21.591] (100) [AS65534i]
            via 2a0a:e5c0:13:0:224:81ff:fee0:db7a on eth0
                             unicast [place7-server2 23:45:21.589] (100) [AS65534i]
            via 2a0a:e5c0:13:0:225:b3ff:fe20:38cc on eth0
    2a0a:e5c0:13:e1:e0d1:d390:343e:8480/122 unicast [place7-server1 23:45:21.589] * (100) [AS65534i]
            via 2a0a:e5c0:13:0:225:b3ff:fe20:3554 on eth0
                             unicast [place7-server3 2021-06-05] (100) [AS65534i]
            via 2a0a:e5c0:13:0:224:81ff:fee0:db7a on eth0
                             unicast [place7-server4 2021-06-05] (100) [AS65534i]
            via 2a0a:e5c0:13:0:225:b3ff:fe20:3564 on eth0
                             unicast [place7-server2 2021-06-05] (100) [AS65534i]
            via 2a0a:e5c0:13:0:225:b3ff:fe20:38cc on eth0
    2a0a:e5c0:13::/48        unreachable [v6 2021-05-16] * (200)
    2a0a:e5c0:13:e1:9b19:7142:bebb:4d80/122 unicast [place7-server1 23:45:21.589] * (100) [AS65534i]
            via 2a0a:e5c0:13:0:225:b3ff:fe20:3554 on eth0
                             unicast [place7-server3 2021-06-05] (100) [AS65534i]
            via 2a0a:e5c0:13:0:224:81ff:fee0:db7a on eth0
                             unicast [place7-server4 2021-06-05] (100) [AS65534i]
            via 2a0a:e5c0:13:0:225:b3ff:fe20:3564 on eth0
                             unicast [place7-server2 2021-06-05] (100) [AS65534i]
            via 2a0a:e5c0:13:0:225:b3ff:fe20:38cc on eth0
    bird>
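
On the upstream side, bird (2.x) essentially just needs one BGP
protocol block per k8s node that accepts routes originated by the
cluster AS. A hedged sketch, not the actual ungleich configuration;
the upstream's local AS (65533) and the include path are placeholders:

    # One protocol block per k8s node; 65534 is the cluster AS seen in
    # the output above.
    cat > /etc/bird.d/k8s-place7-server1.conf <<'EOF'
    protocol bgp place7_server1 {
            local as 65533;
            neighbor 2a0a:e5c0:13:0:225:b3ff:fe20:3554 as 65534;
            ipv6 {
                    import where net ~ [ 2a0a:e5c0:13::/48{48,128} ];
                    export none;
            };
    }
    EOF
    birdc configure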
#### What doesn't work

* Rook does not format/spin up all disks
* Deleting all rook components fails (**kubectl delete -f cluster.yaml**
  hangs forever)
* Spawning VMs fails with **error: unable to recognize "vmi.yaml": no matches for kind "VirtualMachineInstance" in version "kubevirt.io/v1"** (see the check below)
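
The VirtualMachineInstance error smells like the kubevirt CRDs are
either missing or only serving an older API version; that can be
checked directly (a hedged debugging sketch):

    # Which kubevirt resources does the API server know about?
    kubectl api-resources | grep -i kubevirt

    # Which versions does the VMI CRD actually serve?
    kubectl get crd virtualmachineinstances.kubevirt.io \
        -o jsonpath='{.spec.versions[*].name}'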
[[!tag kubernetes ipv6]]