[[!meta title="Building an IPv6-only Kubernetes cluster"]]

## Introduction

For a few weeks I have been working on my pet project of creating a
production-ready Kubernetes cluster that runs in an IPv6-only
environment.

As the complexity and challenges of this project are rather
interesting, I decided to start documenting them in this blog post.

The [ungleich-k8s](https://code.ungleich.ch/ungleich-public/ungleich-k8s)
repository contains all snippets and the latest code.

## Objective

The Kubernetes cluster should support the following workloads:

* Matrix chat instances (Synapse+postgres+nginx+element)
* Virtual machines (via kubevirt)
* Storage provided to internal and external consumers using Ceph

## Components

The following is a list of components that I am using so far. This
might change along the way, but I wanted to list what I have selected
and why.

### OS: Alpine Linux

The operating system of choice to run the k8s cluster is
[Alpine Linux](https://www.alpinelinux.org/) as it is small, stable
and supports both docker and cri-o.

### Container management: docker

Originally I started with [cri-o](https://cri-o.io/). However, using
cri-o together with kubevirt and calico results in an overlayfs being
mounted on / of the host, which breaks the host almost entirely (see
below for details).

Docker, while being deprecated as a container runtime in Kubernetes,
at least allows me to get kubevirt running, generally speaking.

### Networking: IPv6 only, calico

I wanted to go with [cilium](https://cilium.io/) first, because it
goes down the eBPF route from the get-go. However, cilium does not yet
offer native and automated BGP peering with the upstream
infrastructure, so managing node / IP network peering becomes a
tedious, manual and error-prone task. Cilium is on its way to
improving this, but it is not there yet.

[Calico](https://www.projectcalico.org/), on the other hand, still
relies on ip(6)tables and kube-proxy for forwarding traffic, but has
had proper BGP support for a long time. Calico also aims to add eBPF
support, however its eBPF data plane does not support IPv6 yet
(bummer!).

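
For reference, BGP peering on the calico side is configured with the
BGPConfiguration and BGPPeer resources. The following is a minimal
sketch, not my final configuration: the peer address and the upstream
ASN are placeholders, while AS 65534 and the advertised service CIDR
match what shows up in the bird output further below.

    apiVersion: projectcalico.org/v3
    kind: BGPConfiguration
    metadata:
      name: default
    spec:
      # ASN used by the calico nodes
      asNumber: 65534
      # advertise the IPv6 service CIDR to the BGP peers
      serviceClusterIPs:
      - cidr: 2a0a:e5c0:13:e2::/108
    ---
    apiVersion: projectcalico.org/v3
    kind: BGPPeer
    metadata:
      name: upstream-router
    spec:
      # upstream router address and ASN (placeholders)
      peerIP: 2a0a:e5c0:13::1
      asNumber: 65533

These resources are applied with calicoctl.
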
### Storage: rook

[Rook](https://rook.io/) seems to be the first choice if you look at
who is providing storage in the k8s world. It looks rather solid, even
though some of its knobs are not yet clear to me.

Rook, in my opinion, is a direct alternative to running cephadm, which
requires systemd on your hosts; given Alpine Linux, that will never be
the case.

### Virtualisation

[Kubevirt](https://kubevirt.io/) seems to provide a good
interface. Mid-term, kubevirt is projected to replace
[OpenNebula](https://opennebula.io/) at
[ungleich](https://ungleich.ch).

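
To give an idea of the interface: with kubevirt a VM is just another
Kubernetes object. Below is a minimal VirtualMachineInstance manifest
adapted from the upstream kubevirt examples (the cirros demo image);
it is an illustration, not one of my production manifests.

    apiVersion: kubevirt.io/v1
    kind: VirtualMachineInstance
    metadata:
      name: testvmi
    spec:
      domain:
        devices:
          disks:
          # boot disk backed by a container image
          - name: containerdisk
            disk:
              bus: virtio
        resources:
          requests:
            memory: 64Mi
      volumes:
      - name: containerdisk
        containerDisk:
          image: quay.io/kubevirt/cirros-container-disk-demo
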
## Challenges
### cri-o + calico + kubevirt = broken host

So this is a rather funky one. If you deploy cri-o and calico,
everything works. If you then deploy kubevirt, the **virt-handler**
pod fails to come up with the error message:

    Error: path "/var/run/kubevirt" is mounted on "/" but it is not a shared mount.


On the Internet there are two recommendations to fix this:

* Fix the systemd unit for docker: obviously not applicable here, as I
  am using neither systemd nor docker...
* Issue **mount --make-shared /**


The second command has a very strange side effect: after issuing it,
the contents of a calico pod are mounted as an overlayfs **on / of the
host**. This covers /proc, so things like **ps**, **mount** and co.
fail and basically the whole system becomes unusable until a reboot.

This is fully reproducible. I first suspected the tmpfs on / to be the
issue and used some disks instead of booting over the network to check
it, but even a regular ext4 on / causes the exact same problem.

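
To see what the propagation flags actually look like before and after
such a command, the mount propagation can be inspected directly; a
small diagnostic sketch (findmnt comes from util-linux and is not part
of Alpine's busybox, so it may need to be installed first):

    # show the propagation flag (private/shared/slave) of every mount
    findmnt -o TARGET,PROPAGATION

    # only the root filesystem
    findmnt -o TARGET,PROPAGATION /

    # alternative without findmnt: the root mount line carries a
    # "shared:N" tag in its optional fields if / is a shared mount
    awk '$5 == "/"' /proc/self/mountinfo
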
### docker + calico + kubevirt = other shared mounts

Now, given that cri-o + calico + kubevirt does not lead to the
expected result, what does the same setup with docker look like? With
docker, the calico node pods fail to come up if /sys is not a shared
mount, and the virt-handler pods fail if /run is not a shared mount.

Issuing the following commands makes both work:

    mount --make-shared /sys
    mount --make-shared /run

Two funky findings remain: the paths are totally different between
docker and cri-o, even though the mapped hostPaths in the pod
descriptions are the same. And why is /sys not being shared a problem
for calico with docker, but not with cri-o?

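
As these remounts do not survive a reboot, they need to be persisted
somewhere. A minimal sketch for Alpine Linux, assuming OpenRC's
**local** service is used (the file name is arbitrary):

    #!/bin/sh
    # /etc/local.d/10-shared-mounts.start
    # Make /sys and /run shared mounts at boot, as required by the
    # calico node pods (with docker) and the kubevirt virt-handler pods.
    mount --make-shared /sys
    mount --make-shared /run

The script has to be executable and the local service enabled, i.e.
**chmod +x /etc/local.d/10-shared-mounts.start** and
**rc-update add local default**.
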
## Log
### Status 2021-06-07

Today I have updated the ceph cluster definition in rook to

* check hosts for new disks every 10 minutes instead of every 60 minutes
* use IPv6 instead of IPv4

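
Roughly, these two changes correspond to the following settings; this
is a sketch based on the Rook v1.6-era manifests, the exact keys may
differ between versions:

    # operator.yaml (rook-ceph-operator-config ConfigMap):
    # scan the hosts for new devices every 10 minutes instead of the default 60m
    ROOK_DISCOVER_DEVICES_INTERVAL: "10m"

    # cluster.yaml (CephCluster resource): run the ceph daemons on IPv6
    spec:
      network:
        ipFamily: "IPv6"
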

The successful ceph -s output:

    [20:42] server47.place7:~/ungleich-k8s/rook# kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph -s
      cluster:
        id:     049110d9-9368-4750-b3d3-6ca9a80553d7
        health: HEALTH_WARN
                mons are allowing insecure global_id reclaim

      services:
        mon: 3 daemons, quorum a,b,d (age 75m)
        mgr: a(active, since 74m), standbys: b
        osd: 6 osds: 6 up (since 43m), 6 in (since 44m)

      data:
        pools:   2 pools, 33 pgs
        objects: 6 objects, 34 B
        usage:   37 MiB used, 45 GiB / 45 GiB avail
        pgs:     33 active+clean


The result is a working ceph cluster with RBD support. I also applied
the cephfs manifest, however RWX (ReadWriteMany) volumes are not yet
spinning up. It seems that the test [helm charts](https://artifacthub.io/)
I use often require RWX instead of RWO (ReadWriteOnce) access.

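
For context, RWX vs. RWO is decided by the access mode of the
PersistentVolumeClaim a chart creates; a claim of the kind that needs
cephfs looks roughly like this (the storage class name is whatever the
cephfs manifest creates, **rook-cephfs** here is an assumption):

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: shared-data
    spec:
      accessModes:
      - ReadWriteMany          # RWX: mountable by many pods on many nodes
      storageClassName: rook-cephfs
      resources:
        requests:
          storage: 1Gi
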

Also the ceph dashboard does not come up, even though it is
configured:

    [20:44] server47.place7:~# kubectl -n rook-ceph get svc
    NAME                       TYPE        CLUSTER-IP              EXTERNAL-IP   PORT(S)             AGE
    csi-cephfsplugin-metrics   ClusterIP   2a0a:e5c0:13:e2::760b   <none>        8080/TCP,8081/TCP   82m
    csi-rbdplugin-metrics      ClusterIP   2a0a:e5c0:13:e2::482d   <none>        8080/TCP,8081/TCP   82m
    rook-ceph-mgr              ClusterIP   2a0a:e5c0:13:e2::6ab9   <none>        9283/TCP            77m
    rook-ceph-mgr-dashboard    ClusterIP   2a0a:e5c0:13:e2::5a14   <none>        7000/TCP            77m
    rook-ceph-mon-a            ClusterIP   2a0a:e5c0:13:e2::c39e   <none>        6789/TCP,3300/TCP   83m
    rook-ceph-mon-b            ClusterIP   2a0a:e5c0:13:e2::732a   <none>        6789/TCP,3300/TCP   81m
    rook-ceph-mon-d            ClusterIP   2a0a:e5c0:13:e2::c658   <none>        6789/TCP,3300/TCP   76m
    [20:44] server47.place7:~# curl http://[2a0a:e5c0:13:e2::5a14]:7000
    curl: (7) Failed to connect to 2a0a:e5c0:13:e2::5a14 port 7000: Connection refused
    [20:45] server47.place7:~#


The ceph mgr is perfectly reachable though:

    [20:45] server47.place7:~# curl -s http://[2a0a:e5c0:13:e2::6ab9]:9283/metrics | head
    # HELP ceph_health_status Cluster health status
    # TYPE ceph_health_status untyped
    ceph_health_status 1.0
    # HELP ceph_mon_quorum_status Monitors in quorum
    # TYPE ceph_mon_quorum_status gauge
    ceph_mon_quorum_status{ceph_daemon="mon.a"} 1.0
    ceph_mon_quorum_status{ceph_daemon="mon.b"} 1.0
    ceph_mon_quorum_status{ceph_daemon="mon.d"} 1.0
    # HELP ceph_fs_metadata FS Metadata

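
A next debugging step for the dashboard is to check whether the mgr
actually has the dashboard module enabled and where it thinks it is
listening; via the toolbox, something like:

    # list the endpoints the mgr modules expose (dashboard, prometheus, ...)
    kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph mgr services

    # check whether the dashboard module shows up as enabled
    kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph mgr module ls
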
### Status 2021-06-06

Today is the first day of publishing these findings, so this blog
article still lacks quite some information. If you are curious and
want to know more than what is published here yet, you can find me on
Matrix in the **#hacking:ungleich.ch** room.

#### What works so far

* Spawning IPv6-only pods works
* Spawning IPv6-only services works
* BGP peering and ECMP routes with the upstream infrastructure work

Here's the output of the upstream bird process for the routes from k8s:

    bird> show route
    Table master6:
    2a0a:e5c0:13:e2::/108   unicast [place7-server1 23:45:21.589] * (100) [AS65534i]
            via 2a0a:e5c0:13:0:225:b3ff:fe20:3554 on eth0
                            unicast [place7-server3 2021-06-05] (100) [AS65534i]
            via 2a0a:e5c0:13:0:224:81ff:fee0:db7a on eth0
                            unicast [place7-server4 2021-06-05] (100) [AS65534i]
            via 2a0a:e5c0:13:0:225:b3ff:fe20:3564 on eth0
                            unicast [place7-server2 2021-06-05] (100) [AS65534i]
            via 2a0a:e5c0:13:0:225:b3ff:fe20:38cc on eth0
    2a0a:e5c0:13:e1:176b:eaa6:6d47:1c40/122 unicast [place7-server1 23:45:21.589] * (100) [AS65534i]
            via 2a0a:e5c0:13:0:225:b3ff:fe20:3554 on eth0
                            unicast [place7-server4 23:45:21.591] (100) [AS65534i]
            via 2a0a:e5c0:13:0:225:b3ff:fe20:3564 on eth0
                            unicast [place7-server3 23:45:21.591] (100) [AS65534i]
            via 2a0a:e5c0:13:0:224:81ff:fee0:db7a on eth0
                            unicast [place7-server2 23:45:21.589] (100) [AS65534i]
            via 2a0a:e5c0:13:0:225:b3ff:fe20:38cc on eth0
    2a0a:e5c0:13:e1:e0d1:d390:343e:8480/122 unicast [place7-server1 23:45:21.589] * (100) [AS65534i]
            via 2a0a:e5c0:13:0:225:b3ff:fe20:3554 on eth0
                            unicast [place7-server3 2021-06-05] (100) [AS65534i]
            via 2a0a:e5c0:13:0:224:81ff:fee0:db7a on eth0
                            unicast [place7-server4 2021-06-05] (100) [AS65534i]
            via 2a0a:e5c0:13:0:225:b3ff:fe20:3564 on eth0
                            unicast [place7-server2 2021-06-05] (100) [AS65534i]
            via 2a0a:e5c0:13:0:225:b3ff:fe20:38cc on eth0
    2a0a:e5c0:13::/48       unreachable [v6 2021-05-16] * (200)
    2a0a:e5c0:13:e1:9b19:7142:bebb:4d80/122 unicast [place7-server1 23:45:21.589] * (100) [AS65534i]
            via 2a0a:e5c0:13:0:225:b3ff:fe20:3554 on eth0
                            unicast [place7-server3 2021-06-05] (100) [AS65534i]
            via 2a0a:e5c0:13:0:224:81ff:fee0:db7a on eth0
                            unicast [place7-server4 2021-06-05] (100) [AS65534i]
            via 2a0a:e5c0:13:0:225:b3ff:fe20:3564 on eth0
                            unicast [place7-server2 2021-06-05] (100) [AS65534i]
            via 2a0a:e5c0:13:0:225:b3ff:fe20:38cc on eth0
    bird>

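
On the upstream side this is plain bird 2.x BGP peering, one protocol
per k8s node, plus path merging in the kernel protocol to get the ECMP
routes. A rough sketch of what one such peering can look like (the
upstream ASN and the filter policy are illustrative, not my actual
configuration):

    # one BGP session per k8s node; AS 65534 matches the calico side
    protocol bgp place7_server1 {
        local as 65533;          # placeholder upstream ASN
        neighbor 2a0a:e5c0:13:0:225:b3ff:fe20:3554 as 65534;
        ipv6 {
            import all;          # accept the pod/service routes from calico
            export none;
        };
    }

    # ECMP: merge equal routes learned from several nodes into one
    # multipath kernel route
    protocol kernel {
        ipv6 { export all; };
        merge paths on;
    }
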
#### What doesn't work

* Rook does not format/spin up all disks
* Deleting all rook components fails (**kubectl delete -f cluster.yaml**
  hangs forever)
* Spawning VMs fails with **error: unable to recognize "vmi.yaml": no matches for kind "VirtualMachineInstance" in version "kubevirt.io/v1"**

[[!tag kubernetes ipv6]]