[[!meta title="Building an IPv6 only kubernetes cluster"]]

## Introduction

For a few weeks I have been working on my pet project to create a
production ready kubernetes cluster that runs in an IPv6 only
environment.

As the complexity and challenges of this project are rather
interesting, I decided to start documenting them in this blog post.

The [ungleich-k8s](https://code.ungleich.ch/ungleich-public/ungleich-k8s)
repository contains all snippets and the latest code.

## Objective

The kubernetes cluster should support the following workloads:

* Matrix Chat instances (Synapse+postgres+nginx+element)
* Virtual Machines (via kubevirt)
* Provide storage to internal and external consumers using Ceph

## Components

The following is a list of components that I am using so far. This
might change on the way, but I already wanted to list what I selected
and why.

### OS: Alpine Linux

The operating system of choice to run the k8s cluster is
[Alpine Linux](https://www.alpinelinux.org/) as it is small, stable
and supports both docker and cri-o.

### Container management: docker

Originally I started with [cri-o](https://cri-o.io/). However, using
cri-o together with kubevirt and calico results in an overlayfs being
placed on / of the host, which breaks the host completely (see below
for details).

Docker, while being deprecated, at least allows me to get kubevirt
running, generally speaking.

### Networking: IPv6 only, calico

I wanted to go with [cilium](https://cilium.io/) first, because it
goes down the eBPF route from the get go. However, cilium does not yet
contain native and automated BGP peering with the upstream
infrastructure, so managing node / IP network peering becomes a
tedious, manual and error prone task. Cilium is on the way to improve
this, but is not there yet.

[Calico](https://www.projectcalico.org/) on the other hand still
relies on ip(6)tables and kube-proxy for forwarding traffic, but has
had proper BGP support for a long time.
Calico also aims to add eBPF support, however at the moment its eBPF
mode does not support IPv6 yet (bummer!).

### Storage: rook

[Rook](https://rook.io/) seems to be the first choice if you look at
who provides storage in the k8s world. It looks rather proper, even
though some knobs are not yet clear to me.

Rook, in my opinion, is a direct alternative to running cephadm, which
requires systemd running on your hosts. Which, given Alpine Linux,
will never be the case.

### Virtualisation

[Kubevirt](https://kubevirt.io/) seems to provide a good
interface. Mid term, kubevirt is projected to replace
[OpenNebula](https://opennebula.io/) at
[ungleich](https://ungleich.ch).

## Challenges

### cri-o + calico + kubevirt = broken host

So this is a rather funky one. If you deploy cri-o and calico,
everything works. If you then deploy kubevirt, the **virt-handler**
pod fails to come up with the error message

    Error: path "/var/run/kubevirt" is mounted on "/" but it is not a shared mount.

On the Internet there are two recommendations to fix this:

* Fix the systemd unit for docker: Obviously, as I am using neither
  systemd nor docker here, this is not applicable...
* Issue **mount --make-shared /**

The second command has a very strange side effect: after issuing it,
the contents of a calico pod are mounted as an overlayfs **on / of the
host**. This covers /proc and thus things like **ps**, **mount** and
co. fail and basically the whole system becomes unusable until reboot.

This is fully reproducible. I first suspected the tmpfs on / to be the
issue, so I used some disks instead of booting over the network to
check it, but even a regular ext4 on / causes the exact same problem.

### docker + calico + kubevirt = other shared mounts

Now, given that cri-o + calico + kubevirt does not lead to the
expected result, what does the same setup with docker look like?
With docker, the calico node pods fail to come up if /sys is not a
shared mount, and the virt-handler pods fail if /run is not a shared
mount.
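Whether a path currently is a shared mount can be read from
`/proc/self/mountinfo`: shared mounts carry a `shared:N` tag in the
optional fields in front of the `-` separator. The following is only a
minimal Python sketch for checking this, it is not part of
ungleich-k8s. The function name `is_shared_mount` and the sample
mountinfo lines are made up for illustration.

```python
from typing import Optional

# Made-up sample in the format of /proc/self/mountinfo (assumption,
# not taken from the real hosts): /sys is private, /run is shared.
SAMPLE_MOUNTINFO = """\
22 27 0:21 / /sys rw,nosuid,nodev,noexec,relatime - sysfs sysfs rw
24 27 0:23 / /run rw,nosuid,nodev shared:5 - tmpfs tmpfs rw,mode=755
"""


def is_shared_mount(mountinfo_text: str, mount_point: str) -> Optional[bool]:
    """Return True/False for the propagation state of mount_point,
    or None if mount_point is not a mount point at all."""
    result = None
    for line in mountinfo_text.splitlines():
        fields = line.split()
        try:
            # Optional fields sit between index 6 and the "-" separator.
            separator = fields.index("-", 6)
        except ValueError:
            continue  # not a well-formed mountinfo line
        if fields[4] != mount_point:
            continue
        optional_fields = fields[6:separator]
        # Later lines win: they are the mounts stacked on top.
        result = any(f.startswith("shared:") for f in optional_fields)
    return result


for path in ("/sys", "/run"):
    print(path, is_shared_mount(SAMPLE_MOUNTINFO, path))
```

On a real host you would pass `open("/proc/self/mountinfo").read()`
instead of the sample text.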
Two funky findings:

Issuing the following commands makes both work:

    mount --make-shared /sys
    mount --make-shared /run

The paths are totally different between docker and cri-o, even though
the mapped hostpaths in the pod description are the same. And why is
/sys not being shared not a problem for calico in cri-o?

## Log

### Status 2021-06-07

Today I have updated the ceph cluster definition in rook to

* check hosts every 10 minutes instead of 60m for new disks
* use IPv6 instead of IPv4

The output of **ceph -s**:

    [20:41] server47.place7:~/ungleich-k8s/rook# kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph -s
      cluster:
        id:     049110d9-9368-4750-b3d3-6ca9a80553d7
        health: HEALTH_WARN
                mons are allowing insecure global_id reclaim

      services:
        mon: 3 daemons, quorum a,b,d (age 72m)
        mgr: a(active, since 72m), standbys: b
        osd: 6 osds: 6 up (since 41m), 6 in (since 42m)

      data:
        pools:   2 pools, 33 pgs
        objects: 6 objects, 34 B
        usage:   37 MiB used, 45 GiB / 45 GiB avail
        pgs:     33 active+clean

The result is a working ceph cluster with RBD support. I also applied
the cephfs manifest, however RWX volumes (readwritemany) are not yet
spinning up. It seems that test [helm charts](https://artifacthub.io/)
often require RWX instead of RWO (readwriteonce) access.
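To make the RWO/RWX difference concrete, here is a small sketch that
builds two PersistentVolumeClaim manifests which differ only in access
mode and storage class. It is an illustration, not what is deployed in
the cluster: the helper `pvc_manifest`, the claim names and the size are
made up, and the storage class names `rook-ceph-block` and `rook-cephfs`
are assumptions based on the names used in rook's example manifests, so
yours may differ.

```python
import json


def pvc_manifest(name, access_mode, storage_class, size="1Gi"):
    """Build a PersistentVolumeClaim manifest as a plain dict."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "accessModes": [access_mode],
            # Assumed storage class name, adjust to your cluster.
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }


# RWO: mounted read-write by a single node, backed by RBD.
rwo = pvc_manifest("test-rwo", "ReadWriteOnce", "rook-ceph-block")
# RWX: mounted read-write by many nodes, needs cephfs.
rwx = pvc_manifest("test-rwx", "ReadWriteMany", "rook-cephfs")

print(json.dumps(rwx, indent=2))
```

As kubectl also accepts JSON, the printed manifest can be piped into
**kubectl apply -f -** to check whether an RWX claim gets bound.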
Also the ceph dashboard does not come up, even though it is
configured:

    [20:44] server47.place7:~# kubectl -n rook-ceph get svc
    NAME                       TYPE        CLUSTER-IP              EXTERNAL-IP   PORT(S)             AGE
    csi-cephfsplugin-metrics   ClusterIP   2a0a:e5c0:13:e2::760b   <none>        8080/TCP,8081/TCP   82m
    csi-rbdplugin-metrics      ClusterIP   2a0a:e5c0:13:e2::482d   <none>        8080/TCP,8081/TCP   82m
    rook-ceph-mgr              ClusterIP   2a0a:e5c0:13:e2::6ab9   <none>        9283/TCP            77m
    rook-ceph-mgr-dashboard    ClusterIP   2a0a:e5c0:13:e2::5a14   <none>        7000/TCP            77m
    rook-ceph-mon-a            ClusterIP   2a0a:e5c0:13:e2::c39e   <none>        6789/TCP,3300/TCP   83m
    rook-ceph-mon-b            ClusterIP   2a0a:e5c0:13:e2::732a   <none>        6789/TCP,3300/TCP   81m
    rook-ceph-mon-d            ClusterIP   2a0a:e5c0:13:e2::c658   <none>        6789/TCP,3300/TCP   76m
    [20:44] server47.place7:~# curl http://[2a0a:e5c0:13:e2::5a14]:7000
    curl: (7) Failed to connect to 2a0a:e5c0:13:e2::5a14 port 7000: Connection refused
    [20:45] server47.place7:~#

The ceph mgr is perfectly reachable though:

    [20:45] server47.place7:~# curl -s http://[2a0a:e5c0:13:e2::6ab9]:9283/metrics | head
    # HELP ceph_health_status Cluster health status
    # TYPE ceph_health_status untyped
    ceph_health_status 1.0
    # HELP ceph_mon_quorum_status Monitors in quorum
    # TYPE ceph_mon_quorum_status gauge
    ceph_mon_quorum_status{ceph_daemon="mon.a"} 1.0
    ceph_mon_quorum_status{ceph_daemon="mon.b"} 1.0
    ceph_mon_quorum_status{ceph_daemon="mon.d"} 1.0
    # HELP ceph_fs_metadata FS Metadata

### Status 2021-06-06

Today is the first day of publishing the findings and this blog
article will lack quite some information. If you are curious and want
to know more that is not yet published, you can find me on Matrix in
the **#hacking:ungleich.ch** room.
#### What works so far

* Spawning IPv6 only pods works
* Spawning IPv6 only services works
* BGP peering and ECMP routes with the upstream infrastructure work

Here's the output of the upstream bird process for the routes from
k8s:

    bird> show route
    Table master6:
    2a0a:e5c0:13:e2::/108 unicast [place7-server1 23:45:21.589] * (100) [AS65534i]
            via 2a0a:e5c0:13:0:225:b3ff:fe20:3554 on eth0
                         unicast [place7-server3 2021-06-05] (100) [AS65534i]
            via 2a0a:e5c0:13:0:224:81ff:fee0:db7a on eth0
                         unicast [place7-server4 2021-06-05] (100) [AS65534i]
            via 2a0a:e5c0:13:0:225:b3ff:fe20:3564 on eth0
                         unicast [place7-server2 2021-06-05] (100) [AS65534i]
            via 2a0a:e5c0:13:0:225:b3ff:fe20:38cc on eth0
    2a0a:e5c0:13:e1:176b:eaa6:6d47:1c40/122 unicast [place7-server1 23:45:21.589] * (100) [AS65534i]
            via 2a0a:e5c0:13:0:225:b3ff:fe20:3554 on eth0
                         unicast [place7-server4 23:45:21.591] (100) [AS65534i]
            via 2a0a:e5c0:13:0:225:b3ff:fe20:3564 on eth0
                         unicast [place7-server3 23:45:21.591] (100) [AS65534i]
            via 2a0a:e5c0:13:0:224:81ff:fee0:db7a on eth0
                         unicast [place7-server2 23:45:21.589] (100) [AS65534i]
            via 2a0a:e5c0:13:0:225:b3ff:fe20:38cc on eth0
    2a0a:e5c0:13:e1:e0d1:d390:343e:8480/122 unicast [place7-server1 23:45:21.589] * (100) [AS65534i]
            via 2a0a:e5c0:13:0:225:b3ff:fe20:3554 on eth0
                         unicast [place7-server3 2021-06-05] (100) [AS65534i]
            via 2a0a:e5c0:13:0:224:81ff:fee0:db7a on eth0
                         unicast [place7-server4 2021-06-05] (100) [AS65534i]
            via 2a0a:e5c0:13:0:225:b3ff:fe20:3564 on eth0
                         unicast [place7-server2 2021-06-05] (100) [AS65534i]
            via 2a0a:e5c0:13:0:225:b3ff:fe20:38cc on eth0
    2a0a:e5c0:13::/48    unreachable [v6 2021-05-16] * (200)
    2a0a:e5c0:13:e1:9b19:7142:bebb:4d80/122 unicast [place7-server1 23:45:21.589] * (100) [AS65534i]
            via 2a0a:e5c0:13:0:225:b3ff:fe20:3554 on eth0
                         unicast [place7-server3 2021-06-05] (100) [AS65534i]
            via 2a0a:e5c0:13:0:224:81ff:fee0:db7a on eth0
                         unicast [place7-server4 2021-06-05] (100) [AS65534i]
            via 2a0a:e5c0:13:0:225:b3ff:fe20:3564 on eth0
                         unicast [place7-server2 2021-06-05] (100) [AS65534i]
            via 2a0a:e5c0:13:0:225:b3ff:fe20:38cc on eth0
    bird>

#### What doesn't work

* Rook does not format/spin up all disks
* Deleting all rook components fails (**kubectl delete -f
  cluster.yaml** hangs forever)
* Spawning VMs fails with **error: unable to recognize "vmi.yaml": no
  matches for kind "VirtualMachineInstance" in version
  "kubevirt.io/v1"**

[[!tag kubernetes ipv6]]