[[!meta title="Building an IPv6 only kubernetes cluster"]]

## Introduction

For a few weeks I have been working on my pet project of creating a
production ready kubernetes cluster that runs in an IPv6 only
environment.

As the complexity and challenges of this project are rather
interesting, I decided to start documenting them in this blog post.

The
[ungleich-k8s](https://code.ungleich.ch/ungleich-public/ungleich-k8s)
repository contains all snippets and the latest code.


## Objective

The kubernetes cluster should support the following workloads:

* Matrix Chat instances (Synapse+postgres+nginx+element)
* Virtual Machines (via kubevirt)
* Providing storage to internal and external consumers using Ceph

## Components

The following is a list of components that I am using so far. This
might change along the way, but I wanted to document what I selected
and why.

### OS: Alpine Linux

The operating system of choice to run the k8s cluster is
[Alpine Linux](https://www.alpinelinux.org/), as it is small, stable
and supports both docker and cri-o.

### Container management: docker

Originally I started with [cri-o](https://cri-o.io/). However, using
cri-o together with kubevirt and calico results in an overlayfs being
placed on / of the host, which breaks most of the host's
functionality (see below for details).

Docker, while being deprecated, at least allows me to get kubevirt
running, generally speaking.

### Networking: IPv6 only, calico

I wanted to go with [cilium](https://cilium.io/) first, because it
goes down the eBPF route from the get-go. However, cilium does not
yet offer native and automated BGP peering with the upstream
infrastructure, so managing node / IP network peering becomes a
tedious, manual and error-prone task. Cilium is on the way to improve
this, but is not there yet.

[Calico](https://www.projectcalico.org/) on the other hand still
relies on ip(6)tables and kube-proxy for forwarding traffic, but has
had proper BGP support for a long time. Calico also aims to add eBPF
support, however at the moment its eBPF support does not cover IPv6
yet (bummer!).

### Storage: rook

[Rook](https://rook.io/) seems to be the first choice if you look at
who is providing storage in the k8s world. It looks rather proper,
even though some knobs are not yet clear to me.

Rook, in my opinion, is a direct alternative to running cephadm,
which requires systemd on your hosts. Which, given Alpine Linux, will
never be the case.

### Virtualisation

[Kubevirt](https://kubevirt.io/) seems to provide a good
interface. Mid-term, kubevirt is projected to replace
[OpenNebula](https://opennebula.io/) at
[ungleich](https://ungleich.ch).


## Challenges

### cri-o + calico + kubevirt = broken host

So this is a rather funky one. If you deploy cri-o and calico,
everything works. If you then deploy kubevirt, the **virt-handler**
pod fails to come up with the error message

    Error: path "/var/run/kubevirt" is mounted on "/" but it is not a shared mount.

On the Internet there are two recommendations to fix this:

* Fix the systemd unit for docker: obviously, as I am using neither
  systemd nor docker here, this is not applicable...
* Issue **mount --make-shared /**
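As a side note, whether / (or any other path) currently is a shared
mount can be inspected with findmnt. This is just a quick sketch of
mine, assuming util-linux is installed on the host:

    # show the propagation flag (shared, private, ...) of the
    # filesystem that contains a given path; findmnt is part of util-linux
    findmnt -o TARGET,PROPAGATION -T /
    findmnt -o TARGET,PROPAGATION -T /var/run/kubevirt

If / shows up as private, that is exactly what the virt-handler error
above is complaining about.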
The second recommendation has a very strange side effect: after
issuing that mount command, the contents of a calico pod are mounted
as an overlayfs **on / of the host**. This covers /proc and thus
things like **ps**, **mount** and co. fail; basically the whole
system becomes unusable until a reboot.

This is fully reproducible. I first suspected the tmpfs on / to be
the issue and used some disks instead of booting over the network to
check it, but even a regular ext4 on / causes the exact same problem.

### docker + calico + kubevirt = other shared mounts

Now, given that cri-o + calico + kubevirt does not lead to the
expected result, what does the same setup with docker look like? With
docker, the calico node pods fail to come up if /sys is not a shared
mount, and the virt-handler pods fail if /run is not a shared mount.

Two funky findings here. Issuing the following commands makes both
work:

    mount --make-shared /sys
    mount --make-shared /run

First, the paths are totally different between docker and cri-o, even
though the mapped hostpaths in the pod descriptions are the same. And
second, why is /sys not being shared not a problem for calico under
cri-o?

## Log

### Status 2021-06-06

Today is the first day of publishing these findings, so this blog
article still lacks quite some information. If you are curious and
want to know more than is published yet, you can find me on Matrix in
the **#hacking:ungleich.ch** room.

### What works so far

* Spawning IPv6 only pods works
* Spawning IPv6 only services works
* BGP peering and ECMP routes with the upstream infrastructure work

Here's the output of the upstream bird process for the routes from
k8s:

    bird> show route
    Table master6:
    2a0a:e5c0:13:e2::/108    unicast [place7-server1 23:45:21.589] * (100) [AS65534i]
            via 2a0a:e5c0:13:0:225:b3ff:fe20:3554 on eth0
                             unicast [place7-server3 2021-06-05] (100) [AS65534i]
            via 2a0a:e5c0:13:0:224:81ff:fee0:db7a on eth0
                             unicast [place7-server4 2021-06-05] (100) [AS65534i]
            via 2a0a:e5c0:13:0:225:b3ff:fe20:3564 on eth0
                             unicast [place7-server2 2021-06-05] (100) [AS65534i]
            via 2a0a:e5c0:13:0:225:b3ff:fe20:38cc on eth0
    2a0a:e5c0:13:e1:176b:eaa6:6d47:1c40/122    unicast [place7-server1 23:45:21.589] * (100) [AS65534i]
            via 2a0a:e5c0:13:0:225:b3ff:fe20:3554 on eth0
                             unicast [place7-server4 23:45:21.591] (100) [AS65534i]
            via 2a0a:e5c0:13:0:225:b3ff:fe20:3564 on eth0
                             unicast [place7-server3 23:45:21.591] (100) [AS65534i]
            via 2a0a:e5c0:13:0:224:81ff:fee0:db7a on eth0
                             unicast [place7-server2 23:45:21.589] (100) [AS65534i]
            via 2a0a:e5c0:13:0:225:b3ff:fe20:38cc on eth0
    2a0a:e5c0:13:e1:e0d1:d390:343e:8480/122    unicast [place7-server1 23:45:21.589] * (100) [AS65534i]
            via 2a0a:e5c0:13:0:225:b3ff:fe20:3554 on eth0
                             unicast [place7-server3 2021-06-05] (100) [AS65534i]
            via 2a0a:e5c0:13:0:224:81ff:fee0:db7a on eth0
                             unicast [place7-server4 2021-06-05] (100) [AS65534i]
            via 2a0a:e5c0:13:0:225:b3ff:fe20:3564 on eth0
                             unicast [place7-server2 2021-06-05] (100) [AS65534i]
            via 2a0a:e5c0:13:0:225:b3ff:fe20:38cc on eth0
    2a0a:e5c0:13::/48        unreachable [v6 2021-05-16] * (200)
    2a0a:e5c0:13:e1:9b19:7142:bebb:4d80/122    unicast [place7-server1 23:45:21.589] * (100) [AS65534i]
            via 2a0a:e5c0:13:0:225:b3ff:fe20:3554 on eth0
                             unicast [place7-server3 2021-06-05] (100) [AS65534i]
            via 2a0a:e5c0:13:0:224:81ff:fee0:db7a on eth0
                             unicast [place7-server4 2021-06-05] (100) [AS65534i]
            via 2a0a:e5c0:13:0:225:b3ff:fe20:3564 on eth0
                             unicast [place7-server2 2021-06-05] (100) [AS65534i]
            via 2a0a:e5c0:13:0:225:b3ff:fe20:38cc on eth0
    bird>
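To make the first two items a bit more tangible: this is roughly how
I check that pods and services really come up IPv6 only. It is only a
sketch; the image and the names are arbitrary and not taken from the
actual cluster:

    # start a throwaway pod and check that its address comes from the
    # IPv6 pod CIDR
    kubectl run nginx-test --image=nginx
    kubectl get pod nginx-test -o wide

    # expose it and check that the ClusterIP is an IPv6 address from
    # the service CIDR
    kubectl expose pod nginx-test --port=80
    kubectl get service nginx-test

In a single stack IPv6 cluster no extra ipFamilies configuration is
needed on the service side; IPv6 addresses are the only option.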
### What doesn't work

* Rook does not format/spin up all disks
* Deleting all rook components fails (**kubectl delete -f
  cluster.yaml hangs** forever)
* Spawning VMs fails with **error: unable to recognize "vmi.yaml": no matches for kind "VirtualMachineInstance" in version "kubevirt.io/v1"**


[[!tag kubernetes ipv6]]