[[!meta title="Building an IPv6 only kubernetes cluster"]]
## Introduction
For a few weeks I have been working on my pet project to create a production
ready kubernetes cluster that runs in an IPv6 only environment.
As the complexity and challenges of this project are rather
interesting, I decided to start documenting them in this blog post.
The
[ungleich-k8s](https://code.ungleich.ch/ungleich-public/ungleich-k8s)
repository contains all snippets and the latest code.
## Objective
The kubernetes cluster should support the following workloads:
* Matrix Chat instances (Synapse+postgres+nginx+element)
* Virtual Machines (via kubevirt)
* Provide storage to internal and external consumers using Ceph
## Components
The following is a list of components that I am using so far. This
might change on the way, but I wanted to list already what I selected
and why.
### OS: Alpine Linux
The operating system of choice to run the k8s cluster is
[Alpine Linux](https://www.alpinelinux.org/) as it is small, stable
and supports both docker and cri-o.
### Container management: docker
Originally I started with [cri-o](https://cri-o.io/). However, using
cri-o together with kubevirt and calico results in an overlayfs being
placed on / of the host, which breaks the full host functionality (see
below for details).
Docker, while being deprecated in kubernetes, at least allows me to get
kubevirt running, generally speaking.
### Networking: IPv6 only, calico
I wanted to go with [cilium](https://cilium.io/) first, because it
goes down the eBPF route from the get go. However cilium does not yet
offer native and automated BGP peering with the upstream
infrastructure, so managing node / IP network peering becomes a
tedious, manual and error-prone task. Cilium is on the way to improve
this, but is not there yet.
[Calico](https://www.projectcalico.org/) on the other hand still
relies on ip(6)tables and kube-proxy for forwarding traffic, but has
had proper BGP support for a long time. Calico is also working on eBPF
support, however that does not support IPv6 yet (bummer!).
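To illustrate what that BGP support looks like on the calico side, here
is a minimal sketch using calico's BGPPeer and BGPConfiguration
resources. The AS number 65534 matches the one showing up in the bird
output further below; the peer address and the upstream AS number are
placeholders, not the actual values of this setup.

    # Sketch only: node AS, peer address and peer AS have to match your network
    cat <<EOF | calicoctl apply -f -
    apiVersion: projectcalico.org/v3
    kind: BGPConfiguration
    metadata:
      name: default
    spec:
      asNumber: 65534
      nodeToNodeMeshEnabled: true
    ---
    apiVersion: projectcalico.org/v3
    kind: BGPPeer
    metadata:
      name: upstream-router-1
    spec:
      peerIP: "2a0a:e5c0:13::42"   # placeholder address of the upstream router
      asNumber: 65533              # placeholder AS of the upstream router
    EOF
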
### Storage: rook
[Rook](https://rook.io/) seems to be the first choice if you look at
who is providing storage in the k8s world. It looks rather
proper, even though some knobs are not yet clear to me.
Rook, in my opinion, is a direct alternative to running cephadm, which
requires systemd running on your hosts. Which, given Alpine Linux,
will never be the case.
### Virtualisation
[Kubevirt](https://kubevirt.io/) seems to provide a good
interface. Mid term, kubevirt is projected to replace
[OpenNebula](https://opennebula.io/) at
[ungleich](https://ungleich.ch).
## Challenges
### cri-o + calico + kubevirt = broken host
So this is a rather funky one. If you deploy cri-o and calico,
everything works. If you then deploy kubevirt, the **virt-handler**
pod fails to come up with the error message

    Error: path "/var/run/kubevirt" is mounted on "/" but it is not a shared mount.

On the Internet there are two recommendations to fix this:

* Fix the systemd unit for docker: obviously, as this setup uses
  neither systemd nor docker, this is not applicable...
* Issue **mount --make-shared /**

The second recommendation has a very strange side effect: after issuing
that command, the contents of a calico pod are mounted as an overlayfs
**on / of the host**. This covers /proc and thus things like **ps**,
**mount** and co. fail and basically the whole system becomes unusable
until reboot.
This is fully reproducible. I first suspected the tmpfs on / to be the
issue, used some disks instead of booting over the network to check it,
and even a regular ext4 on / causes the exact same problem.
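To see what the propagation flags look like on a host before and after
playing with this, findmnt (from util-linux, available on Alpine) can
display them; a small sketch:

    # Show the propagation flag of the mounts that kubevirt and calico care about
    findmnt -o TARGET,PROPAGATION /
    findmnt -o TARGET,PROPAGATION /sys
    findmnt -o TARGET,PROPAGATION /run

    # Revert an accidental "mount --make-shared /" (effective until reboot)
    mount --make-private /
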
### docker + calico + kubevirt = other shared mounts
Now, given that cri-o + calico + kubevirt does not lead to the
expected result, what does the same setup with docker look like? The
calico node pods with docker fail to come up if /sys is not
shared mounted, and the virt-handler pods fail if /run is not shared
mounted.
Issuing the following commands makes both work:

    mount --make-shared /sys
    mount --make-shared /run

Two funky findings: the paths are totally different between docker and
cri-o, even though the mapped hostpaths in the pod descriptions are the
same. And why is /sys not being shared a problem for calico under
docker, but not under cri-o?
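These flags do not survive a reboot, so they have to be applied again
at boot time. On Alpine one way to do that (a sketch, not necessarily
how this will be solved in the end; the file name is arbitrary) is a
local.d start script:

    # Run the mount commands via OpenRC's "local" service at every boot
    cat > /etc/local.d/shared-mounts.start <<'EOF'
    #!/bin/sh
    mount --make-shared /sys
    mount --make-shared /run
    EOF
    chmod +x /etc/local.d/shared-mounts.start
    rc-update add local default
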
## Log
### Status 2021-06-07
Today I have updated the ceph cluster definition in rook (the relevant
settings are sketched below) to

* check hosts every 10 minutes instead of every 60 minutes for new disks
* use IPv6 instead of IPv4
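Roughly, these are the two knobs involved. Treat this as a sketch, as
the names might differ between rook versions: the discovery interval is
an environment variable of the rook operator, the IP family a field of
the CephCluster resource.

    # Sketch: check hosts for new disks every 10 minutes instead of the default 60m
    kubectl -n rook-ceph set env deploy/rook-ceph-operator \
        ROOK_DISCOVER_DEVICES_INTERVAL=10m

    # Sketch: in cluster.yaml (the CephCluster resource) the IP family is set via
    #   network:
    #     ipFamily: "IPv6"
    # and then re-applied:
    kubectl apply -f cluster.yaml
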
The successful ceph -s output:

    [20:42] server47.place7:~/ungleich-k8s/rook# kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph -s
      cluster:
        id:     049110d9-9368-4750-b3d3-6ca9a80553d7
        health: HEALTH_WARN
                mons are allowing insecure global_id reclaim

      services:
        mon: 3 daemons, quorum a,b,d (age 75m)
        mgr: a(active, since 74m), standbys: b
        osd: 6 osds: 6 up (since 43m), 6 in (since 44m)

      data:
        pools:   2 pools, 33 pgs
        objects: 6 objects, 34 B
        usage:   37 MiB used, 45 GiB / 45 GiB avail
        pgs:     33 active+clean

The result is a working ceph cluster with RBD support. I also applied
the cephfs manifest, however RWX volumes (ReadWriteMany) are not yet
spinning up. It seems that the test [helm charts](https://artifacthub.io/)
I use often require RWX instead of RWO (ReadWriteOnce) access.
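For reference, this is roughly the kind of claim such charts create and
what has to work once cephfs is up; the storage class name
**rook-cephfs** is an assumption, it is whatever the cephfs manifest
defines:

    # Sketch of a ReadWriteMany claim; "rook-cephfs" is an assumed StorageClass name
    cat <<EOF | kubectl apply -f -
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: rwx-test
    spec:
      accessModes:
        - ReadWriteMany
      resources:
        requests:
          storage: 1Gi
      storageClassName: rook-cephfs
    EOF
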
Also the ceph dashboard does not come up, even though it is
configured:

    [20:44] server47.place7:~# kubectl -n rook-ceph get svc
    NAME                       TYPE        CLUSTER-IP              EXTERNAL-IP   PORT(S)             AGE
    csi-cephfsplugin-metrics   ClusterIP   2a0a:e5c0:13:e2::760b   <none>        8080/TCP,8081/TCP   82m
    csi-rbdplugin-metrics      ClusterIP   2a0a:e5c0:13:e2::482d   <none>        8080/TCP,8081/TCP   82m
    rook-ceph-mgr              ClusterIP   2a0a:e5c0:13:e2::6ab9   <none>        9283/TCP            77m
    rook-ceph-mgr-dashboard    ClusterIP   2a0a:e5c0:13:e2::5a14   <none>        7000/TCP            77m
    rook-ceph-mon-a            ClusterIP   2a0a:e5c0:13:e2::c39e   <none>        6789/TCP,3300/TCP   83m
    rook-ceph-mon-b            ClusterIP   2a0a:e5c0:13:e2::732a   <none>        6789/TCP,3300/TCP   81m
    rook-ceph-mon-d            ClusterIP   2a0a:e5c0:13:e2::c658   <none>        6789/TCP,3300/TCP   76m

    [20:44] server47.place7:~# curl http://[2a0a:e5c0:13:e2::5a14]:7000
    curl: (7) Failed to connect to 2a0a:e5c0:13:e2::5a14 port 7000: Connection refused
    [20:45] server47.place7:~#

The ceph mgr is perfectly reachable though:

    [20:45] server47.place7:~# curl -s http://[2a0a:e5c0:13:e2::6ab9]:9283/metrics | head
    # HELP ceph_health_status Cluster health status
    # TYPE ceph_health_status untyped
    ceph_health_status 1.0
    # HELP ceph_mon_quorum_status Monitors in quorum
    # TYPE ceph_mon_quorum_status gauge
    ceph_mon_quorum_status{ceph_daemon="mon.a"} 1.0
    ceph_mon_quorum_status{ceph_daemon="mon.b"} 1.0
    ceph_mon_quorum_status{ceph_daemon="mon.d"} 1.0
    # HELP ceph_fs_metadata FS Metadata

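What I will check next for the dashboard, sketched below: whether the
mgr dashboard module is actually enabled and listening, and whether the
CephCluster dashboard settings match the service port 7000, which is
the port rook seems to use when ssl is disabled.

    # Is the dashboard module enabled and does the mgr report a dashboard service?
    kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph mgr module ls | head
    kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph mgr services

    # In cluster.yaml the dashboard is configured roughly like this:
    #   dashboard:
    #     enabled: true
    #     ssl: false
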
### Status 2021-06-06
Today is the first day of publishing these findings, so this blog
article still lacks quite some information. If you are curious and want
to know more than what is published so far, you can find me on Matrix
in the **#hacking:ungleich.ch** room.

#### What works so far

* Spawning IPv6 only pods works
* Spawning IPv6 only services works (a sketch of such a service follows
  after the bird output below)
* BGP peering and ECMP routes with the upstream infrastructure work

Here's the output of the upstream bird process for the routes learned
from k8s:

    bird> show route
    Table master6:
    2a0a:e5c0:13:e2::/108 unicast [place7-server1 23:45:21.589] * (100) [AS65534i]
            via 2a0a:e5c0:13:0:225:b3ff:fe20:3554 on eth0
            unicast [place7-server3 2021-06-05] (100) [AS65534i]
            via 2a0a:e5c0:13:0:224:81ff:fee0:db7a on eth0
            unicast [place7-server4 2021-06-05] (100) [AS65534i]
            via 2a0a:e5c0:13:0:225:b3ff:fe20:3564 on eth0
            unicast [place7-server2 2021-06-05] (100) [AS65534i]
            via 2a0a:e5c0:13:0:225:b3ff:fe20:38cc on eth0
    2a0a:e5c0:13:e1:176b:eaa6:6d47:1c40/122 unicast [place7-server1 23:45:21.589] * (100) [AS65534i]
            via 2a0a:e5c0:13:0:225:b3ff:fe20:3554 on eth0
            unicast [place7-server4 23:45:21.591] (100) [AS65534i]
            via 2a0a:e5c0:13:0:225:b3ff:fe20:3564 on eth0
            unicast [place7-server3 23:45:21.591] (100) [AS65534i]
            via 2a0a:e5c0:13:0:224:81ff:fee0:db7a on eth0
            unicast [place7-server2 23:45:21.589] (100) [AS65534i]
            via 2a0a:e5c0:13:0:225:b3ff:fe20:38cc on eth0
    2a0a:e5c0:13:e1:e0d1:d390:343e:8480/122 unicast [place7-server1 23:45:21.589] * (100) [AS65534i]
            via 2a0a:e5c0:13:0:225:b3ff:fe20:3554 on eth0
            unicast [place7-server3 2021-06-05] (100) [AS65534i]
            via 2a0a:e5c0:13:0:224:81ff:fee0:db7a on eth0
            unicast [place7-server4 2021-06-05] (100) [AS65534i]
            via 2a0a:e5c0:13:0:225:b3ff:fe20:3564 on eth0
            unicast [place7-server2 2021-06-05] (100) [AS65534i]
            via 2a0a:e5c0:13:0:225:b3ff:fe20:38cc on eth0
    2a0a:e5c0:13::/48 unreachable [v6 2021-05-16] * (200)
    2a0a:e5c0:13:e1:9b19:7142:bebb:4d80/122 unicast [place7-server1 23:45:21.589] * (100) [AS65534i]
            via 2a0a:e5c0:13:0:225:b3ff:fe20:3554 on eth0
            unicast [place7-server3 2021-06-05] (100) [AS65534i]
            via 2a0a:e5c0:13:0:224:81ff:fee0:db7a on eth0
            unicast [place7-server4 2021-06-05] (100) [AS65534i]
            via 2a0a:e5c0:13:0:225:b3ff:fe20:3564 on eth0
            unicast [place7-server2 2021-06-05] (100) [AS65534i]
            via 2a0a:e5c0:13:0:225:b3ff:fe20:38cc on eth0
    bird>

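As promised above, a sketch of what "IPv6 only services" means in
practice; the **ipFamilies**/**ipFamilyPolicy** fields are the upstream
kubernetes dual-stack fields, and the **echo** name and selector are
made up for the example:

    # Sketch: request a single-stack IPv6 ClusterIP service for a made-up "echo" app
    cat <<EOF | kubectl apply -f -
    apiVersion: v1
    kind: Service
    metadata:
      name: echo
    spec:
      ipFamilyPolicy: SingleStack
      ipFamilies:
        - IPv6
      selector:
        app: echo
      ports:
        - port: 80
          targetPort: 8080
    EOF
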
#### What doesn't work

* Rook does not format/spin up all disks
* Deleting all rook components fails (**kubectl delete -f cluster.yaml**
  hangs forever)
* Spawning VMs fails with **error: unable to recognize "vmi.yaml": no matches for kind "VirtualMachineInstance" in version "kubevirt.io/v1"**
  (a first check for this is sketched below)
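For the kubevirt error, the first thing to check is which API versions
the installed VirtualMachineInstance CRD actually serves, roughly like
this:

    # Which VirtualMachineInstance API versions does this kubevirt install serve?
    kubectl api-resources | grep -i virtualmachine
    kubectl get crd virtualmachineinstances.kubevirt.io \
        -o jsonpath='{.spec.versions[*].name}'

If kubevirt.io/v1 is not among them, the apiVersion in vmi.yaml has to
be adjusted to one that is (or kubevirt has to be updated).
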
[[!tag kubernetes ipv6]]