From 251677cf5765c208716c0ef4baeb77560b958f15 Mon Sep 17 00:00:00 2001
From: Nico Schottelius
Date: Sat, 1 May 2021 11:34:11 +0200
Subject: [PATCH] ++ three new blog articles

---
 .../contents.lr | 161 +++++++++++++
 .../contents.lr | 219 ++++++++++++++++++
 .../contents.lr | 161 +++++++++++++
 3 files changed, 541 insertions(+)
 create mode 100644 content/u/blog/datacenterlight-active-active-routing/contents.lr
 create mode 100644 content/u/blog/datacenterlight-ipv6-only-netboot/contents.lr
 create mode 100644 content/u/blog/datacenterlight-spring-network-cleanup/contents.lr

diff --git a/content/u/blog/datacenterlight-active-active-routing/contents.lr b/content/u/blog/datacenterlight-active-active-routing/contents.lr
new file mode 100644
index 0000000..b203cb4
--- /dev/null
+++ b/content/u/blog/datacenterlight-active-active-routing/contents.lr
@@ -0,0 +1,161 @@
+title: Active-Active Routing Paths in Data Center Light
+---
+pub_date: 2019-11-08
+---
+author: Nico Schottelius
+---
+twitter_handle: NicoSchottelius
+---
+_hidden: no
+---
+_discoverable: no
+---
+abstract:
+
+---
+body:
+
+From our last two blog articles (a, b) you probably already know that
+it is spring network cleanup in [Data Center Light](https://datacenterlight.ch).
+
+In the [first blog article]() we described where we started and in
+the [second blog article]() you could see how we switched our
+infrastructure to IPv6 only netboot.
+
+In this article we dive a bit deeper into the details of our
+network architecture and the problems we face with active-active
+routers.
+
+## Network architecture
+
+Let's have a look at a simplified (!) diagram of the network:
+
+... IMAGE
+
+Doesn't look that simple, does it? Let's break it down into small
+pieces.
+
+## Upstream routers
+
+We have a set of **upstream routers** which are stateless. They don't
+have any stateful firewall rules, so both of them can work actively
+without state synchronisation. These are fast routers and besides
+forwarding, they also do **BGP peering** with our data center
+upstreams.
+
+Overall, the upstream routers are very simple machines, mostly running
+bird and forwarding packets all day. They also provide a DNS service
+(resolving and authoritative), because they are always up and can
+announce service IPs via BGP or via OSPF to our network.
+
+## Internal routers
+
+The internal routers on the other hand provide **stateful routing**,
+**IP address assignments** and **netboot services**. They are a bit
+more complicated compared to the upstream routers, but they carry only
+a small routing table.
+
+## Communication between the routers
+
+All routers employ OSPF and BGP for route exchange. Thus the two
+upstream routers learn about the internal networks (IPv6 only, as
+usual) from the internal routers.
+
+## Sessions
+
+Sessions in networking are almost always evil. You need to store
+them (at high speed), you need to maintain them (updating, deleting)
+and if you run multiple routers, you even need to synchronise them.
+
+In our case the internal routers do have session handling, as they
+provide a stateful firewall. As we are using a multi-router setup,
+things can go really wrong if the wrong routes are being used.
+
+Let's have a look at this in a bit more detail.
+
+## The good path
+
+IMAGE2: good
+
+If a server sends out a packet via router1 and router1 eventually
+receives the answer, everything is fine. The returning packet matches
+the state entry that was created by the outgoing packet and the
+internal router forwards the packet.
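+In nftables terms, the forwarding policy on the internal routers
+conceptually looks like this (a minimal sketch, not our production
+ruleset; the interface name is made up):
+
+```
+table inet filter {
+    chain forward {
+        type filter hook forward priority 0; policy drop;
+
+        # replies that belong to a known session may pass
+        ct state established,related accept
+
+        # new sessions may only be opened from the inside
+        iifname "bond0.5" ct state new accept
+    }
+}
+```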
+## The bad path
+
+IMAGE3: bad
+
+However, if the answer to a packet that left via router1 comes back
+via router2, router2 has no matching state entry and drops the
+returning packet.
+
+## Routing paths
+
+If we want to go active-active routing, the server can choose between
+either internal router for sending out the packet. The internal
+routers again have two upstream routers. So with the return path
+included, the following paths exist for a packet:
+
+Outgoing paths:
+
+* servers->router1->upstream router1->internet
+* servers->router1->upstream router2->internet
+* servers->router2->upstream router1->internet
+* servers->router2->upstream router2->internet
+
+And the returning paths are:
+
+* internet->upstream router1->router1->servers
+* internet->upstream router1->router2->servers
+* internet->upstream router2->router1->servers
+* internet->upstream router2->router2->servers
+
+So on average, 50% of the returning packets will hit the right router.
+However, as neither the servers nor the upstream routers use load
+balancing like ECMP, once an incorrect return path has been chosen,
+the packet loss for that session is 100%.
+
+## Session synchronisation
+
+In the first article we talked a bit about keepalived and that
+it helps to operate routers in an active-passive mode. This did not
+turn out to be the most reliable method. Can we do better with
+active-active routers and session synchronisation?
+
+Linux supports this using
+[conntrackd](http://conntrack-tools.netfilter.org/). However,
+conntrackd supports active-active routers on a **flow based** level,
+but not on a **packet based** level. The difference is that the
+following will not work in active-active routers with conntrackd:
+
+```
+#1 Packet (in the original direction) updates state in Router R1 ->
+   submit state to R2
+#2 Packet (in the reply direction) arrive to Router R2 before state
+   coming from R1 has been digested.
+
+With strict stateful filtering, Packet #2 will be dropped and it will
+trigger a retransmission.
+```
+(quote from Pablo Neira Ayuso, see below for more details)
+
+Some of you will mumble something like **latency** in their head right
+now. If the return packet is guaranteed to arrive after state
+synchronisation, then everything is fine. However, if the reply is
+faster than the state synchronisation, packets will get dropped.
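+To make this concrete: state replication between the two internal
+routers would be configured roughly like this in conntrackd (a
+minimal sketch, not our production configuration; the addresses and
+the dedicated sync interface eth2 are made up):
+
+```
+Sync {
+    # FTFW is conntrackd's reliable, acknowledged replication mode
+    Mode FTFW {
+    }
+    # replicate state changes to the peer router over a dedicated link
+    Multicast {
+        IPv4_address 225.0.0.50
+        IPv4_interface 192.168.100.1
+        Interface eth2
+        Group 3780
+    }
+}
+General {
+    Syslog on
+    UNIX {
+        Path /var/run/conntrackd.ctl
+    }
+}
+```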
+In reality, this works for packets coming from and going to the
+Internet. However, in our setup the upstream routers route between
+different data center locations, which are in the sub-microsecond
+latency range - i.e. LAN speed - because they are interconnected with
+dark fiber links.
+
+
+## Take away
+
+Before moving on to the next blog article, we would like to express
+our thanks to Pablo Neira Ayuso, who gave very important input on
+session based firewalls and session synchronisation.
+
+So active-active routing does not seem to have a straightforward
+solution. Read in the [next blog article](/) how we solved the
+challenge in the end.
diff --git a/content/u/blog/datacenterlight-ipv6-only-netboot/contents.lr b/content/u/blog/datacenterlight-ipv6-only-netboot/contents.lr
new file mode 100644
index 0000000..0d42d34
--- /dev/null
+++ b/content/u/blog/datacenterlight-ipv6-only-netboot/contents.lr
@@ -0,0 +1,219 @@
+title: IPv6 only netboot in Data Center Light
+---
+pub_date: 2021-05-01
+---
+author: Nico Schottelius
+---
+twitter_handle: NicoSchottelius
+---
+_hidden: no
+---
+_discoverable: no
+---
+abstract:
+How we switched from IPv4 netboot to IPv6 netboot
+---
+body:
+
+In our [previous blog
+article](/u/blog/datacenterlight-spring-network-cleanup)
+we wrote about our motivation for the
+big spring network cleanup. In this blog article we show how we
+started reducing the complexity by removing our dependency on IPv4.
+
+## IPv6 first
+
+If you have found our blog before, you are probably aware: everything
+at ungleich is IPv6 first. Many of our networks are IPv6 only, all DNS
+entries for remote access have IPv6 (AAAA) entries and there are only
+rare exceptions where we utilise IPv4.
+
+## IPv4 only netboot
+
+One of the big exceptions to this paradigm used to be how we boot our
+servers. Because our second big paradigm is sustainability, we use a
+lot of 2nd (or 3rd) generation hardware. We actually share this
+passion with our friends from
+[e-durable](https://recycled.cloud/), because sustainability is
+something that we need to practise today and not tomorrow.
+But back to the netbooting topic: so far, we mainly relied on onboard
+network cards for netbooting.
+
+## Onboard network cards
+
+We used these network cards for multiple reasons:
+
+* they exist in virtually any server
+* they usually have a ROM containing a PXE capable firmware
+* they allow us to split real traffic (on the fiber cards) from
+  internal traffic
+
+However, using the onboard devices also comes with a couple of
+disadvantages:
+
+* Their ROM is often outdated
+* It requires additional cabling
+
+## Cables
+
+Let's have a look at the cabling situation first. Virtually all of
+our servers are connected to the network using 2x 10 Gbit/s fiber cards.
+
+On one side this provides a fast connection, but on the other side
+it provides us with something even better: distance.
+
+Our data centers employ a non-standard design due to the re-use of
+existing factory halls. This means distances between servers and
+switches can be up to 100m. With fiber, we can easily achieve these
+distances.
+
+Additionally, having fewer cables gives us a simpler infrastructure
+that is easier to analyse.
+
+## Reducing complexity 1
+
+So can we somehow get rid of the copper cables and switch to fiber
+only? It turns out that the fiber cards we use (mainly Intel X520's)
+have their own ROM. So we started disabling the onboard network cards
+and tried booting from the fiber cards. This worked until we wanted to
+move the lab setup to production...
+
+## Bonding (LACP) and VLAN tagging
+
+Our servers use bonding (802.3ad) for redundant connections to the
+switches and VLAN tagging on top of the bonded devices to isolate
+client traffic, as sketched below.
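+On the Linux side, this kind of setup can be created with iproute2
+roughly as follows (a sketch with assumed interface names eth0/eth1
+and VLAN id 10, not our actual configuration management output):
+
+```
+# create an LACP (802.3ad) bond and enslave both fiber ports;
+# slaves must be down before they can be added to the bond
+ip link add bond0 type bond mode 802.3ad
+ip link set eth0 down
+ip link set eth1 down
+ip link set eth0 master bond0
+ip link set eth1 master bond0
+
+# VLAN tagging happens on top of the bonded device
+ip link add link bond0 name bond0.10 type vlan id 10
+
+ip link set bond0 up
+ip link set bond0.10 up
+```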
+On the switch side we realised this using configurations like:
+
+```
+interface Port-Channel33
+   switchport mode trunk
+   mlag 33
+
+...
+interface Ethernet33
+   channel-group 33 mode active
+```
+
+But that does not work if the network ROM at boot does not create an
+LACP enabled link on top of which it would do the VLAN tagging.
+
+The ROM in our network cards **would** have allowed VLAN tagging
+alone though.
+
+To fix this problem, we reconfigured our switches as follows:
+
+```
+interface Port-Channel33
+   switchport trunk native vlan 10
+   switchport mode trunk
+   port-channel lacp fallback static
+   port-channel lacp fallback timeout 20
+   mlag 33
+```
+
+This basically does two things:
+
+* If there are no LACP frames, fall back to a static (non-LACP)
+  configuration
+* Accept untagged traffic and map it to VLAN 10 (one of our boot
+  networks)
+
+Great, our servers can now netboot from fiber! But we are not done
+yet...
+
+## IPv6 only netbooting
+
+So how do we convince these network cards to do IPv6 netboot? Can we
+actually do that at all? Our first approach was to put a custom build of
+[ipxe](https://ipxe.org/) on a USB stick. We generated that
+ipxe image using the **rebuild-ipxe.sh** script
+from the
+[ungleich-tools](https://code.ungleich.ch/ungleich-public/ungleich-tools)
+repository. It turns out that using a USB stick works pretty well for
+most situations.
+
+## ROMs are not ROMs
+
+As you can imagine, the ROM of the X520 cards does not contain IPv6
+netboot support. So are we back at square one? No, we are not. Because
+the X520's have something that the onboard devices did not
+consistently have: **a rewritable memory area**.
+
+Let's take two steps back here first: a ROM is a **read only memory**
+chip. Emphasis on **read only**. However, modern network cards and a
+lot of devices that support on-device firmware do actually have a
+memory (flash) area that can be written to. And that is what aids us
+in our situation.
+
+## ipxe + flbtool + x520 = fun
+
+Trying to write ipxe into the X520 cards initially failed, because the
+network card did not recognise the format of the ipxe rom file.
+
+Luckily the folks in the ipxe community already spotted that problem
+AND fixed it: the format used in these cards is called FLB. And there
+is [flbtool](https://github.com/devicenull/flbtool/), which allows you
+to wrap the ipxe rom file into the FLB format. For those who want to
+try it yourself (at your own risk!), it basically involves:
+
+* Get the current ROM from the card (try bootutil64e)
+* Extract the contents from the ROM using flbtool
+* This will output some sections/parts
+* Locate the part that you want to overwrite with iPXE (an existing
+  PXE section is very suitable)
+* Replace the .bin file with your iPXE rom
+* Adjust the .json file to match the length of the new binary
+* Build a new .flb file using flbtool
+* Flash it onto the card
+
+While this is a bit of work, it is worth it for us, because...
+
+## IPv6 only netboot over fiber
+
+With the modified ROM, basically loading iPXE at start, we can now
+boot our servers in IPv6 only networks. On our infrastructure side, we
+added two **tiny** things:
+
+We use ISC dhcp with the following configuration file:
+
+```
+option dhcp6.bootfile-url code 59 = string;
+
+option dhcp6.bootfile-url "http://[2a0a:e5c0:0:6::46]/ipxescript";
+
+subnet6 2a0a:e5c0:0:6::/64 {}
+```
+
+(that is the complete configuration!)
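+The bootfile-url above points to an iPXE script served over plain
+HTTP. As a hypothetical minimal sketch (the kernel and initramfs
+paths are made up), such a script could look like this:
+
+```
+#!ipxe
+# fetch kernel and initramfs over IPv6 HTTP, then boot
+kernel http://[2a0a:e5c0:0:6::46]/vmlinuz console=ttyS0
+initrd http://[2a0a:e5c0:0:6::46]/initramfs
+boot
+```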
+And we use radvd to announce the "other configuration" flag,
+indicating that clients can query the DHCPv6 server for additional
+information:
+
+```
+interface bond0.10
+{
+    AdvSendAdvert on;
+    MinRtrAdvInterval 3;
+    MaxRtrAdvInterval 5;
+    AdvDefaultLifetime 600;
+
+    # IPv6 netbooting
+    AdvOtherConfigFlag on;
+
+    prefix 2a0a:e5c0:0:6::/64 { };
+
+    RDNSS 2a0a:e5c0:0:a::a 2a0a:e5c0:0:a::b { AdvRDNSSLifetime 6000; };
+    DNSSL place5.ungleich.ch { AdvDNSSLLifetime 6000; };
+};
+```
+
+## Take away
+
+Being able to reduce cables was one big advantage in the beginning.
+
+Switching to IPv6 only netboot does not seem like a big simplification
+at first, besides being able to remove IPv4 in server
+networks.
+
+However, as you will see in
+[the next blog post](/u/blog/datacenterlight-active-active-routing/),
+switching to IPv6 only netbooting is actually a key element in
+reducing complexity in our network.
diff --git a/content/u/blog/datacenterlight-spring-network-cleanup/contents.lr b/content/u/blog/datacenterlight-spring-network-cleanup/contents.lr
new file mode 100644
index 0000000..073b4b1
--- /dev/null
+++ b/content/u/blog/datacenterlight-spring-network-cleanup/contents.lr
@@ -0,0 +1,161 @@
+title: Data Center Light: Spring network cleanup
+---
+pub_date: 2021-05-01
+---
+author: Nico Schottelius
+---
+twitter_handle: NicoSchottelius
+---
+_hidden: no
+---
+_discoverable: no
+---
+abstract:
+An introduction to our spring network cleanup: where we started and
+what the typical setup in our data center used to be
+---
+body:
+
+## Introduction
+
+Spring is the time for cleanup: cleaning up your apartment, removing
+dust from the cabinet, letting the light shine through the windows,
+or, like in our case, improving the networking situation.
+
+In this article we give an introduction to where we started and what
+the typical setup in our data center used to be.
+
+## Best practice
+
+When we started [Data Center Light](https://datacenterlight.ch) in
+2017, we oriented ourselves towards "best practice" networking. We
+started with IPv6 only networks and used an RFC1918 network (10/8)
+for internal IPv4 routing.
+
+And we started with two routers for every network to provide
+redundancy.
+
+## Router redundancy
+
+So what do you do when you have two routers? In the Linux world the
+software [keepalived](https://keepalived.org/)
+is very popular for providing redundant routing
+using the [VRRP protocol](https://en.wikipedia.org/wiki/Virtual_Router_Redundancy_Protocol).
+
+## Active-Passive
+
+While VRRP is designed to allow multiple (not only two) routers to
+co-exist in a network, its design is basically active-passive: you
+have one active router and n passive routers, in our case one
+additional router.
+
+## Keepalived: a closer look
+
+A typical keepalived configuration in our network looked like this:
+
+```
+vrrp_instance router_v4 {
+    interface INTERFACE
+    virtual_router_id 2
+    priority PRIORITY
+    advert_int 1
+    virtual_ipaddress {
+        10.0.0.1/22 dev eth1.5 # Internal
+    }
+    notify_backup "/usr/local/bin/vrrp_notify_backup.sh"
+    notify_fault "/usr/local/bin/vrrp_notify_fault.sh"
+    notify_master "/usr/local/bin/vrrp_notify_master.sh"
+}
+
+vrrp_instance router_v6 {
+    interface INTERFACE
+    virtual_router_id 1
+    priority PRIORITY
+    advert_int 1
+    virtual_ipaddress {
+        2a0a:e5c0:1:8::48/128 dev eth1.8 # Transfer for routing from outside
+        2a0a:e5c0:0:44::7/64 dev bond0.18 # zhaw
+        2a0a:e5c0:2:15::7/64 dev bond0.20 #
+    }
+}
+```
+
+This is a template that we distribute via [cdist](https://cdi.st). The
+strings INTERFACE and PRIORITY are replaced via cdist. The interface
+field defines which interface to use for VRRP communication and the
+priority field determines which of the routers is the active one.
+
+So far, so good. However, let's have a look at a tiny detail of this
+configuration file:
+
+```
+    notify_backup "/usr/local/bin/vrrp_notify_backup.sh"
+    notify_fault "/usr/local/bin/vrrp_notify_fault.sh"
+    notify_master "/usr/local/bin/vrrp_notify_master.sh"
+```
+
+These three lines basically say: "start something if you are the
+master" and "stop something in case you are not". And why did we do
+this? Because of stateful services.
+
+## Stateful services
+
+A typical shell script that we would call contains lines like this:
+
+```
+/etc/init.d/radvd stop
+/etc/init.d/dhcpd stop
+```
+(or start in the case of the master version)
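+Spelled out, the backup and master notify scripts might look roughly
+like this (a sketch of the idea, not our literal production scripts):
+
+```
+#!/bin/sh
+# vrrp_notify_backup.sh: we lost mastership, stop stateful services
+/etc/init.d/radvd stop
+/etc/init.d/dhcpd stop
+```
+
+```
+#!/bin/sh
+# vrrp_notify_master.sh: we are the active router now, start serving
+/etc/init.d/radvd start
+/etc/init.d/dhcpd start
+```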
+In earlier days, this even contained openvpn, which was running on our
+first generation routers. But more about OpenVPN later.
+
+The reason why we stopped and started dhcpd and radvd is to make the
+clients of the network use the active router. We used radvd to provide
+IPv6 addresses as the primary access method to servers. And we used
+dhcp mainly to allow servers to netboot. The active router would
+carry state (firewall!) and thus the flow of packets always needs to
+go through the active router.
+
+Restarting radvd on a different machine keeps the IPv6 addresses the
+same, as clients assign them themselves using EUI-64. In the case of
+dhcp (IPv4) we could have used hardcoded IPv4 addresses, using a
+mapping of MAC address to IPv4 address, but we opted against this. The
+main reason is that dhcp clients re-request their previous lease, and
+even if an IPv4 address changes, it is not really of importance.
+
+During a failover this would lead to a few seconds of interruption and
+to re-establishing sessions. Given that routers are usually rather
+stable and restarting them is not a daily task, we initially accepted
+this.
+
+## Keepalived/VRRP changes
+
+One of the more tricky things is changes to keepalived. Because
+keepalived uses the *number of addresses and routes* to verify
+that a received VRRP packet matches its configuration, adding or
+deleting IP addresses and routes causes a problem:
+
+While one router is being updated, the number of IP addresses or
+routes differs between the two. This causes both routers to ignore
+the other's VRRP messages, and both routers think they should be the
+master process.
+
+This leads to the problem that both routers receive client and outside
+traffic. The firewall (nftables) then fails to recognise returning
+packets if they were sent out by router1 but received back by router2,
+and, because nftables is configured *stateful*, it drops the returning
+packet.
+
+However, not only changes to the configuration can trigger this
+problem, but also any communication problem between the two
+routers. Since 2017 we have experienced multiple times that keepalived
+was unable to receive or send messages from or to the other router and
+thus both routers again became the master process.
+
+## Take away
+
+While in theory keepalived should improve reliability, in practice the
+number of problems we had due to double master situations made us
+question whether the keepalived concept is the fitting one for us.
+
+You can read how we evolved from this setup in
+[the next blog article](/u/blog/datacenterlight-ipv6-only-netboot/).