Commit 251677cf authored by Nico Schottelius

++ three new blog articles

parent 5d05a28e
title: Active-Active Routing Paths in Data Center Light
pub_date: 2019-11-08
author: Nico Schottelius
twitter_handle: NicoSchottelius
_hidden: no
_discoverable: no
From our last two blog articles (a, b) you probably already know that
it is spring network cleanup in [Data Center Light](
In the [first blog article]() we described where we started and in
the [second blog article]() you could see how we switched our
infrastructure to IPv6 only netboot.
In this article we will dive a bit more into the details of our
network architecture and the problems we face with active-active
routing.
## Network architecture
Let's have a look at a simplified (!) diagram of the network:
Doesn't look that simple, does it? Let's break it down into small
pieces.
## Upstream routers
We have a set of **upstream routers** which work statelessly. They don't
have any stateful firewall rules, so both of them can work actively
without state synchronisation. These are fast routers and, besides
forwarding packets, they also do **BGP peering** with the data center
upstreams.
Overall, the upstream routers are very simple machines, mostly running
bird and forwarding packets all day. They also provide a DNS service
(resolving and authoritative), because they are always up and can
announce service IPs via BGP or via OSPF to our network.
## Internal routers
The internal routers on the other hand provide **stateful routing**,
**IP address assignments** and **netboot services**. They are a bit
more complicated compared to the upstream routers, but they carry only
a small routing table.
## Communication between the routers
All routers employ OSPF and BGP for route exchange. Thus the two
upstream routers learn about the internal networks (IPv6 only, as
usual) from the internal routers.
## Sessions
Sessions in networking are almost always an evil. You need to store
them (at high speed), you need to maintain them (updating, deleting)
and if you run multiple routers, you even need to synchronise them.
In our case the internal routers do have session handling, as they are
providing a stateful firewall. As we are using a multi router setup,
things can go really wrong if the wrong routes are being used.
Let's have a look at this a bit more in detail.
## The good path
IMAGE2: good
If a server sends out a packet via router1 and router1 eventually
receives the answer, everything is fine. The returning packet matches
the state entry that was created by the outgoing packet and the
internal router forwards the packet.
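To make the state matching a bit more tangible, here is a minimal sketch in Python. It is a toy model and not how nftables is implemented: the class names and the documentation addresses (`2001:db8::/32`) are assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Packet:
    src: str
    dst: str


@dataclass
class StatefulRouter:
    name: str
    # Flows seen in the outgoing direction, stored as (src, dst) pairs.
    states: set = field(default_factory=set)

    def forward_outgoing(self, packet: Packet) -> None:
        """Remember the flow so that the reply can be matched later."""
        self.states.add((packet.src, packet.dst))

    def accepts_reply(self, packet: Packet) -> bool:
        """A reply is accepted only if the reversed flow is known."""
        return (packet.dst, packet.src) in self.states


router1 = StatefulRouter("router1")
router2 = StatefulRouter("router2")

# Hypothetical addresses from the documentation prefix.
request = Packet(src="2001:db8::10", dst="2001:db8:ffff::1")
reply = Packet(src=request.dst, dst=request.src)

router1.forward_outgoing(request)
same_router = router1.accepts_reply(reply)    # router1 created the state
other_router = router2.accepts_reply(reply)   # router2 never saw the request
print(same_router, other_router)
```

In this model only the router that forwarded the outgoing packet holds a matching state entry, so only that router lets the reply through.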
## The bad path
IMAGE3: bad
However if the
## Routing paths
If we want to go active-active routing, the server can choose between
either internal router for sending out the packet. The internal
routers again have two upstream routers. So with the return path
included, the following paths exist for a packet:
Outgoing paths:
* servers->router1->upstream router1->internet
* servers->router1->upstream router2->internet
* servers->router2->upstream router1->internet
* servers->router2->upstream router2->internet
And the returning paths are:
* internet->upstream router1->router1->servers
* internet->upstream router1->router2->servers
* internet->upstream router2->router1->servers
* internet->upstream router2->router2->servers
So on average, 50% of the routes will hit the right router on
return. However servers as well as upstream routers are not using load
balancing like ECMP, so once an incorrect path has been chosen, the
packet loss is 100%.
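The 50% figure can be checked by simply enumerating the paths listed above. The following Python sketch assumes that every outgoing path can be combined with every returning path and that all combinations are equally likely, which is a simplification.

```python
from itertools import product

INTERNAL = ("router1", "router2")
UPSTREAM = ("upstream router1", "upstream router2")

# Each outgoing path is (internal router, upstream router),
# each returning path is (upstream router, internal router).
outgoing = list(product(INTERNAL, UPSTREAM))
returning = list(product(UPSTREAM, INTERNAL))

combinations = list(product(outgoing, returning))
matching = [
    (out, ret)
    for out, ret in combinations
    if out[0] == ret[1]  # the reply reaches the internal router holding the state
]

share = len(matching) / len(combinations)
print(len(combinations), len(matching), share)
```

Only the internal router matters for the firewall state; which upstream router is used in either direction does not change the result.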
## Session synchronisation
In the first article we talked a bit about keepalived and that
it helps to operate routers in an active-passive mode. This did not
turn out to be the most reliable method. Can we do better with
active-active routers and session synchronisation?
Linux supports this using
[conntrackd]( However,
conntrackd supports active-active routers on a **flow based** level,
but not on a **packet** based level. The difference is that the
following will not work in active-active routers with conntrackd:
#1 Packet (in the original direction) updates state in Router R1 ->
submit state to R2
#2 Packet (in the reply direction) arrive to Router R2 before state
coming from R1 has been digested.
With strict stateful filtering, Packet #2 will be dropped and it will
trigger a retransmission.
(quote from Pablo Neira Ayuso, see below for more details)
Some of you will mumble something like **latency** in your heads right
now. If the return packet is guaranteed to arrive after state
synchronisation, then everything is fine. However, if the reply is
faster than the state synchronisation, packets will get dropped.
In reality, this will work for packets coming from and going to the
Internet. However, in our setup the upstream routers also route between
different data center locations, which are in the sub-microsecond
latency range, i.e. LAN speed, because they are interconnected with
dark fiber links.
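The race between the reply and the state synchronisation boils down to a single comparison. The Python sketch below uses made-up delay values purely for illustration; they are assumptions and not measurements from our network.

```python
def reply_is_accepted(sync_delay_us: float, reply_delay_us: float) -> bool:
    """Return True if the state reached the second router before the reply.

    sync_delay_us:  time until R2 has digested the state sent by R1
    reply_delay_us: time until the reply packet arrives at R2
    """
    return sync_delay_us <= reply_delay_us


# Hypothetical numbers, for illustration only.
SYNC_DELAY_US = 200.0          # assumed state synchronisation delay
INTERNET_REPLY_US = 20_000.0   # assumed reply time of a host on the Internet
LOCAL_REPLY_US = 50.0          # assumed reply time of a host on a dark fiber link

internet_ok = reply_is_accepted(SYNC_DELAY_US, INTERNET_REPLY_US)
local_ok = reply_is_accepted(SYNC_DELAY_US, LOCAL_REPLY_US)
print(internet_ok, local_ok)
```

With these assumed values the reply from the Internet arrives long after the state, while the local reply overtakes it and would be dropped by a strict stateful filter.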
## Take away
Before moving on to the next blog article, we would like to express
our thanks to Pablo Neira Ayuso, who gave very important input for
session based firewalls and session synchronisation.
So active-active routing does not seem to have a straightforward
solution. Read in the [next blog article](/) how we solved the
challenge in the end.
title: IPv6 only netboot in Data Center Light
pub_date: 2021-05-01
author: Nico Schottelius
twitter_handle: NicoSchottelius
_hidden: no
_discoverable: no
How we switched from IPv4 netboot to IPv6 netboot
In our [previous blog
we wrote about our motivation for the
big spring network cleanup. In this blog article we show how we
started reducing the complexity by removing our dependency on IPv4.
## IPv6 first
If you found our blog, you are probably aware: everything at
ungleich is IPv6 first. Many of our networks are IPv6 only, all DNS
entries for remote access have IPv6 (AAAA) entries and there are only
rare exceptions when we utilise IPv4.
## IPv4 only Netboot
One of the big exceptions to this paradigm used to be how we boot our
servers. Because our second big paradigm is sustainability, we use a
lot of 2nd (or 3rd) generation hardware. We actually share this
passion with our friends from
[e-durable](, because sustainability is
something that we need to employ today and not tomorrow.
But back to the netbooting topic: For netbooting we mainly
relied on onboard network cards so far.
## Onboard network cards
We used these network cards for multiple reasons:
* they exist in virtually any server
* they usually have a ROM containing a PXE capable firmware
* it allows us to split real traffic to fiber cards and internal traffic
However using the onboard devices also comes with a couple of disadvantages:
* Their ROM is often outdated
* It requires additional cabling
## Cables
Let's have a look at the cabling situation first. Virtually all of
our servers are connected to the network using 2x 10 Gbit/s fiber cards.
On one side this provides a fast connection, but on the other side
it provides us with something even better: distances.
Our data centers employ a non-standard design due to the re-use of
existing factory halls. This means distances between servers and
switches can be up to 100m. With fiber, we can easily achieve these
distances. Additionally, having fewer cables provides a simpler
infrastructure that is easier to analyse.
## Reducing complexity 1
So can we somehow get rid of the copper cables and switch to fiber
only? It turns out that the fiber cards we use (mainly Intel X520's)
have their own ROM. So we started disabling the onboard network cards
and tried booting from the fiber cards. This worked until we wanted to
move the lab setup to production...
## Bonding (LACP) and VLAN tagging
Our servers use bonding (802.3ad) for redundant connections to the
switches and VLAN tagging on top of the bonded devices to isolate
client traffic. On the switch side we realised this using
configurations like
    interface Port-Channel33
       switchport mode trunk
       mlag 33
    interface Ethernet33
       channel-group 33 mode active
But that does not work if the network ROM at boot does not create an
LACP enabled link on top of which it should be doing VLAN tagging.
The ROM in our network cards **would** have allowed VLAN tagging alone
To fix this problem, we reconfigured our switches as follows:
    interface Port-Channel33
       switchport trunk native vlan 10
       switchport mode trunk
       port-channel lacp fallback static
       port-channel lacp fallback timeout 20
       mlag 33
This basically does two things:
* If there are no LACP frames, fall back to static (non-LACP)
* Accept untagged traffic and map it to VLAN 10 (one of our boot networks)
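The two effects can be written down as a small decision model. This Python sketch is a deliberately simplified reading of the configuration above and not a description of the switch firmware: the function names and the "waiting" state are assumptions, only the VLAN number and the timeout are taken from the configuration.

```python
from typing import Optional

NATIVE_VLAN = 10          # from "switchport trunk native vlan 10"
FALLBACK_TIMEOUT_S = 20   # from "port-channel lacp fallback timeout 20"


def port_channel_mode(lacp_frames_seen: bool, seconds_waited: float) -> str:
    """Simplified decision: which mode the port channel ends up in."""
    if lacp_frames_seen:
        return "lacp"
    if seconds_waited >= FALLBACK_TIMEOUT_S:
        return "static"
    return "waiting"


def effective_vlan(tag: Optional[int]) -> int:
    """Untagged frames (tag is None) are mapped to the native VLAN."""
    return NATIVE_VLAN if tag is None else tag


# A network ROM that sends neither LACP frames nor VLAN tags:
rom_mode = port_channel_mode(lacp_frames_seen=False, seconds_waited=20)
rom_vlan = effective_vlan(None)
print(rom_mode, rom_vlan)
```

A booted server that speaks LACP and tags its traffic takes the other branch in both functions, so the same port serves both phases.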
Great, our servers can now netboot from fiber! But we are not done
yet.
## IPv6 only netbooting
So how do we convince these network cards to do IPv6 netboot? Can we
actually do that at all? Our first approach was to put a custom build of
[ipxe]( on a USB stick. We generated that
ipxe image using **** script
from the
repository. Turns out using a USB stick works pretty well for most
## ROMs are not ROMs
As you can imagine, the ROM of the X520 cards does not contain IPv6
netboot support. So are we back at square 1? No, we are not. Because
the X520's have something that the onboard devices did not
consistently have: **a rewritable memory area**.
Let's take 2 steps back here first: A ROM is a **read only memory**
chip. Emphasis on **read only**. However, modern network cards and a
lot of devices that support on-device firmware do actually have a
memory (flash) area that can be written to. And that is what aids us
in our situation.
## ipxe + flbtool + x520 = fun
Trying to write ipxe into the X520 cards initially failed, because the
network card did not recognise the format of the ipxe rom file.
Luckily the folks in the ipxe community already spotted that problem
AND fixed it: The format used in these cards is called FLB. And there
is [flbtool](, which allows you
to wrap the ipxe rom file into the FLB format. For those who want to
try it themselves (at their own risk!), it basically involves:
* Get the current ROM from the card (try bootutil64e)
* Extract the contents from the rom using flbtool
  * This will output some sections/parts
* Locate one part that you want to overwrite with iPXE (a previous PXE
  section is very suitable)
* Replace the .bin file with your iPXE rom
* Adjust the .json file to match the length of the new binary
* Build a new .flb file using flbtool
* Flash it onto the card
While this is a bit of work, it is worth it for us, because...:
## IPv6 only netboot over fiber
With the modified ROM, basically loading iPXE at start, we can now
boot our servers in IPv6 only networks. On our infrastructure side, we
added two **tiny** things:
We use ISC dhcp with the following configuration file:
    option dhcp6.bootfile-url code 59 = string;
    option dhcp6.bootfile-url "http://[2a0a:e5c0:0:6::46]/ipxescript";
    subnet6 2a0a:e5c0:0:6::/64 {}
(that is the complete configuration!)
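A typo in an IPv6 literal is easy to make, so a quick sanity check of the boot URL can be useful. The following Python sketch is only an illustration and not part of our tooling; it parses the URL from the configuration above and checks that the server address lies inside the announced subnet.

```python
import ipaddress
from urllib.parse import urlsplit

BOOTFILE_URL = "http://[2a0a:e5c0:0:6::46]/ipxescript"
SUBNET = ipaddress.ip_network("2a0a:e5c0:0:6::/64")

parts = urlsplit(BOOTFILE_URL)
# urlsplit returns the hostname of an IPv6 literal without the brackets.
server = ipaddress.ip_address(parts.hostname)

in_subnet = server in SUBNET
print(server, in_subnet, parts.path)
```

If the address in the URL were malformed, `ipaddress.ip_address` would raise a `ValueError` instead of silently accepting it.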
And we used radvd to announce that other configuration information is
available, indicating that clients can actually query the DHCPv6 server:
    interface bond0.10
    {
        AdvSendAdvert on;
        MinRtrAdvInterval 3;
        MaxRtrAdvInterval 5;
        AdvDefaultLifetime 600;
        # IPv6 netbooting
        AdvOtherConfigFlag on;
        prefix 2a0a:e5c0:0:6::/64 { };
        RDNSS 2a0a:e5c0:0:a::a 2a0a:e5c0:0:a::b { AdvRDNSSLifetime 6000; };
        DNSSL { AdvDNSSLLifetime 6000; } ;
    };
## Take away
Being able to reduce cables was one big advantage in the beginning.
Switching to IPv6 only netboot does not seem like a big simplification
in the first place, besides being able to remove IPv4 in server
However as you will see in
[the next blog posts](/u/blog/datacenterlight-active-active-routing/),
switching to IPv6 only netbooting is actually a key element in
reducing complexity in our network.
title: Data Center Light: Spring network cleanup
pub_date: 2021-05-01
author: Nico Schottelius
twitter_handle: NicoSchottelius
_hidden: no
_discoverable: no
From today on ungleich offers free, encrypted IPv6 VPNs for hackerspaces
## Introduction
Spring is the time for cleanup. Cleaning up your apartment, removing
dust from the cabinet, letting the light shine through the windows,
or like in our case: improving the networking situation.
In this article we give an introduction to where we started and what
the typical setup used to be in our data center.
## Best practice
When we started [Data Center Light]( in
2017, we oriented ourselves towards "best practice" for networking. We
started with IPv6 only networks and used an RFC1918 network (10/8) for
internal IPv4 routing.
And we started with 2 routers for every network to provide
redundancy.
## Router redundancy
So what do you do when you have two routers? In the Linux world the
software [keepalived](
is very popular to provide redundant routing
using the [VRRP protocol](
## Active-Passive
While VRRP is designed to allow multiple (not only two) routers to
co-exist in a network, its design is basically active-passive: you
have one active router and n passive routers, in our case 1
passive router.
## Keepalived: a closer look
A typical keepalived configuration in our network looked like this:
    vrrp_instance router_v4 {
        interface INTERFACE
        virtual_router_id 2
        priority PRIORITY
        advert_int 1
        virtual_ipaddress {
            dev eth1.5 # Internal
        }
        notify_backup "/usr/local/bin/"
        notify_fault "/usr/local/bin/"
        notify_master "/usr/local/bin/"
    }
    vrrp_instance router_v6 {
        interface INTERFACE
        virtual_router_id 1
        priority PRIORITY
        advert_int 1
        virtual_ipaddress {
            2a0a:e5c0:1:8::48/128 dev eth1.8 # Transfer for routing from outside
            2a0a:e5c0:0:44::7/64 dev bond0.18 # zhaw
            2a0a:e5c0:2:15::7/64 dev bond0.20 #
        }
    }
This is a template that we distribute via [cdist](https:/ The
strings INTERFACE and PRIORITY are replaced via cdist. The interface
field defines which interface to use for VRRP communication and the
priority field determines which of the routers is the active one.
So far, so good. However let's have a look at a tiny detail of this
configuration file:
    notify_backup "/usr/local/bin/"
    notify_fault "/usr/local/bin/"
    notify_master "/usr/local/bin/"
These three lines basically say: "start something if you are the
master" and "stop something in case you are not". And why did we do
this? Because of stateful services.
## Stateful services
A typical shell script that we would call contains lines like this:
    /etc/init.d/radvd stop
    /etc/init.d/dhcpd stop
(or start in the case of the master version)
In earlier days, this even contained openvpn, which was running on our
first generation router version. But more about OpenVPN later.
The reason why we stopped and started dhcp and radvd is to make
clients of the network use the active router. We used radvd to provide
IPv6 addresses as the primary access method to servers. And we used
dhcp mainly to allow servers to netboot. The active router would
carry state (firewall!) and thus the flow of packets always needs to go
through the active router.
Restarting radvd on a different machine keeps the IPv6 addresses the
same, as clients assign them themselves using EUI-64. In the case of dhcp
(IPv4) we could have used hardcoded IPv4 addresses using a mapping of
MAC address to IPv4 address, but we opted against this. The main
reason is that dhcp clients re-request their previous lease and even if an
IPv4 address changes, it is not really of importance.
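The reason the IPv6 address survives a router switch is that it depends only on the announced prefix and the MAC address of the client. A minimal Python sketch of the EUI-64 derivation follows; the MAC address is a made-up example and the prefix is borrowed from the netboot article.

```python
import ipaddress


def eui64_address(prefix: str, mac: str) -> ipaddress.IPv6Address:
    """Build the address a host derives from a /64 prefix and its MAC."""
    network = ipaddress.ip_network(prefix)
    if network.prefixlen != 64:
        raise ValueError("EUI-64 interface identifiers need a /64 prefix")
    octets = bytearray(int(part, 16) for part in mac.split(":"))
    if len(octets) != 6:
        raise ValueError("expected a 48-bit MAC address")
    octets[0] ^= 0x02  # flip the universal/local bit
    # Insert ff:fe in the middle to get the 64-bit interface identifier.
    interface_id = bytes(octets[:3]) + b"\xff\xfe" + bytes(octets[3:])
    return network[int.from_bytes(interface_id, "big")]


# Hypothetical MAC address, for illustration only.
address = eui64_address("2a0a:e5c0:0:6::/64", "52:54:00:12:34:56")
print(address)  # 2a0a:e5c0:0:6:5054:ff:fe12:3456
```

Whichever router sends the router advertisement, the client ends up with the same address as long as the prefix stays the same.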
During a failover this would lead to an interruption of a few seconds
while sessions are re-established. Given that routers are usually rather stable
and restarting them is not a daily task, we initially accepted this.
## Keepalived/VRRP changes
One of the trickier things is changing the keepalived configuration. Because
keepalived uses the *number of addresses and routes* to verify
that the received VRRP packet matches its configuration, adding or
deleting IP addresses and routes causes a problem:
While only one router has been updated, the number of IP addresses or routes
differs between the two. This causes both routers to ignore the other's VRRP messages
and both routers think they should be the master process.
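The resulting double master situation can be reproduced with a toy model. The Python sketch below only encodes the behaviour as described in this article (adverts with a different address count are ignored); it is not keepalived's actual implementation, and the class, priorities and counts are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Router:
    name: str
    priority: int
    address_count: int  # number of configured virtual addresses and routes

    def accepts_advert(self, other: "Router") -> bool:
        """Adverts announcing a different count are ignored."""
        return other.address_count == self.address_count

    def becomes_master(self, other: "Router") -> bool:
        """Stay backup only if a valid advert from a better peer is heard."""
        hears_better_peer = (
            self.accepts_advert(other) and other.priority > self.priority
        )
        return not hears_better_peer


def masters(a: Router, b: Router) -> list:
    """Names of all routers that consider themselves master."""
    return [r.name for r, peer in ((a, b), (b, a)) if r.becomes_master(peer)]


router1 = Router("router1", priority=200, address_count=3)
router2 = Router("router2", priority=100, address_count=3)
in_sync = masters(router1, router2)

router1.address_count = 4  # configuration rolled out to router1 only
half_updated = masters(router1, router2)
print(in_sync, half_updated)
```

As long as the configuration is rolled out to one router only, router2 no longer hears a valid advert and promotes itself as well.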
This leads to the problem that both routers receive client and outside
traffic. This causes the firewall (nftables) to not recognise
returning packets, if they were sent out by router1, but received back
by router2 and, because nftables is configured *stateful*, will drop
the returning packet.
However not only changes to the configuration can trigger this
problem, but also any communication problem between the two
routers. Since 2017 we have experienced multiple times that keepalived
was unable to receive messages from or send messages to the other router and thus
both of them again became the master process.
## Take away
While in theory keepalived should improve the reliability, in practice
the number of problems we had due to double master situations made us
question whether the keepalived concept is the fitting one for us.
You can read how we evolved from this setup in
[the next blog article](/u/blog/datacenterlight-ipv6-only-netboot/).