title: Data Center Light: Spring network cleanup
---
pub_date: 2021-05-01
---
author: Nico Schottelius
---
twitter_handle: NicoSchottelius
---
_hidden: no
---
_discoverable: no
---
abstract:
From today on ungleich offers free, encrypted IPv6 VPNs for hackerspaces
---
body:
## Introduction
Spring is the time for cleanup. Cleaning up your apartment, removing
dust from the cabinet, letting the light shine through the windows,
or like in our case: improving the networking situation.
In this article we give an overview of where we started and what
the typical setup used to be in our data center.
## Best practice
When we started [Data Center Light](https://datacenterlight.ch) in
2017, we oriented ourselves towards "best practice" for networking. We
started with IPv6 only networks and used an RFC 1918 network (10/8) for
internal IPv4 routing.
And we started with 2 routers for every network to provide
redundancy.
## Router redundancy
So what do you do when you have two routers? In the Linux world the
software [keepalived](https://keepalived.org/)
is very popular to provide redundant routing
using the [VRRP protocol](https://en.wikipedia.org/wiki/Virtual_Router_Redundancy_Protocol).
## Active-Passive
While VRRP is designed to allow multiple (not only two) routers to
co-exist in a network, its design is basically active-passive: you
have one active router and n passive routers, in our case one
passive router.
## Keepalived: a closer look
A typical keepalived configuration in our network looked like this:
```
vrrp_instance router_v4 {
    interface INTERFACE
    virtual_router_id 2
    priority PRIORITY
    advert_int 1
    virtual_ipaddress {
        10.0.0.1/22 dev eth1.5 # Internal
    }
    notify_backup "/usr/local/bin/vrrp_notify_backup.sh"
    notify_fault "/usr/local/bin/vrrp_notify_fault.sh"
    notify_master "/usr/local/bin/vrrp_notify_master.sh"
}
vrrp_instance router_v6 {
    interface INTERFACE
    virtual_router_id 1
    priority PRIORITY
    advert_int 1
    virtual_ipaddress {
        2a0a:e5c0:1:8::48/128 dev eth1.8 # Transfer for routing from outside
        2a0a:e5c0:0:44::7/64 dev bond0.18 # zhaw
        2a0a:e5c0:2:15::7/64 dev bond0.20 #
    }
}
```
This is a template that we distribute via [cdist](https://cdi.st). The
strings INTERFACE and PRIORITY are replaced via cdist. The interface
field defines which interface to use for VRRP communication and the
priority field determines which of the routers is the active one.
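How cdist performs the replacement is not shown here. As a minimal
sketch, assuming a plain textual substitution of the two placeholders,
the rendering step could look like this (the function name
`render_keepalived` and the example values `eth0` and `200` are
hypothetical):

```shell
#!/bin/sh
# Sketch only: replace the INTERFACE and PRIORITY placeholders of the
# keepalived template read from stdin. Not the actual cdist mechanism.
render_keepalived() {
    # $1: interface used for VRRP communication (hypothetical value)
    # $2: VRRP priority; the router with the higher value becomes active
    sed -e "s/INTERFACE/$1/g" -e "s/PRIORITY/$2/g"
}

# Example: render a two-line excerpt of the template for one router
printf 'interface INTERFACE\npriority PRIORITY\n' | render_keepalived eth0 200
```

Because sed matches case-sensitively, the lowercase keywords
`interface` and `priority` are left untouched and only the uppercase
placeholders are replaced.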
So far, so good. However, let's have a look at a tiny detail of this
configuration file:
```
notify_backup "/usr/local/bin/vrrp_notify_backup.sh"
notify_fault "/usr/local/bin/vrrp_notify_fault.sh"
notify_master "/usr/local/bin/vrrp_notify_master.sh"
```
These three lines basically say: "start something if you are the
master" and "stop something in case you are not". And why did we do
this? Because of stateful services.
## Stateful services
A typical shell script that we would call contains lines like this:
```
/etc/init.d/radvd stop
/etc/init.d/dhcpd stop
```
(or start in the case of the master version)
In earlier days, this even contained openvpn, which was running on our
first generation router version. But more about OpenVPN later.
The reason why we stopped and started dhcp and radvd is to make
clients of the network use the active router. We used radvd to provide
IPv6 addresses as the primary access method to servers. And we used
dhcp mainly to allow servers to netboot. The active router would
carry state (firewall!) and thus the flow of packets always needs to go
through the active router.
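The start/stop logic described above can be sketched as a single
handler. This is an illustration, not the original scripts: we used
three separate scripts, and the function name `handle_vrrp_state` and
the `DRY_RUN` switch are hypothetical additions for this example.

```shell
#!/bin/sh
# Sketch only: start the stateful services on the active router and
# stop them on a passive or faulty one.
# With DRY_RUN set to a non-empty value the commands are printed
# instead of executed.
handle_vrrp_state() {
    case "$1" in
        master)       action=start ;;
        backup|fault) action=stop ;;
        *) echo "unknown state: $1" >&2; return 1 ;;
    esac
    for service in radvd dhcpd; do
        ${DRY_RUN:+echo} "/etc/init.d/$service" "$action"
    done
}

# Example: show what would run when this router becomes the backup
DRY_RUN=1
handle_vrrp_state backup
```

Each of the three `notify_*` hooks would then call the handler with
its own state name.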
Restarting radvd on a different machine keeps the IPv6 addresses the
same, as clients assign them themselves using EUI-64. In the case of dhcp
(IPv4) we could have used hardcoded IPv4 addresses using a mapping of
MAC address to IPv4 address, but we opted against this. The main
reason is that dhcp clients re-request their previous lease and even if an
IPv4 address changes, it is not really of importance.
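To illustrate why the IPv6 address does not depend on which router
sends the advertisement, here is a minimal sketch of the modified
EUI-64 derivation: the interface identifier is computed from the
client's MAC address alone. The function name `mac_to_eui64` and the
example MAC address are hypothetical, and the sketch assumes clients
use classic EUI-64 rather than privacy or stable-privacy addresses.

```shell
#!/bin/sh
# Sketch only: derive the modified EUI-64 interface identifier
# (the lower 64 bits of the address) from a MAC address given as
# aa:bb:cc:dd:ee:ff. Groups are printed without zero compression.
mac_to_eui64() {
    old_ifs=$IFS
    IFS=:
    set -- $1          # split the MAC address into its six octets
    IFS=$old_ifs
    # Flip the universal/local bit (0x02) of the first octet
    first=$(printf '%02x' $(( 0x$1 ^ 0x02 )))
    # Insert ff:fe between the third and fourth octet
    printf '%s%s:%sff:fe%s:%s%s\n' "$first" "$2" "$3" "$4" "$5" "$6"
}

# Example with a hypothetical MAC address
mac_to_eui64 52:54:00:12:34:56
```

The client appends this identifier to the /64 prefix announced by
radvd, so the same prefix announced by the other router yields the
same address.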
During a failover this would lead to an interruption of a few seconds
and to sessions being re-established. Given that routers are usually
rather stable and restarting them is not a daily task, we initially
accepted this.
## Keepalived/VRRP changes
One of the trickier things is changing the keepalived configuration.
Because keepalived uses the *number of addresses and routes* to verify
that a received VRRP packet matches its configuration, adding or
deleting IP addresses and routes causes a problem:
While only one router has been updated, the number of IP addresses or
routes differs between the two. This causes both routers to ignore each
other's VRRP messages, and both routers think they should be the
master.
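The effect can be shown with a toy model. This is not keepalived's
code; the function names and the counts are made up for illustration,
and the model only compares the locally configured count with the
count announced by the peer:

```shell
#!/bin/sh
# Toy model only: an advertisement is accepted if the announced number
# of addresses matches the local configuration.
advert_accepted() {
    # $1: locally configured count, $2: count announced by the peer
    [ "$1" -eq "$2" ]
}

# State a router ends up in once it stops hearing acceptable
# advertisements from its peer.
next_state() {
    if advert_accepted "$1" "$2"; then
        echo BACKUP
    else
        echo MASTER
    fi
}

# Example: router2 was updated to 4 addresses, router1 still announces 3
next_state 4 3
```

Since the comparison fails in both directions, each router discards
the other's messages and the same result applies to router1.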
As a result, both routers receive client and outside traffic. If a
packet was sent out by router1 but its reply arrives at router2, the
firewall (nftables) on router2 does not recognise the returning packet
and, because nftables is configured *stateful*, drops it.
However, not only configuration changes can trigger this problem, but
also any communication problem between the two routers. Since 2017 we
have experienced multiple times that keepalived was unable to receive
messages from or send messages to the other router, and thus both of
them again became the master.
## Take away
While in theory keepalived should improve reliability, in practice
the number of problems we had due to double master situations made us
question whether the keepalived concept is the right fit for us.
You can read how we evolved from this setup in
[the next blog article](/u/blog/datacenterlight-ipv6-only-netboot/).