title: Data Center Light: Spring network cleanup
---
pub_date: 2021-05-01
---
author: Nico Schottelius
---
twitter_handle: NicoSchottelius
---
_hidden: no
---
_discoverable: no
---
abstract:
From today on ungleich offers free, encrypted IPv6 VPNs for hackerspaces
---
body:

## Introduction

Spring is the time for cleanup: cleaning up your apartment, removing
dust from the cabinet, letting the light shine through the windows,
or, as in our case, improving the networking situation.

In this article we describe where we started and what
the typical setup used to be in our data center.

## Best practice

When we started [Data Center Light](https://datacenterlight.ch) in
2017, we followed "best practice" for networking. We
started with IPv6-only networks and used an RFC 1918 network (10/8) for
internal IPv4 routing.

We also started with two routers for every network to provide
redundancy.

## Router redundancy

So what do you do when you have two routers? In the Linux world,
[keepalived](https://keepalived.org/)
is a very popular choice for providing redundant routing
using [VRRP](https://en.wikipedia.org/wiki/Virtual_Router_Redundancy_Protocol).

## Active-Passive

While VRRP is designed to allow multiple (not only two) routers to
co-exist in a network, its design is basically active-passive: you
have one active router and n passive routers, in our case one
passive router.

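The active-passive idea can be sketched in a few lines of Python. This is a simplified model, not keepalived code: the router names and priority values are hypothetical, and real VRRP also has tie-breaking and preemption rules that are left out here. It only shows that the reachable router with the highest priority is the active one.

```python
from dataclasses import dataclass


@dataclass
class Router:
    name: str
    priority: int       # VRRP priority; the highest reachable one becomes active
    alive: bool = True  # False simulates a router that is down


def elect_active(routers):
    """Return the reachable router with the highest priority, or None."""
    candidates = [r for r in routers if r.alive]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r.priority)


# Hypothetical routers and priorities
router1 = Router("router1", priority=200)
router2 = Router("router2", priority=100)

print(elect_active([router1, router2]).name)  # both up: router1 is active
router1.alive = False                         # simulate a failure of router1
print(elect_active([router1, router2]).name)  # the passive router takes over
```

With n passive routers the same function applies unchanged; only the list grows.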
## Keepalived: a closer look

A typical keepalived configuration in our network looked like this:

```
vrrp_instance router_v4 {
    interface INTERFACE
    virtual_router_id 2
    priority PRIORITY
    advert_int 1
    virtual_ipaddress {
        10.0.0.1/22 dev eth1.5 # Internal
    }
    notify_backup "/usr/local/bin/vrrp_notify_backup.sh"
    notify_fault "/usr/local/bin/vrrp_notify_fault.sh"
    notify_master "/usr/local/bin/vrrp_notify_master.sh"
}

vrrp_instance router_v6 {
    interface INTERFACE
    virtual_router_id 1
    priority PRIORITY
    advert_int 1
    virtual_ipaddress {
        2a0a:e5c0:1:8::48/128 dev eth1.8 # Transfer for routing from outside
        2a0a:e5c0:0:44::7/64 dev bond0.18 # zhaw
        2a0a:e5c0:2:15::7/64 dev bond0.20 #
    }
}
```

This is a template that we distribute via [cdist](https://cdi.st). The
strings INTERFACE and PRIORITY are replaced via cdist. The interface
field defines which interface to use for VRRP communication and the
priority field determines which of the routers is the active one.

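As an illustration of the placeholder replacement, here is a minimal Python sketch. It is not how cdist does it internally; the `render` function, the shortened template and the per-router values (`eth0`, `200`) are assumptions made for this example.

```python
# Shortened version of the template above, with the same placeholders
TEMPLATE = """\
vrrp_instance router_v4 {
    interface INTERFACE
    virtual_router_id 2
    priority PRIORITY
    advert_int 1
}
"""


def render(template, interface, priority):
    """Replace the INTERFACE and PRIORITY placeholders with per-router values."""
    return (template
            .replace("INTERFACE", interface)
            .replace("PRIORITY", str(priority)))


# Hypothetical values for one router
print(render(TEMPLATE, "eth0", 200))
```

The replacement is case-sensitive, so the lowercase `interface` and `priority` keywords stay untouched and only the uppercase placeholders change.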
So far, so good. However, let's have a look at a tiny detail of this
configuration file:

```
notify_backup "/usr/local/bin/vrrp_notify_backup.sh"
notify_fault "/usr/local/bin/vrrp_notify_fault.sh"
notify_master "/usr/local/bin/vrrp_notify_master.sh"
```

These three lines basically say: "start something if you are the
master" and "stop something in case you are not". And why did we do
this? Because of stateful services.

## Stateful services

A typical shell script that we would call contains lines like this:

```
/etc/init.d/radvd stop
/etc/init.d/dhcpd stop
```
(or start in the case of the master version)

In earlier days, this even included OpenVPN, which was running on our
first-generation router version. But more about OpenVPN later.

The reason why we stopped and started dhcp and radvd was to make
clients of the network use the active router. We used radvd to provide
IPv6 addresses as the primary access method to servers. And we used
dhcp mainly to allow servers to netboot. The active router would
carry state (firewall!) and thus the flow of packets always needs to go
through the active router.

Restarting radvd on a different machine keeps the IPv6 addresses the
same, as clients assign them themselves using EUI-64. In the case of dhcp
(IPv4) we could have used hardcoded IPv4 addresses using a mapping of
MAC address to IPv4 address, but we opted against this. The main
reason is that dhcp clients re-request their previous lease, and even if an
IPv4 address changes, it is not really of importance.

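Why the address survives a radvd restart on another router can be seen from how an EUI-64 based address is built: it depends only on the announced prefix and the client's MAC address. The following Python sketch shows the derivation. The MAC address is a made-up example, and the prefix is the /64 of one of the networks in the configuration above.

```python
import ipaddress


def eui64_interface_id(mac):
    """Build the 64-bit modified EUI-64 interface identifier from a MAC address."""
    octets = [int(part, 16) for part in mac.split(":")]
    if len(octets) != 6:
        raise ValueError("expected a MAC address with 6 octets")
    octets[0] ^= 0x02                               # flip the universal/local bit
    full = octets[:3] + [0xFF, 0xFE] + octets[3:]   # insert ff:fe in the middle
    return int.from_bytes(bytes(full), "big")


def slaac_address(prefix, mac):
    """Combine a /64 prefix with the EUI-64 interface identifier."""
    network = ipaddress.IPv6Network(prefix)
    if network.prefixlen != 64:
        raise ValueError("EUI-64 addresses need a /64 prefix")
    return ipaddress.IPv6Address(int(network.network_address) + eui64_interface_id(mac))


# Hypothetical client MAC address
print(slaac_address("2a0a:e5c0:0:44::/64", "52:54:00:12:34:56"))
```

Nothing in this calculation refers to the router that sent the router advertisement, which is why either router can announce the prefix and the client ends up with the same address.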
During a failover this would lead to an interruption of a few seconds and
sessions being re-established. Given that routers are usually rather stable
and restarting them is not a daily task, we initially accepted this.

## Keepalived/VRRP changes

One of the trickier things is changing the keepalived configuration. Because
keepalived uses the *number of addresses and routes* to verify
that a received VRRP packet matches its configuration, adding or
deleting IP addresses and routes causes a problem:

While only one router has been updated, the number of IP addresses or routes
differs between the two. This causes both routers to ignore each other's VRRP
messages, and both routers think they should be the master process.

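The effect can be reproduced with a toy model in Python. This is a sketch under simplifying assumptions, not keepalived's implementation: each router starts out as master and only steps down when it accepts an advertisement with a higher priority, and an advertisement is ignored when its address count differs from the local configuration. The class, the priorities and the extra `2001:db8::7/64` address are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class VrrpRouter:
    name: str
    priority: int
    addresses: list
    is_master: bool = True

    def receive_advert(self, sender_priority, sender_address_count):
        """Process one advertisement; return False if it was ignored."""
        # Behaviour described above: a differing address count means the
        # packet does not match the local configuration and is ignored.
        if sender_address_count != len(self.addresses):
            return False
        if sender_priority > self.priority:
            self.is_master = False  # step down for the higher priority
        return True


def masters_after_exchange(a, b):
    """Let both routers exchange one advertisement and list who is master."""
    a.is_master = b.is_master = True
    a.receive_advert(b.priority, len(b.addresses))
    b.receive_advert(a.priority, len(a.addresses))
    return [r.name for r in (a, b) if r.is_master]


shared = ["2a0a:e5c0:1:8::48/128", "2a0a:e5c0:0:44::7/64", "2a0a:e5c0:2:15::7/64"]
router1 = VrrpRouter("router1", 200, list(shared))
router2 = VrrpRouter("router2", 100, list(shared))

print(masters_after_exchange(router1, router2))  # same config: one master

router1.addresses.append("2001:db8::7/64")       # only router1 was updated
print(masters_after_exchange(router1, router2))  # mismatch: both stay master
```

The second call models the window in which only one of the two routers has received the new configuration.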
As a result, both routers receive client and outside
traffic. If a packet was sent out by router1 but the reply is received
by router2, the firewall (nftables) on router2 does not recognise the
returning packet and, because nftables is configured *stateful*, drops
it.

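A strongly simplified model of connection tracking shows why the reply is dropped. This is not the nftables API. It is a conceptual Python sketch in which a flow is just a (source, destination) pair, and the addresses are taken from the documentation prefix 2001:db8::/32.

```python
class StatefulFirewall:
    """Toy connection tracker: replies are accepted only for known flows."""

    def __init__(self, name):
        self.name = name
        self.connections = set()  # flows seen leaving through this router

    def outbound(self, src, dst):
        """Record a packet leaving the network through this router."""
        self.connections.add((src, dst))

    def inbound(self, src, dst):
        """Accept a returning packet only if the matching flow was recorded here."""
        return (dst, src) in self.connections


router1 = StatefulFirewall("router1")
router2 = StatefulFirewall("router2")

client, server = "2001:db8:1::10", "2001:db8:2::20"  # example addresses

router1.outbound(client, server)         # request leaves via router1
print(router1.inbound(server, client))   # reply via router1: known flow
print(router2.inbound(server, client))   # reply via router2: unknown flow
```

Each router only knows the flows it has seen itself, so in a double master situation a reply that arrives at the other router has no matching state.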
However, not only changes to the configuration can trigger this
problem, but also any communication problem between the two
routers. Since 2017 we have experienced multiple times that keepalived
was unable to send messages to or receive messages from the other router,
and thus both of them again became the master process.

## Take away

While in theory keepalived should improve reliability, in practice
the number of problems we had due to double master situations made us
question whether the keepalived concept is the right fit for us.

You can read how we evolved from this setup in
[the next blog article](/u/blog/datacenterlight-ipv6-only-netboot/).