title: Data Center Light Report 2021-03-15
---
pub_date: 2021-03-15
---
author: ungleich
---
twitter_handle: ungleich
---
_hidden: no
---
_discoverable: no
---
abstract:
Report of the state of Data Center Light
---
body:

In this report we describe the partial network outage that happened
on 2021-03-15 between ca. 1700 and 1736 in place6.

## 1700: router reconfiguration

At 17:00:45 a planned, regular reconfiguration of two firewall/routing
systems was applied.

## 1702: outage detection

A few minutes later our team noticed that some virtualisation hosts
had become unreachable.

## 1702..1733: staff gathering and remote analysis

Our staff works in multiple locations (place5, place6, place7,
place10) and our core team agreed to meet onsite. During this time we
discovered that all routers were reachable, but all virtualisation
servers were unreachable. There was no power outage, no switch
reconfiguration and no visible network loops.
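
Narrowing this down remotely comes down to a reachability sweep over
routers and virtualisation hosts; a minimal sketch (hostnames are
placeholders, not our actual inventory):

```
# Check which class of hosts still answers: routers vs. virtualisation servers.
for host in router1 router2 server1 server2 server3; do
    if ping -c 1 -W 1 "$host" >/dev/null 2>&1; then
        echo "reachable:   $host"
    else
        echo "UNREACHABLE: $host"
    fi
done
```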

## 1733..1736: onsite debugging and problem solution

Onsite we discovered that the backup router had a configuration error
in its keepalived configuration that prevented the keepalived process
from restarting. As keepalived is only restarted manually, this bug
was not spotted until we tried switching traffic to the backup router.
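
This class of error can be caught before a restart with keepalived's
built-in configuration check. A minimal sketch, assuming a keepalived
release that provides --config-test (recent 2.x versions do) and an
OpenRC-based system such as Alpine Linux; the path is illustrative:

```
# Validate the configuration first, restart only if it parses cleanly.
keepalived --config-test --use-file /etc/keepalived/keepalived.conf \
    && rc-service keepalived restart
```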

After this bug was discovered and the patch was applied to the
production router, restarting the keepalived process on the main
router restored network connectivity.

## Post mortem

Just after solving the problem we continued investigating, as the
virtualisation infrastructure should continue working without this
specific set of routers: the affected routers only route and firewall
traffic for the servers, but do not pass the traffic of VMs.

However, these routers are responsible for assigning addresses to all
servers via router advertisements, using the software *radvd*.
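
A minimal radvd configuration of this kind looks roughly as follows;
the interface name and prefix are placeholders, not our production
values:

```
# Announce a prefix so the servers autoconfigure their addresses (SLAAC).
interface eth0
{
    AdvSendAdvert on;
    prefix 2001:db8:1::/64
    {
        AdvOnLink on;
        AdvAutonomous on;
    };
};
```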

Radvd is by default disabled on the routers and is only started by
keepalived if the router is the main router (based on the VRRP
protocol). So when we restarted keepalived, it would also restart
radvd.
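
In keepalived terms this coupling is typically expressed with notify
scripts that follow the VRRP state; a sketch of the pattern
(interface, VRID, address and script paths are placeholders, not our
actual configuration):

```
vrrp_instance uplink_v6 {
    state BACKUP
    interface eth0
    virtual_router_id 51
    priority 150
    virtual_ipaddress {
        2001:db8:1::1/64
    }
    # radvd runs only while this node is MASTER
    notify_master "/usr/local/bin/radvd-start.sh"
    notify_backup "/usr/local/bin/radvd-stop.sh"
    notify_fault  "/usr/local/bin/radvd-stop.sh"
}
```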

From what we can see, radvd on the main router had started crashing
recently:

```
[20853563.567958] radvd[14796]: segfault at 47b94bc8 ip 00007f042397d53b sp 00007ffd47b94b28 error 4 in ld-musl-x86_64.so.1[7f0423942000+48000]
```

Thus restarting keepalived triggered a restart of radvd, which in
turn restored connectivity of the servers. We are not fully aware of
which router advertisements were sent prior to the crash (because we
don't log network traffic), but there is an option for radvd to
announce that the router is leaving the network and that clients
should remove their IP addresses.
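
If radvd simply stops without such an announcement, clients keep
their addresses until the advertised lifetimes expire. A sketch of
the relevant knobs (interface, prefix and values are illustrative,
not our configuration):

```
interface eth0
{
    AdvSendAdvert on;
    # how long clients keep the default route learned from this router
    AdvDefaultLifetime 180;
    prefix 2001:db8:1::/64
    {
        # how long autoconfigured addresses stay valid/preferred
        AdvValidLifetime 3600;
        AdvPreferredLifetime 1800;
    };
};
```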

We have also seen that the servers logged an outage in the ceph
cluster, which is likely due to the loss of IP addresses in the
storage cluster, again caused by the crash of radvd. That in turn
caused VMs to have stale I/O.

## Follow up

So far our monitoring did not check whether a radvd process was
running; monitoring this might have led to a faster resolution of
this incident. We will add a new expression to our monitoring, "there
is exactly one radvd process running on either of the two routers",
to get better insight in the future.
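
A minimal sketch of such a check, assuming the routers are reachable
over SSH; the hostnames are placeholders and the actual monitoring
expression may look different:

```
#!/bin/sh
# Alert unless exactly one radvd process runs across the two routers.
total=0
for router in router1 router2; do
    n=$(ssh "$router" pgrep -c radvd 2>/dev/null) || n=0
    total=$((total + n))
done
if [ "$total" -ne 1 ]; then
    echo "CRITICAL: expected exactly 1 radvd process, found $total"
    exit 2
fi
echo "OK: exactly one radvd process running"
```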