+blog: data center light status

Nico Schottelius 2021-03-15 20:41:40 +01:00
parent 259f0e5d12
commit c028ca0893
1 changed file with 88 additions and 0 deletions

@ -0,0 +1,88 @@
title: Data Center Light Report 2021-03-15
---
pub_date: 2021-03-15
---
author: ungleich
---
twitter_handle: ungleich
---
_hidden: no
---
_discoverable: no
---
abstract:
Report of the state of Data Center Light
---
body:
In this report we describe the partial network outage that happened on
2021-03-15 between ca. 1700 and 1736 in place6.
## 1700: router reconfiguration
At 17:00:45 a planned, regular reconfiguration of two firewall/routing
systems was applied.
## 1702: outage detection
A few minutes later our team noticed that some virtualisation hosts
became unreachable.
## 1702..1733: staff gathering and remote analysis
Our staff works in multiple locations (place5, place6, place7,
place10), and our core team agreed to meet onsite. During this time we
discovered that all routers were reachable, but all virtualisation
servers were unreachable. There was no power outage, no switch
reconfiguration, and no visible network loops.
## 1733..1736: onsite debugging and problem solution
Onsite we discovered that the backup router had an error in its
keepalived configuration that prevented the keepalived process from
restarting. As keepalived is only restarted manually, this bug was not
spotted until we tried switching traffic to the backup router.
After this bug was discovered and the fix applied to the production
router, restarting the keepalived process on the main router restored
network connectivity.
## Post mortem
Right after solving the problem we continued investigating, as the
virtualisation infrastructure should continue working without this
specific set of routers: the affected routers only route and firewall
traffic for the servers, but do not carry the traffic of VMs.
However, these routers are responsible for assigning addresses to all
servers via router advertisements, using the software *radvd*.
Radvd is disabled by default on the routers and is only started by
keepalived if the router is the main router (based on the VRRP
protocol). So when we restarted keepalived, it also restarted radvd.
And from what we can see, radvd on the main router had started
crashing recently:
```
[20853563.567958] radvd[14796]: segfault at 47b94bc8 ip 00007f042397d53b sp 00007ffd47b94b28 error 4 in ld-musl-x86_64.so.1[7f0423942000+48000]
```
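To illustrate this coupling, the sketch below shows what a keepalived
VRRP instance with notify hooks starting and stopping radvd can look
like. The interface name, VRID, address and script paths are made up
for illustration and are not our actual configuration.
```
vrrp_instance uplink_v6 {
    # hypothetical interface, VRID and address -- not our real values
    interface eth0
    virtual_router_id 51
    priority 150
    virtual_ipaddress {
        2a0a:e5c0:ffff::1/64
    }
    # radvd is only started on the router that currently holds the VRRP
    # MASTER state; on transition to BACKUP or FAULT it is stopped again
    notify_master "/etc/keepalived/radvd-start.sh"
    notify_backup "/etc/keepalived/radvd-stop.sh"
    notify_fault  "/etc/keepalived/radvd-stop.sh"
}
```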
Thus restarting keepalived triggered a restart of radvd, which in
turn restored connectivity of the servers. We do not know exactly
which router advertisements were sent prior to the crash (because we
don't log network traffic), but radvd has an option to announce that
the router is leaving the network, telling clients to remove their IP
addresses.
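For context, a minimal radvd configuration for such a router could
look like the sketch below; the interface name and prefix are
placeholders, not our actual values, and the DeprecatePrefix option is
the kind of knob referred to above.
```
interface eth0
{
    AdvSendAdvert on;
    # hypothetical prefix, not one of our real networks
    prefix 2a0a:e5c0:ffff::/64
    {
        AdvOnLink on;
        AdvAutonomous on;
        # on a clean shutdown or prefix removal, announce the prefix
        # with zero lifetimes so that clients deprecate it
        DeprecatePrefix on;
    };
};
```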
We have also seen that the servers logged an outage in the ceph
cluster, which was likely caused by the loss of IP addresses in the
storage cluster, again due to the radvd crash. That in turn caused
VMs to have stale I/O.
## Follow up
So far our monitoring did not check whether a radvd process was
running; having such a check might have led to a faster resolution of
this incident. We will add a new expression to our monitoring that
verifies that "there is exactly one radvd process running on either of
the two routers" to get better insight in the future.
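A sketch of what such a check could look like, written as a small
shell script that the monitoring system could run; the router names
and the ssh based approach are assumptions for illustration, not our
actual monitoring setup.
```
#!/bin/sh
# Count radvd processes across both routers; exactly one should run.
# Hypothetical router names -- replace with the real ones.
total=0
for router in router1.example.org router2.example.org; do
    count=$(ssh "$router" pgrep -c -x radvd) || count=0
    total=$((total + count))
done
if [ "$total" -eq 1 ]; then
    echo "OK: exactly one radvd process running"
    exit 0
else
    echo "CRITICAL: $total radvd processes running, expected exactly 1"
    exit 2
fi
```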