+blog: data center light status

Nico Schottelius 2021-03-15 20:41:40 +01:00
parent 259f0e5d12
commit c028ca0893
1 changed file with 88 additions and 0 deletions

@ -0,0 +1,88 @@
title: Data Center Light Report 2021-03-15
---
pub_date: 2021-03-15
---
author: ungleich
---
twitter_handle: ungleich
---
_hidden: no
---
_discoverable: no
---
abstract:
Report of the state of Data Center Light
---
body:
In this report we describe the partial network outage that happened on
2021-03-15 between ca. 1700 and 1736 in place6.
## 1700: router reconfiguration
At 17:00:45 a planned, regular reconfiguration of two firewall/routing
systems was applied.
## 1702: outage detection
A few minutes later our team noticed that some virtualisation hosts
became unreachable.
## 1702..1733: staff gathering and remote analysis
Our staff works in multiple locations (place5, place6, place7,
place10), and our core team agreed to meet onsite. During this time we
discovered that all routers were reachable, but all virtualisation
servers were unreachable. There was no power outage, no switch
reconfiguration, and no visible network loops.
## 1733..1736: onsite debugging and problem solution
Onsite we discovered that the backup router had an error in its
keepalived configuration that prevented the keepalived process from
restarting. As keepalived is only restarted manually, this bug was not
spotted until we tried switching traffic to the backup router.
After this bug was discovered and the fix applied to the production
router, restarting the keepalived process on the main router restored
network connectivity.
## Post mortem
Right after solving the problem we continued investigating, as the
virtualisation infrastructure should continue working without this
specific set of routers: the affected routers only route and firewall
traffic for the servers, but do not carry the traffic of VMs.
However, these routers are responsible for assigning addresses to all
servers via router advertisements, using the software *radvd*.
Radvd is disabled by default on the routers and is only started by
keepalived if the router is the main router (based on the VRRP
protocol). So when we restarted keepalived, it also restarted radvd.
And from what we can see, radvd on the main router had started
crashing recently:
```
[20853563.567958] radvd[14796]: segfault at 47b94bc8 ip 00007f042397d53b sp 00007ffd47b94b28 error 4 in ld-musl-x86_64.so.1[7f0423942000+48000]
```
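To illustrate this coupling, the sketch below shows what a keepalived
VRRP instance with notify hooks starting and stopping radvd can look
like. The interface name, VRID, address and script paths are made up
for illustration and are not our actual configuration.
```
vrrp_instance uplink_v6 {
    # hypothetical interface, VRID and address -- not our real values
    interface eth0
    virtual_router_id 51
    priority 150
    virtual_ipaddress {
        2a0a:e5c0:ffff::1/64
    }
    # radvd is only started on the router that currently holds the VRRP
    # MASTER state; on transition to BACKUP or FAULT it is stopped again
    notify_master "/etc/keepalived/radvd-start.sh"
    notify_backup "/etc/keepalived/radvd-stop.sh"
    notify_fault  "/etc/keepalived/radvd-stop.sh"
}
```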
Thus restarting keepalived triggered a restart of radvd, which in
turn restored connectivity of the servers. We do not know exactly
which router advertisements were sent prior to the crash (because we
don't log network traffic), but radvd has an option to announce that
the router is leaving the network, telling clients to remove their IP
addresses.
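For context, a minimal radvd configuration for such a router could
look like the sketch below; the interface name and prefix are
placeholders, not our actual values, and the DeprecatePrefix option is
the kind of knob referred to above.
```
interface eth0
{
    AdvSendAdvert on;
    # hypothetical prefix, not one of our real networks
    prefix 2a0a:e5c0:ffff::/64
    {
        AdvOnLink on;
        AdvAutonomous on;
        # on a clean shutdown or prefix removal, announce the prefix
        # with zero lifetimes so that clients deprecate it
        DeprecatePrefix on;
    };
};
```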
We have also seen that the servers logged an outage in the ceph
cluster, which was likely caused by the loss of IP addresses in the
storage cluster, again due to the radvd crash. That in turn caused
VMs to have stale I/O.
## Follow up
So far our monitoring did not check whether a radvd process was
running; having such a check might have led to a faster resolution of
this incident. We will add a new expression to our monitoring that
verifies that "there is exactly one radvd process running on either of
the two routers" to get better insight in the future.
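A sketch of what such a check could look like, written as a small
shell script that the monitoring system could run; the router names
and the ssh based approach are assumptions for illustration, not our
actual monitoring setup.
```
#!/bin/sh
# Count radvd processes across both routers; exactly one should run.
# Hypothetical router names -- replace with the real ones.
total=0
for router in router1.example.org router2.example.org; do
    count=$(ssh "$router" pgrep -c -x radvd) || count=0
    total=$((total + count))
done
if [ "$total" -eq 1 ]; then
    echo "OK: exactly one radvd process running"
    exit 0
else
    echo "CRITICAL: $total radvd processes running, expected exactly 1"
    exit 2
fi
```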