+blog: data center light status
parent 259f0e5d12, commit c028ca0893
1 changed file with 88 additions and 0 deletions
content/u/blog/datacenterlight-report-2021-03-15/contents.lr (new file, +88)
@@ -0,0 +1,88 @@
title: Data Center Light Report 2021-03-15
---
pub_date: 2021-03-15
---
author: ungleich
---
twitter_handle: ungleich
---
_hidden: no
---
_discoverable: no
---
abstract:
Report of the state of Data Center Light
---
body:

In this report we describe the partial network outage that happened on
2021-03-15 between approximately 17:00 and 17:36 in place6.

## 1700: router reconfiguration

At 17:00:45 a planned, regular reconfiguration of two firewall/routing
systems was applied.

## 1702: outage detection

A few minutes later our team noticed that some virtualisation hosts
became unreachable.

## 1702..1733: staff gathering and remote analysis

Our staff works from multiple locations (place5, place6, place7,
place10), and our core team agreed to meet onsite. During this time we
discovered that all routers were reachable, but all virtualisation
servers were unreachable. There was no power outage, no switch
reconfiguration and no visible network loop.

## 1733..1736: onsite debugging and problem solution

Onsite we discovered that the backup router had an error in its
keepalived configuration that prevented the keepalived process from
restarting. As keepalived is only restarted manually, this bug was not
spotted until we tried switching traffic to the backup router.

After this bug was discovered and the corresponding fix was applied to
the production router, restarting the keepalived process on the main
router restored network connectivity.

## Post mortem

Right after solving the problem we continued investigating, as the
virtualisation infrastructure should have continued working without
this specific set of routers. The affected routers are only used for
routing and firewalling the servers; they do not pass the traffic of
VMs.

However, these routers are responsible for assigning addresses to all
servers via router advertisements, using the software *radvd*.
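
As an illustration of this mechanism, a minimal radvd configuration
announcing one prefix could look roughly like the sketch below; the
interface name and the prefix are placeholders, not our production
values.

```
# hypothetical radvd.conf sketch: send router advertisements on eth0 so
# that servers autoconfigure addresses from the announced prefix
interface eth0 {
    AdvSendAdvert on;

    prefix 2001:db8:42::/64 {
        AdvOnLink on;
        AdvAutonomous on;
    };
};
```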

Radvd is disabled by default on the routers and is only started by
keepalived if the router is the main router (based on the VRRP
protocol). So when we restarted keepalived, it also restarted radvd.
And from what we can see, radvd on the main router had started
crashing recently:

```
[20853563.567958] radvd[14796]: segfault at 47b94bc8 ip 00007f042397d53b sp 00007ffd47b94b28 error 4 in ld-musl-x86_64.so.1[7f0423942000+48000]
```

Thus restarting keepalived triggered a restart of radvd, which in turn
restored connectivity of the servers. We do not know exactly which
router advertisements were sent prior to the crash (because we don't
log network traffic), but radvd has an option to announce that the
router is leaving the network so that clients remove their IP
addresses.
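
For illustration, the coupling between the VRRP state and radvd
described above can be expressed in keepalived roughly as in the
following sketch; the instance name, interface, virtual router id,
address and service commands are placeholders rather than our
production configuration.

```
# hypothetical keepalived.conf sketch: radvd runs only on the router
# that currently holds the VRRP MASTER state
vrrp_instance uplink_v6 {
    state MASTER               # the backup router would use BACKUP
    interface eth0
    virtual_router_id 51
    priority 150               # the backup router uses a lower priority
    advert_int 1

    virtual_ipaddress {
        2001:db8:42::1/64      # placeholder address
    }

    # start radvd when becoming MASTER, stop it otherwise
    notify_master "/etc/init.d/radvd start"
    notify_backup "/etc/init.d/radvd stop"
    notify_fault  "/etc/init.d/radvd stop"
}
```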

We have also seen that the servers logged an outage in the ceph
cluster, which was likely due to the loss of IP addresses in the
storage cluster, again caused by the crash of radvd. That in turn
caused VMs to have stale I/O.

## Follow up

So far our monitoring did not check whether a radvd process was
running; such a check might have led to a faster resolution of this
incident. We will add a new expression to our monitoring, "there is
one radvd process running on either of the two routers", to get
better insight in the future.
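
As a sketch of what such a check could look like, the following
standalone script counts radvd processes on both routers and alerts
unless exactly one is running; the router hostnames and the ssh/pgrep
approach are assumptions made for illustration, not our actual
monitoring setup.

```
#!/usr/bin/env python3
# Hypothetical monitoring check: exactly one radvd process should be
# running across the two routers (on whichever one is VRRP MASTER).
import subprocess
import sys

ROUTERS = ["router1.example.org", "router2.example.org"]  # placeholder hostnames


def radvd_count(host: str) -> int:
    """Count radvd processes on a host via ssh + pgrep."""
    result = subprocess.run(
        ["ssh", host, "pgrep", "-c", "-x", "radvd"],
        capture_output=True, text=True,
    )
    # pgrep -c prints the match count; it prints 0 and exits non-zero
    # when no process matches.
    try:
        return int(result.stdout.strip() or "0")
    except ValueError:
        return 0


def main() -> int:
    total = sum(radvd_count(host) for host in ROUTERS)
    if total == 1:
        print("OK: exactly one radvd process is running")
        return 0
    print(f"CRITICAL: expected exactly 1 radvd process, found {total}")
    return 2


if __name__ == "__main__":
    sys.exit(main())
```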