diff --git a/content/u/blog/datacenterlight-report-2021-03-15/contents.lr b/content/u/blog/datacenterlight-report-2021-03-15/contents.lr
new file mode 100644
index 0000000..ef08ce5
--- /dev/null
+++ b/content/u/blog/datacenterlight-report-2021-03-15/contents.lr
@@ -0,0 +1,88 @@
title: Data Center Light Report 2021-03-15
---
pub_date: 2021-03-15
---
author: ungleich
---
twitter_handle: ungleich
---
_hidden: no
---
_discoverable: no
---
abstract:
Report of the state of Data Center Light
---
body:

In this report we describe the partial network outage that occurred on
2021-03-15 between ca. 1700 and 1736 in place6.

## 1700: router reconfiguration

At 17:00:45, a planned, routine reconfiguration of two firewall/routing
systems was applied.

## 1702: outage detection

A few minutes later our team noticed that some virtualisation hosts
had become unreachable.

## 1702..1733: staff gathering and remote analysis

Our staff works from multiple locations (place5, place6, place7,
place10), and our core team agreed to meet onsite. During this time we
established that all routers were reachable, but all virtualisation
servers were not. There was no power outage, no switch
reconfiguration and no visible network loop.

## 1733..1736: onsite debugging and problem solution

Onsite we discovered that the backup router had an error in its
keepalived configuration that prevented the keepalived process from
restarting. As keepalived is only restarted manually, this bug had not
been spotted until we tried to switch traffic to the backup router.

After this bug was discovered and the fix was applied to the
production router, restarting the keepalived process on the main
router restored network connectivity.

## Post mortem

Right after resolving the problem we continued investigating, because
the virtualisation infrastructure should have kept working even
without this specific pair of routers. The affected routers only
route and firewall traffic for the servers; they do not carry the
traffic of VMs.

However, these routers are also responsible for assigning addresses to
all servers via router advertisements, using the software *radvd*.

Radvd is disabled by default on the routers and is only started by
keepalived if the router is the main router (based on the VRRP
protocol); a configuration sketch of this dependency is shown at the
end of this section. So when we restarted keepalived, it also
restarted radvd. And from what we can see, radvd on the main router
had recently started crashing:

```
[20853563.567958] radvd[14796]: segfault at 47b94bc8 ip 00007f042397d53b sp 00007ffd47b94b28 error 4 in ld-musl-x86_64.so.1[7f0423942000+48000]
```

Thus restarting keepalived triggered a restart of radvd, which in turn
restored connectivity of the servers. We do not know exactly which
router advertisements were sent prior to the crash (because we do not
log network traffic), but radvd has an option to announce that the
router is leaving the network and that clients should remove their IP
addresses.

We have also seen that the servers logged an outage in the ceph
cluster, which is likely due to the loss of IP addresses in the
storage cluster, again caused by the radvd crash. That in turn caused
VMs to have stale I/O.
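To make the keepalived/radvd dependency more concrete, here is a
minimal configuration sketch of how such a setup is commonly wired up
with keepalived's VRRP notify scripts. This is an illustration only:
the instance name, interface, virtual router id, priority and script
paths are invented for this example and are not our production
configuration.

```
# Hypothetical excerpt from /etc/keepalived/keepalived.conf on one router.
# All names and values below are placeholders for illustration.
vrrp_instance uplink {
    state BACKUP
    interface eth0
    virtual_router_id 51
    priority 100

    # Start radvd only when this node becomes MASTER; stop it when the
    # node falls back to BACKUP or FAULT. If radvd later crashes, nothing
    # restarts it until keepalived restarts or a VRRP transition happens.
    notify_master "/etc/keepalived/radvd-start.sh"
    notify_backup "/etc/keepalived/radvd-stop.sh"
    notify_fault  "/etc/keepalived/radvd-stop.sh"
}
```

With a setup like this, a crash of radvd on the master goes unnoticed
by keepalived itself, which matches the behaviour we observed:
connectivity only came back once keepalived (and with it radvd) was
restarted.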
## Follow up

Our monitoring so far did not check whether a radvd process was
running; such a check might have allowed a faster resolution of this
incident. We will add a new expression to our monitoring asserting
that "there is one radvd process running on either of the two
routers" to get better insight in the future.
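As a sketch of what such a check could look like, the following shell
script counts on how many of the two routers a radvd process is
running and alerts unless it is exactly one. The hostnames and the
Nagios-style exit codes are assumptions for illustration; the actual
expression will be adapted to our monitoring system.

```
#!/bin/sh
# Sketch of a radvd check: exactly one of the two routers (the current
# VRRP master) should be running radvd. Hostnames are placeholders.
ROUTERS="router1.example.org router2.example.org"

count=0
for r in $ROUTERS; do
    # pgrep -x succeeds if a process named exactly "radvd" exists
    if ssh "$r" pgrep -x radvd >/dev/null 2>&1; then
        count=$((count + 1))
    fi
done

if [ "$count" -eq 1 ]; then
    echo "OK: radvd running on exactly one router"
    exit 0
else
    echo "CRITICAL: radvd running on $count of 2 routers (expected 1)"
    exit 2
fi
```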