title: Data Center Light Report 2021-03-15
---
pub_date: 2021-03-15
---
author: ungleich
---
twitter_handle: ungleich
---
_hidden: no
---
_discoverable: no
---
abstract: Report of the state of Data Center Light
---
body:

In this report we describe the partial network outage that happened on 2021-03-15 between ca. 1700 and 1736 in place6.

## 1700: router reconfiguration

At 17:00:45 a planned, regular reconfiguration of two firewall/routing systems was applied.

## 1702: outage detection

A few minutes later our team noticed that some virtualisation hosts had become unreachable.

## 1702..1733: staff gathering and remote analysis

Our staff works from multiple locations (place5, place6, place7, place10), and our core team agreed to meet onsite. During this time we discovered that all routers were reachable, but all virtualisation servers were unreachable. There was no power outage, no switch reconfiguration and no visible network loop.

## 1733..1736: onsite debugging and problem solution

Onsite we discovered that the backup router had an error in its keepalived configuration that prevented the keepalived process from restarting. As keepalived is only restarted manually, this bug went unnoticed until we tried to switch traffic to the backup router. After the bug was discovered and the fix was applied to the production router as well, restarting the keepalived process on the main router restored network connectivity.

## Post mortem

Right after solving the problem we continued investigating, as the virtualisation infrastructure should have continued working without this specific set of routers. The affected routers only route and firewall traffic for the servers themselves; they do not carry the traffic of VMs.

However, these routers are responsible for assigning addresses to all servers via router advertisements, using the software *radvd*. Radvd is disabled by default on the routers and is only started by keepalived if the router is the main router (based on the VRRP protocol). So when we restarted keepalived, it also restarted radvd. And from what we can see, radvd on the main router had recently started crashing:

```
[20853563.567958] radvd[14796]: segfault at 47b94bc8 ip 00007f042397d53b sp 00007ffd47b94b28 error 4 in ld-musl-x86_64.so.1[7f0423942000+48000]
```

Thus restarting keepalived triggered a restart of radvd, which in turn restored the connectivity of the servers.

We do not know exactly which router advertisements were sent prior to the crash (we do not log network traffic), but radvd has an option to announce that the router is leaving the network, which makes clients remove their IP addresses.

We have also seen that the servers logged an outage in the ceph cluster, which is likely due to the loss of IP addresses in the storage cluster, again caused by the crash of radvd. That in turn caused VMs to see stale I/O.

## Follow up

So far our monitoring did not check whether a radvd process was running; such a check might have led to a faster resolution of this incident. We will add a new expression to our monitoring, "there is exactly one radvd process running on either of the two routers", to get better insight in the future; a sketch of such a check is shown below.
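As an illustration, here is a minimal sketch of such a check in Python, assuming SSH access to both routers; the hostnames are placeholders, not our production names:

```python
#!/usr/bin/env python3
# Minimal sketch of the planned check: across the two routers, exactly
# one radvd process should be running. Hostnames are placeholders.
import subprocess

ROUTERS = ["router1.place6.example.com", "router2.place6.example.com"]


def radvd_count(host):
    """Return the number of radvd processes on host, 0 on error."""
    try:
        result = subprocess.run(
            ["ssh", host, "pgrep", "-c", "-x", "radvd"],
            capture_output=True, text=True, timeout=10,
        )
        # pgrep -c prints the number of matching processes; it exits
        # non-zero and prints 0 when nothing matches.
        return int(result.stdout.strip() or 0)
    except (subprocess.TimeoutExpired, ValueError):
        return 0


def main():
    total = sum(radvd_count(host) for host in ROUTERS)
    if total == 1:
        print("OK: exactly one radvd process is running")
        return 0
    print("CRITICAL: %d radvd processes running on %s" % (total, ROUTERS))
    return 2  # Nagios-style critical exit code


if __name__ == "__main__":
    raise SystemExit(main())
```

In practice this expression will live inside our existing monitoring system rather than as a standalone script, but the logic is the same: exactly one radvd, on exactly one of the two routers.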
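For background on the coupling described in the post mortem: keepalived can start and stop a service such as radvd on VRRP state changes through its notify hooks. The following is purely an illustrative sketch, not our production configuration; the interface name, virtual router id, address and script paths are made up:

```
vrrp_instance uplink_v6 {
    state BACKUP
    interface eth0          # illustrative interface name
    virtual_router_id 42    # illustrative VRID
    priority 100
    advert_int 1

    virtual_ipaddress {
        2001:db8::1/64      # documentation prefix, not a real address
    }

    # Start radvd only on the router that becomes MASTER,
    # stop it when the router falls back to BACKUP or FAULT.
    notify_master "/etc/keepalived/radvd-start.sh"
    notify_backup "/etc/keepalived/radvd-stop.sh"
    notify_fault  "/etc/keepalived/radvd-stop.sh"
}
```

With a setup along these lines, a crash of radvd on the MASTER goes unnoticed by keepalived itself, which is exactly why the process-level monitoring check above is needed.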