+blog: data center light status
parent 259f0e5d12, commit c028ca0893
1 changed file with 88 additions and 0 deletions
content/u/blog/datacenterlight-report-2021-03-15/contents.lr (new file, +88)
@@ -0,0 +1,88 @@
title: Data Center Light Report 2021-03-15
---
pub_date: 2021-03-15
---
author: ungleich
---
twitter_handle: ungleich
---
_hidden: no
---
_discoverable: no
---
abstract:
Report of the state of Data Center Light
---
body:

In this report we describe the partial network outage that happened on
2021-03-15 between approximately 17:00 and 17:36 in place6.

## 1700: router reconfiguration

At 17:00:45 a planned, regular reconfiguration of two firewall/routing
systems was applied.

## 1702: outage detection

A few minutes later our team noticed that some virtualisation hosts
became unreachable.

## 1702..1733: staff gathering and remote analysis

Our staff works from multiple locations (place5, place6, place7,
place10), and our core team agreed to meet onsite. During this time we
discovered that all routers were reachable, but all virtualisation
servers were unreachable. There was no power outage, no switch
reconfiguration and no visible network loop.

## 1733..1736: onsite debugging and problem solution

Onsite we discovered that the backup router had an error in its
keepalived configuration that prevented the keepalived process from
restarting. As keepalived is only restarted manually, this bug was not
spotted until we tried switching traffic to the backup router.

After this bug was discovered and the corresponding fix was applied to
the production router, restarting the keepalived process on the main
router restored network connectivity.

## Post mortem

Right after solving the problem we continued investigating, as the
virtualisation infrastructure should have continued working without
this specific set of routers. The affected routers are only used for
routing and firewalling the servers; they do not pass the traffic of
VMs.

However, these routers are responsible for assigning addresses to all
servers via router advertisements, using the software *radvd*.
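
As an illustration of this mechanism, a minimal radvd configuration
announcing one prefix could look roughly like the sketch below; the
interface name and the prefix are placeholders, not our production
values.

```
# hypothetical radvd.conf sketch: send router advertisements on eth0 so
# that servers autoconfigure addresses from the announced prefix
interface eth0 {
    AdvSendAdvert on;

    prefix 2001:db8:42::/64 {
        AdvOnLink on;
        AdvAutonomous on;
    };
};
```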

Radvd is disabled by default on the routers and is only started by
keepalived if the router is the main router (based on the VRRP
protocol). So when we restarted keepalived, it also restarted radvd.
And from what we can see, radvd on the main router had started
crashing recently:

```
[20853563.567958] radvd[14796]: segfault at 47b94bc8 ip 00007f042397d53b sp 00007ffd47b94b28 error 4 in ld-musl-x86_64.so.1[7f0423942000+48000]
```

Thus restarting keepalived triggered a restart of radvd, which in turn
restored connectivity of the servers. We do not know exactly which
router advertisements were sent prior to the crash (because we don't
log network traffic), but radvd has an option to announce that the
router is leaving the network so that clients remove their IP
addresses.
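
For illustration, the coupling between the VRRP state and radvd
described above can be expressed in keepalived roughly as in the
following sketch; the instance name, interface, virtual router id,
address and service commands are placeholders rather than our
production configuration.

```
# hypothetical keepalived.conf sketch: radvd runs only on the router
# that currently holds the VRRP MASTER state
vrrp_instance uplink_v6 {
    state MASTER               # the backup router would use BACKUP
    interface eth0
    virtual_router_id 51
    priority 150               # the backup router uses a lower priority
    advert_int 1

    virtual_ipaddress {
        2001:db8:42::1/64      # placeholder address
    }

    # start radvd when becoming MASTER, stop it otherwise
    notify_master "/etc/init.d/radvd start"
    notify_backup "/etc/init.d/radvd stop"
    notify_fault  "/etc/init.d/radvd stop"
}
```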

We have also seen that the servers logged an outage in the ceph
cluster, which was likely due to the loss of IP addresses in the
storage cluster, again caused by the crash of radvd. That in turn
caused VMs to have stale I/O.

## Follow up

So far our monitoring did not check whether a radvd process was
running; such a check might have led to a faster resolution of this
incident. We will add a new expression to our monitoring, "there is
one radvd process running on either of the two routers", to get
better insight in the future.
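
As a sketch of what such a check could look like, the following
standalone script counts radvd processes on both routers and alerts
unless exactly one is running; the router hostnames and the ssh/pgrep
approach are assumptions made for illustration, not our actual
monitoring setup.

```
#!/usr/bin/env python3
# Hypothetical monitoring check: exactly one radvd process should be
# running across the two routers (on whichever one is VRRP MASTER).
import subprocess
import sys

ROUTERS = ["router1.example.org", "router2.example.org"]  # placeholder hostnames


def radvd_count(host: str) -> int:
    """Count radvd processes on a host via ssh + pgrep."""
    result = subprocess.run(
        ["ssh", host, "pgrep", "-c", "-x", "radvd"],
        capture_output=True, text=True,
    )
    # pgrep -c prints the match count; it prints 0 and exits non-zero
    # when no process matches.
    try:
        return int(result.stdout.strip() or "0")
    except ValueError:
        return 0


def main() -> int:
    total = sum(radvd_count(host) for host in ROUTERS)
    if total == 1:
        print("OK: exactly one radvd process is running")
        return 0
    print(f"CRITICAL: expected exactly 1 radvd process, found {total}")
    return 2


if __name__ == "__main__":
    sys.exit(main())
```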