+blog: data center light status
parent 259f0e5d12
commit c028ca0893

1 changed file with 88 additions and 0 deletions

content/u/blog/datacenterlight-report-2021-03-15/contents.lr (new file)

@@ -0,0 +1,88 @@
title: Data Center Light Report 2021-03-15
---
pub_date: 2021-03-15
---
author: ungleich
---
twitter_handle: ungleich
---
_hidden: no
---
_discoverable: no
---
abstract:
Report of the state of Data Center Light
---
body:

In this report we describe the partial network outage that happened on
2021-03-15 between ca. 1700 and 1736 in place6.

## 1700: router reconfiguration

At 17:00:45 a planned, routine reconfiguration of two firewall/routing
systems was applied.

## 1702: outage detection

A few minutes later our team noticed that some virtualisation hosts
had become unreachable.

## 1702..1733: staff gathering and remote analysis

Our staff works from multiple locations (place5, place6, place7,
place10), and our core team agreed to meet onsite. During this time we
discovered that all routers were reachable, but all virtualisation
servers were unreachable. There was no power outage, no switch
reconfiguration and no visible network loop.

## 1733..1736: onsite debugging and resolution

Onsite we discovered that the backup router had an error in its
keepalived configuration that prevented the keepalived process from
restarting. As keepalived is only restarted manually, this bug had not
been spotted until we tried switching traffic to the backup router.

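A latent error of this kind only surfaces when the process is actually
restarted. Purely as an illustration, with a hypothetical hostname and
assuming SSH access, OpenRC's rc-service and pgrep on the router (not
our actual tooling), a restart-and-verify drill could look like this:

```
#!/usr/bin/env python3
# Illustrative restart-and-verify drill (hypothetical hostname, not our
# real tooling): restart keepalived on a router and fail loudly if the
# process did not come back, which is how a latent configuration error
# like this one shows up early.
import subprocess
import sys
import time

ROUTER = "router2.place6.example.com"  # assumed backup router name


def ssh(*cmd: str) -> int:
    return subprocess.call(["ssh", "-o", "ConnectTimeout=5", ROUTER, *cmd])


def main() -> int:
    # "rc-service" assumes an OpenRC based router; adjust to the local
    # service manager if needed.
    ssh("rc-service", "keepalived", "restart")
    time.sleep(5)  # give the daemon a moment to start
    if ssh("pgrep", "-x", "keepalived") != 0:
        print(f"keepalived is NOT running on {ROUTER} after restart",
              file=sys.stderr)
        return 1
    print(f"keepalived restarted cleanly on {ROUTER}")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```
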
After this bug was discovered and the fix applied to the production
router, restarting the keepalived process on the main router restored
network connectivity.

## Post mortem

Right after resolving the problem we continued investigating, as the
virtualisation infrastructure should have kept working without this
specific set of routers. The affected routers are only used for
routing and firewalling the servers' traffic and do not pass the
traffic of VMs.

However, these routers are responsible for assigning addresses to all
servers via router advertisements, using the software *radvd*.

Radvd is disabled by default on the routers and is only started by
keepalived if the router is the main router (based on the VRRP
protocol). So when we restarted keepalived, it also restarted
radvd. And from what we can see, radvd on the main router had started
crashing recently:

```
[20853563.567958] radvd[14796]: segfault at 47b94bc8 ip 00007f042397d53b sp 00007ffd47b94b28 error 4 in ld-musl-x86_64.so.1[7f0423942000+48000]
```

Thus restarting keepalived triggered a restart of radvd, which in
turn restored the connectivity of the servers. We are not fully aware
of which router advertisements were sent prior to the crash (because
we don't log network traffic), but there is an option for radvd to
announce that the router is leaving the network and that clients
should remove their IP addresses.

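For context, the coupling described above, where keepalived starts
radvd only while a router is the VRRP master, is typically wired up
via a notify script. The following is only a minimal sketch and not
our actual configuration; the rc-service call assumes OpenRC as the
service manager:

```
#!/usr/bin/env python3
# Illustrative VRRP notify script, not the production configuration.
# keepalived invokes notify scripts with the arguments
#   <TYPE> <NAME> <STATE>, e.g. "INSTANCE uplink MASTER",
# so radvd only runs on whichever router currently holds MASTER state.
import subprocess
import sys


def main() -> int:
    state = sys.argv[3] if len(sys.argv) > 3 else "UNKNOWN"
    action = "start" if state == "MASTER" else "stop"
    # "rc-service" assumes an OpenRC based router; adjust to the local
    # service manager if needed.
    return subprocess.call(["rc-service", "radvd", action])


if __name__ == "__main__":
    sys.exit(main())
```

With a hook like this, a crashed radvd is not restarted as long as the
VRRP state stays MASTER; restarting keepalived re-runs the notify
script and brings radvd back, which matches what we observed.
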
We have also seen that the servers logged an outage in the ceph
cluster, which is likely due to the loss of IP addresses in the
storage cluster, again caused by the crash of radvd. That in turn
caused VMs to have stale I/O.

## Follow up

So far our monitoring did not check whether a radvd process was
running; monitoring this might have led to a faster resolution of this
incident. We will add a new expression to our monitoring, "there is
one radvd process running on either of the two routers", to get
better insight in the future.

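A minimal sketch of such a check, assuming SSH access from the
monitoring host and the hypothetical router names router1 and router2
(the actual monitoring integration is left out):

```
#!/usr/bin/env python3
# Illustrative monitoring check (hypothetical hostnames, not our real
# setup): exactly one of the two routers should be running radvd.
import subprocess

ROUTERS = [
    "router1.place6.example.com",  # assumed names
    "router2.place6.example.com",
]


def radvd_running(host: str) -> bool:
    # "pgrep -x radvd" exits 0 if at least one radvd process exists.
    result = subprocess.run(
        ["ssh", "-o", "ConnectTimeout=5", host, "pgrep", "-x", "radvd"],
        capture_output=True,
    )
    return result.returncode == 0


def main() -> int:
    running = [host for host in ROUTERS if radvd_running(host)]
    if len(running) == 1:
        print(f"OK: radvd running on {running[0]}")
        return 0
    print(f"CRITICAL: radvd running on {len(running)} routers: {running}")
    return 2  # Nagios-style "critical" exit code


if __name__ == "__main__":
    raise SystemExit(main())
```

Whether this runs as a standalone script or as a rule in the existing
monitoring system, the point is the same: alert as soon as zero or two
radvd processes are running.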