...

The reason for the degradation was:

  • A network outage resulted in loss of connectivity between all Sensu cluster nodes (prod-poller-sensu-agent(01|02|03).geant.org).
  • This resulted in a broken cluster, causing Sensu to unschedule all checks until the cluster recovered back to 2 members (a sketch of how the cluster state can be inspected follows this list).
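As a reference, the broken cluster state described above can be observed with the standard Sensu Go CLI. This is a generic illustration rather than output captured during the incident, and the exact output format depends on the Sensu version in use.

# Run on any backend node (agent01/02/03) to inspect the embedded etcd cluster.
# A healthy cluster shows all three members and an elected leader; during this
# incident the members could not reach each other, so quorum was lost.
sensuctl cluster health
sensuctl cluster member-list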


The impact of this service degradation was:

  • No interfaces were polled between approximately 16:00 and 16:50 UTC, resulting in a loss of data on the Production BRIAN instance for this period.
    • After this period, sensu-agent01/02 recovered connectivity and the cluster was able to continue scheduling checks; however, agent03 remained in a bad state.
  • No interfaces were polled between approximately 10:20 and 10:35 UTC the following day, due to the reboot of the degraded Sensu cluster during recovery, again resulting in a loss of data on the Production BRIAN instance for this period.

...

Timeline

All times are in UTC

Date | Time (UTC) | Description

 

26 Feb 2023

15:53:23

The first evidence of this incident appeared in the logs of prod-poller-sensu-agent03.geant.org.

The log entries below show loss of connectivity from agent03 to agent01/02 for the clustering (etcd) component:


Feb 26 15:53:23 prod-poller-sensu-agent03 sensu-backend[1564]: {"component":"etcd","level":"warning","msg":"c3a4da5bb292624d stepped down to follower since quorum is not active","pkg":"raft","time":"2023-02-26T15:53:23Z"}
Feb 26 15:53:25 prod-poller-sensu-agent03 sensu-backend[1564]: {"component":"etcd","level":"warning","msg":"lost the TCP streaming connection with peer 1606cb21c84ecbd4 (stream MsgApp v2 reader)","pkg":"rafthttp","time":"2023-02-26T15:53:25Z"}
Feb 26 15:53:25 prod-poller-sensu-agent03 sensu-backend[1564]: {"component":"etcd","level":"error","msg":"failed to read 1606cb21c84ecbd4 on stream MsgApp v2 (read tcp 83.97.93.155:42166-\u003e83.97.95.12:2380: i/o timeout)","pkg":"rafthttp","time":"2023-02-26T15:53:25Z"}
Feb 26 15:53:26 prod-poller-sensu-agent03 sensu-backend[1564]: {"component":"etcd","level":"warning","msg":"lost the TCP streaming connection with peer 7f60c195c0cf08a7 (stream MsgApp v2 reader)","pkg":"rafthttp","time":"2023-02-26T15:53:26Z"}
Feb 26 15:53:26 prod-poller-sensu-agent03 sensu-backend[1564]: {"component":"etcd","level":"error","msg":"failed to read 7f60c195c0cf08a7 on stream MsgApp v2 (read tcp 83.97.93.155:45332-\u003e83.97.95.11:2380: i/o timeout)","pkg":"rafthttp","time":"2023-02-26T15:53:26Z"}
Feb 26 15:53:26 prod-poller-sensu-agent03 sensu-backend[1564]: {"component":"etcd","level":"warning","msg":"lost the TCP streaming connection with peer 1606cb21c84ecbd4 (stream Message reader)","pkg":"rafthttp","time":"2023-02-26T15:53:26Z"}
Feb 26 15:53:26 prod-poller-sensu-agent03 sensu-backend[1564]: {"component":"etcd","level":"warning","msg":"lost the TCP streaming connection with peer 7f60c195c0cf08a7 (stream Message reader)","pkg":"rafthttp","time":"2023-02-26T15:53:26Z"}

15:53:29
Logs show that subsequent health checks also failed to connect from agent03 to the other cluster members

Feb 26 15:53:29 prod-poller-sensu-agent03 sensu-backend[1564]: {"component":"etcd","level":"warning","msg":"health check for peer 1606cb21c84ecbd4 could not connect: dial tcp 83.97.95.12:2380: i/o timeout (prober \"ROUND_TRIPPER_RAFT_MESSAGE\")","pkg":"rafthttp","time":"2023-02-26T15:53:29Z"}
Feb 26 15:53:33 prod-poller-sensu-agent03 sensu-backend[1564]: {"component":"etcd","level":"warning","msg":"health check for peer 7f60c195c0cf08a7 could not connect: dial tcp 83.97.95.11:2380: i/o timeout (prober \"ROUND_TRIPPER_RAFT_MESSAGE\")","pkg":"rafthttp","time":"2023-02-26T15:53:33Z"}

15:55:11

Logs show that the etcd component (or at least part of it) was automatically killed on agent03 but then failed to restart due to a port-in-use error. agent03 remained in this state until manual recovery the following workday (a diagnostic sketch follows the log below).


Feb 26 15:55:11 prod-poller-sensu-agent03 sensu-backend[1564]: {"component":"backend","level":"error","msg":"error starting etcd: listen tcp 0.0.0.0:2380: bind: address already in use","time":"2023-02-26T15:55:11Z"}
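As a generic diagnostic (not taken from the incident itself), the commands below show how one could identify which process still held the etcd peer port; the port number 2380 comes from the log line above.

# Show which process is still bound to the etcd peer port reported in the error.
sudo ss -tlnp 'sport = :2380'
# Alternative, if lsof is installed:
sudo lsof -i :2380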

15:59:22

prod-poller-sensu-agent02 showed loss of connectivity to other cluster members, which caused it to go from leader to follower.


Feb 26 15:59:22 prod-poller-sensu-agent02 sensu-backend[1594]: {"component":"etcd","level":"warning","msg":"1606cb21c84ecbd4 stepped down to follower since quorum is not active","pkg":"raft","time":"2023-02-26T15:59:22Z"}

15:59:25

prod-poller-sensu-agent01 showed loss of the cluster leader, which caused checks running on it to be unscheduled.


Feb 26 15:59:25 prod-poller-sensu-agent01 sensu-backend[50997]: {"component":"store","error":"etcdserver: no leader","key":"/sensu.io/tessen","level":"warning","msg":"error from watch response","time":"2023-02-26T15:59:25Z"}
Feb 26 15:59:25 prod-poller-sensu-agent01 sensu-backend[50997]: {"component":"schedulerd","error":"etcdserver: no leader","interval":300,"level":"error","msg":"error scheduling check","name":"ifc-rt1.kie.ua.geant.net-xe-0-1-2","namespace":"default","scheduler_type":"round-robin interval","time":"2023-02-26T15:59:25Z"}
Feb 26 15:59:25 prod-poller-sensu-agent01 sensu-backend[50997]: {"component":"schedulerd","error":"etcdserver: no leader","interval":300,"level":"error","msg":"error scheduling check","name":"gwsd-KIFU-Cogent-b","namespace":"default","scheduler_type":"round-robin interval","time":"2023-02-26T15:59:25Z"}

16:15:25

The last health-check warning in the logs on agent03, possibly signifying that connectivity had been restored (this is an assumption).

Feb 26 16:15:25 prod-poller-sensu-agent03 sensu-backend[1564]: {"component":"etcd","level":"warning","msg":"health check for peer 7f60c195c0cf08a7 could not connect: dial tcp 83.97.95.11:2380: i/o timeout (prober \"ROUND_TRIPPER_RAFT_MESSAGE\")","pkg":"rafthttp","time":"2023-02-26T16:15:25Z"}
Feb 26 16:15:25 prod-poller-sensu-agent03 sensu-backend[1564]: {"component":"etcd","level":"warning","msg":"health check for peer 1606cb21c84ecbd4 could not connect: dial tcp 83.97.95.12:2380: i/o timeout (prober \"ROUND_TRIPPER_SNAPSHOT\")","pkg":"rafthttp","time":"2023-02-26T16:15:25Z"}
Feb 26 16:15:25 prod-poller-sensu-agent03 sensu-backend[1564]: {"component":"etcd","level":"warning","msg":"health check for peer 1606cb21c84ecbd4 could not connect: dial tcp 83.97.95.12:2380: i/o timeout (prober \"ROUND_TRIPPER_RAFT_MESSAGE\")","pkg":"rafthttp","time":"2023-02-26T16:15:25Z"}
Feb 26 16:15:25 prod-poller-sensu-agent03 sensu-backend[1564]: {"component":"etcd","level":"warning","msg":"health check for peer 7f60c195c0cf08a7 could not connect: dial tcp 83.97.95.11:2380: i/o timeout (prober \"ROUND_TRIPPER_SNAPSHOT\")","pkg":"rafthttp","time":"2023-02-26T16:15:25Z"}

16:23:21

The etcd component was successfully restarted on prod-poller-sensu-agent02.

Feb 26 16:23:21 prod-poller-sensu-agent02 sensu-backend[1594]: {"component":"etcd","level":"warning","msg":"serving insecure client requests on 83.97.95.12:2379, this is strongly discouraged!","pkg":"embed","time":"2023-02-26T16:23:21Z"} 

16:32:08

The etcd component restarted successfully on prod-poller-sensu-agent01 and connectivity was restored between agent01/02, after which checks were scheduled again and functionality was restored over the following 20-30 minutes (see the verification sketch after the log below).

Feb 26 16:32:08 prod-poller-sensu-agent01 sensu-backend[50997]: {"component":"etcd","level":"warning","msg":"serving insecure client requests on 83.97.95.11:2379, this is strongly discouraged!","pkg":"embed","time":"2023-02-26T16:32:08Z"}
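As an illustration of how the recovery could be verified at this point (not output captured during the incident), the standard sensuctl commands below list the configured checks and their most recent events; recent event timestamps indicate that scheduling has resumed.

# List configured checks and the latest events they have produced.
sensuctl check list
sensuctl event list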

09:50:00 (following day)

Bjarke Madsen (NORDUnet) noticed that prod-poller-sensu-agent03 was still in a broken state, having been alerted by BRIAN email alerts from the previous day showing connection issues to the agent03 API (see the probe sketch after the traceback below):

Traceback (most recent call last):
  File "/opt/monitoring-proxies/brian/venv/lib/python3.9/site-packages/requests/adapters.py", line 489, in send
    resp = conn.urlopen(
  File "/opt/monitoring-proxies/brian/venv/lib/python3.9/site-packages/urllib3/connectionpool.py", line 785, in urlopen
    retries = retries.increment(
  File "/opt/monitoring-proxies/brian/venv/lib/python3.9/site-packages/urllib3/util/retry.py", line 592, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='prod-poller-sensu-agent03.geant.org', port=8080): Max retries exceeded with url: /auth (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fc9ae3e56d0>: Failed to establish a new connection: [Errno 111] Connection refused'))
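For reference, the failure the alert points at can be reproduced by probing the Sensu backend API directly; the hostname and port below are taken from the traceback, while the use of the /health endpoint and the -k flag (skipping certificate verification) are illustrative assumptions.

# Probe the Sensu backend API that the monitoring proxy could not reach.
curl -ksS https://prod-poller-sensu-agent03.geant.org:8080/health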

10:11:00
The procedure for restoring the BRIAN Sensu cluster to working order was followed: Sensu Cluster Disaster Recovery. A hedged sketch of the kind of steps such a procedure involves is shown below; the linked runbook is authoritative.
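This sketch assumes the standard Sensu Go packaging (systemd unit sensu-backend, state directory /var/lib/sensu/sensu-backend); exact paths, flags and ordering are assumptions and may differ from the internal runbook.

# Stop the backend on every cluster member (agent01, agent02, agent03).
sudo systemctl stop sensu-backend

# On the broken member(s), clear the corrupted embedded etcd state.
# The path below is the default Sensu Go state directory; adjust if configured differently.
sudo rm -rf /var/lib/sensu/sensu-backend/etcd

# Start the backends again so they re-form the cluster, then re-create the
# polling checks via brian-polling-manager (see the /update call further below).
sudo systemctl start sensu-backend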

10:35:00

All checks were re-added by manually calling the brian-polling-manager /update API endpoint until all polling checks had been created on the restored Sensu cluster (an illustrative call is sketched below).

At this point the cluster was restored to full functionality and interfaces were being polled again.
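For illustration only: the call below shows roughly how the /update endpoint could be invoked. Only the /update path comes from this report; the hostname, port and HTTP method are placeholders/assumptions.

# Ask brian-polling-manager to (re)create all polling checks in Sensu.
# Replace <brian-polling-manager-host>:<port> with the real service address;
# the HTTP method is assumed. Repeat until all checks are present.
curl -sS -X POST https://<brian-polling-manager-host>:<port>/update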

Proposed Solution

  • TBD