Incident Description

The BRIAN Sensu cluster responsible for scheduling SNMP polling checks had an outage of approximately 1 hour in total, spread across two periods on Sunday and Monday. No counters were fetched from routers and saved to InfluxDB during this time.


The reason for the degradation:

  • Loss of connectivity between all Sensu cluster nodes (prod-poller-sensu-agent(01|02|03).geant.org).
  • This broke the cluster's etcd quorum, causing Sensu to unschedule all checks until the cluster recovered to two of its three members (a health-check sketch follows this list).
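With three backend nodes, the embedded etcd store needs a quorum of at least two reachable members; once all three nodes lost connectivity to each other, no leader could be maintained and check scheduling stopped. The sketch below, assuming the stock Sensu Go API layout (HTTPS on port 8080 with an unauthenticated GET /health endpoint), shows how cluster health could be polled from outside; the hostnames are taken from this report and everything else is illustrative.

import json
import requests

# The three Sensu backend nodes named in this report.
BACKENDS = [
    "prod-poller-sensu-agent01.geant.org",
    "prod-poller-sensu-agent02.geant.org",
    "prod-poller-sensu-agent03.geant.org",
]

def check_cluster_health() -> None:
    """Print each backend's view of etcd cluster health.

    Assumes the stock Sensu Go API (GET /health on port 8080, no auth);
    adjust TLS verification and port to the local deployment.
    """
    for host in BACKENDS:
        url = f"https://{host}:8080/health"
        try:
            resp = requests.get(url, timeout=5, verify=False)
            print(host, json.dumps(resp.json(), indent=2))
        except requests.RequestException as exc:
            # A connection error here mirrors the "Connection refused"
            # alerts raised against agent03 during this incident.
            print(host, "unreachable:", exc)

if __name__ == "__main__":
    check_cluster_health()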


The impact of this service degradation was:

  • No interfaces were polled between approximately 16:00 and 16:50 UTC on Sunday, resulting in loss of data on the Production BRIAN instance for this period.
    • After this period, sensu-agent01/02 recovered connectivity and the cluster was able to continue scheduling checks; agent03, however, remained in a broken state.
  • No interfaces were polled between approximately 10:20 and 10:35 UTC on Monday, resulting in loss of data on the Production BRIAN instance for this period, due to the reboot of the degraded Sensu cluster during recovery.


Incident severity: CRITICAL (temporary service outage)

Data loss: YES

Total duration of incident: ~18 hours


Timeline

All times are in UTC


15:53:23 (Sunday, 26 February 2023)

The first evidence of this incident appeared in the logs of prod-poller-sensu-agent03.geant.org, showing loss of connectivity from agent03 to agent01/02 in the etcd clustering component:

Feb 26 15:53:23 prod-poller-sensu-agent03 sensu-backend[1564]: {"component":"etcd","level":"warning","msg":"c3a4da5bb292624d stepped down to follower since quorum is not active","pkg":"raft","time":"2023-02-26T15:53:23Z"}
Feb 26 15:53:25 prod-poller-sensu-agent03 sensu-backend[1564]: {"component":"etcd","level":"warning","msg":"lost the TCP streaming connection with peer 1606cb21c84ecbd4 (stream MsgApp v2 reader)","pkg":"rafthttp","time":"2023-02-26T15:53:25Z"}
Feb 26 15:53:25 prod-poller-sensu-agent03 sensu-backend[1564]: {"component":"etcd","level":"error","msg":"failed to read 1606cb21c84ecbd4 on stream MsgApp v2 (read tcp 83.97.93.155:42166-\u003e83.97.95.12:2380: i/o timeout)","pkg":"rafthttp","time":"2023-02-26T15:53:25Z"}
Feb 26 15:53:26 prod-poller-sensu-agent03 sensu-backend[1564]: {"component":"etcd","level":"warning","msg":"lost the TCP streaming connection with peer 7f60c195c0cf08a7 (stream MsgApp v2 reader)","pkg":"rafthttp","time":"2023-02-26T15:53:26Z"}
Feb 26 15:53:26 prod-poller-sensu-agent03 sensu-backend[1564]: {"component":"etcd","level":"error","msg":"failed to read 7f60c195c0cf08a7 on stream MsgApp v2 (read tcp 83.97.93.155:45332-\u003e83.97.95.11:2380: i/o timeout)","pkg":"rafthttp","time":"2023-02-26T15:53:26Z"}
Feb 26 15:53:26 prod-poller-sensu-agent03 sensu-backend[1564]: {"component":"etcd","level":"warning","msg":"lost the TCP streaming connection with peer 1606cb21c84ecbd4 (stream Message reader)","pkg":"rafthttp","time":"2023-02-26T15:53:26Z"}
Feb 26 15:53:26 prod-poller-sensu-agent03 sensu-backend[1564]: {"component":"etcd","level":"warning","msg":"lost the TCP streaming connection with peer 7f60c195c0cf08a7 (stream Message reader)","pkg":"rafthttp","time":"2023-02-26T15:53:26Z"}

15:53:29
Logs show that subsequent health checks from agent03 to the other cluster members also failed to connect:

Feb 26 15:53:29 prod-poller-sensu-agent03 sensu-backend[1564]: {"component":"etcd","level":"warning","msg":"health check for peer 1606cb21c84ecbd4 could not connect: dial tcp 83.97.95.12:2380: i/o timeout (prober \"ROUND_TRIPPER_RAFT_MESSAGE\")","pkg":"rafthttp","time":"2023-02-26T15:53:29Z"}
Feb 26 15:53:33 prod-poller-sensu-agent03 sensu-backend[1564]: {"component":"etcd","level":"warning","msg":"health check for peer 7f60c195c0cf08a7 could not connect: dial tcp 83.97.95.11:2380: i/o timeout (prober \"ROUND_TRIPPER_RAFT_MESSAGE\")","pkg":"rafthttp","time":"2023-02-26T15:53:33Z"}

15:55:11

Logs show that (part of?) the etcd component was automatically killed on agent03, but it failed to restart due to an "address already in use" error on port 2380. agent03 remained in this state until manual recovery the following workday.


Feb 26 15:55:11 prod-poller-sensu-agent03 sensu-backend[1564]: {"component":"backend","level":"error","msg":"error starting etcd: listen tcp 0.0.0.0:2380: bind: address already in use","time":"2023-02-26T15:55:11Z"}
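The "address already in use" error suggests something on agent03 was still holding the etcd peer port 2380 when the restart was attempted. As a minimal, purely illustrative sketch (assuming the standard etcd peer port, and not part of the recovery procedure that was actually followed), the condition can be checked with a simple bind test:

import socket

def port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    """Try to bind the given TCP port; a failure with 'Address already in use'
    is the same condition the embedded etcd hit on agent03."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
            return True
        except OSError as exc:  # e.g. [Errno 98] Address already in use
            print(f"bind on {host}:{port} failed: {exc}")
            return False

if __name__ == "__main__":
    # 2380 is the etcd peer port seen in the error above.
    print("port 2380 free:", port_is_free(2380))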

15:59:22

prod-poller-sensu-agent02 also showed loss of connectivity to the other cluster members, which caused it to step down from leader to follower:


Feb 26 15:59:22 prod-poller-sensu-agent02 sensu-backend[1594]: {"component":"etcd","level":"warning","msg":"1606cb21c84ecbd4 stepped down to follower since quorum is not active","pkg":"raft","time":"2023-02-26T15:59:22Z"}

15:59:25

prod-poller-sensu-agent01 reported loss of the cluster leader, which caused the checks it was scheduling to be unscheduled:


Feb 26 15:59:25 prod-poller-sensu-agent01 sensu-backend[50997]: {"component":"store","error":"etcdserver: no leader","key":"/sensu.io/tessen","level":"warning","msg":"error from watch response","time":"2023-02-26T15:59:25Z"}
Feb 26 15:59:25 prod-poller-sensu-agent01 sensu-backend[50997]: {"component":"schedulerd","error":"etcdserver: no leader","interval":300,"level":"error","msg":"error scheduling check","name":"ifc-rt1.kie.ua.geant.net-xe-0-1-2","namespace":"default","scheduler_type":"round-robin interval","time":"2023-02-26T15:59:25Z"}
Feb 26 15:59:25 prod-poller-sensu-agent01 sensu-backend[50997]: {"component":"schedulerd","error":"etcdserver: no leader","interval":300,"level":"error","msg":"error scheduling check","name":"gwsd-KIFU-Cogent-b","namespace":"default","scheduler_type":"round-robin interval","time":"2023-02-26T15:59:25Z"}

16:15:25

The last health-check warning in the logs on agent03, possibly signifying restored connectivity (this is an assumption):

Feb 26 16:15:25 prod-poller-sensu-agent03 sensu-backend[1564]: {"component":"etcd","level":"warning","msg":"health check for peer 7f60c195c0cf08a7 could not connect: dial tcp 83.97.95.11:2380: i/o timeout (prober \"ROUND_TRIPPER_RAFT_MESSAGE\")","pkg":"rafthttp","time":"2023-02-26T16:15:25Z"}
Feb 26 16:15:25 prod-poller-sensu-agent03 sensu-backend[1564]: {"component":"etcd","level":"warning","msg":"health check for peer 1606cb21c84ecbd4 could not connect: dial tcp 83.97.95.12:2380: i/o timeout (prober \"ROUND_TRIPPER_SNAPSHOT\")","pkg":"rafthttp","time":"2023-02-26T16:15:25Z"}
Feb 26 16:15:25 prod-poller-sensu-agent03 sensu-backend[1564]: {"component":"etcd","level":"warning","msg":"health check for peer 1606cb21c84ecbd4 could not connect: dial tcp 83.97.95.12:2380: i/o timeout (prober \"ROUND_TRIPPER_RAFT_MESSAGE\")","pkg":"rafthttp","time":"2023-02-26T16:15:25Z"}
Feb 26 16:15:25 prod-poller-sensu-agent03 sensu-backend[1564]: {"component":"etcd","level":"warning","msg":"health check for peer 7f60c195c0cf08a7 could not connect: dial tcp 83.97.95.11:2380: i/o timeout (prober \"ROUND_TRIPPER_SNAPSHOT\")","pkg":"rafthttp","time":"2023-02-26T16:15:25Z"}

16:23:21

The etcd component was restarted successfully on prod-poller-sensu-agent02:

Feb 26 16:23:21 prod-poller-sensu-agent02 sensu-backend[1594]: {"component":"etcd","level":"warning","msg":"serving insecure client requests on 83.97.95.12:2379, this is strongly discouraged!","pkg":"embed","time":"2023-02-26T16:23:21Z"} 

16:32:08

The etcd component restarted successfully on prod-poller-sensu-agent01 and connectivity was restored between agent01/02, after which checks were scheduled again and functionality was restored over the following 20-30 minutes:

Feb 26 16:32:08 prod-poller-sensu-agent01 sensu-backend[50997]: {"component":"etcd","level":"warning","msg":"serving insecure client requests on 83.97.95.11:2379, this is strongly discouraged!","pkg":"embed","time":"2023-02-26T16:32:08Z"}
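To verify that check results were flowing again once the cluster re-formed, the Sensu API can be queried for recent events. The sketch below assumes the stock Sensu Go API (a basic-auth GET /auth for an access token, then the core/v2 events endpoint) and uses placeholder credentials; it is illustrative rather than the exact verification performed during the incident.

import requests

API = "https://prod-poller-sensu-agent01.geant.org:8080"
USERNAME = "admin"        # placeholder credentials, not the real ones
PASSWORD = "change-me"

def list_recent_events(namespace: str = "default") -> None:
    """Fetch an access token via /auth, then list events in the namespace.

    Assumes the stock Sensu Go API; TLS verification is disabled here only
    to keep the example short.
    """
    auth = requests.get(f"{API}/auth", auth=(USERNAME, PASSWORD),
                        timeout=10, verify=False)
    auth.raise_for_status()
    token = auth.json()["access_token"]

    headers = {"Authorization": f"Bearer {token}"}
    events = requests.get(f"{API}/api/core/v2/namespaces/{namespace}/events",
                          headers=headers, timeout=30, verify=False)
    events.raise_for_status()
    for event in events.json():
        check = event.get("check", {})
        print(check.get("metadata", {}).get("name"), check.get("status"))

if __name__ == "__main__":
    list_recent_events()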

09:50:00 (Monday, 27 February 2023)

Bjarke Madsen (NORDUnet) noticed that prod-poller-sensu-agent03 was in a broken state, having been alerted by BRIAN email alerts showing connection issues to its API from the day before:

Traceback (most recent call last):
  File "/opt/monitoring-proxies/brian/venv/lib/python3.9/site-packages/requests/adapters.py", line 489, in send
    resp = conn.urlopen(
  File "/opt/monitoring-proxies/brian/venv/lib/python3.9/site-packages/urllib3/connectionpool.py", line 785, in urlopen
    retries = retries.increment(
  File "/opt/monitoring-proxies/brian/venv/lib/python3.9/site-packages/urllib3/util/retry.py", line 592, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='prod-poller-sensu-agent03.geant.org', port=8080): Max retries exceeded with url: /auth (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fc9ae3e56d0>: Failed to establish a new connection: [Errno 111] Connection refused'))

10:11:00

The procedure for restoring the BRIAN Sensu cluster to working order (Sensu Cluster Disaster Recovery) was followed.

10:35:00

All polling checks were re-added by manually calling the brian-polling-manager /update API endpoint until every check was present in the restored Sensu cluster.
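A minimal sketch of this step follows; the base URL, HTTP method, and response handling are assumptions, as the report only names the /update endpoint.

import time
import requests

# Hypothetical base URL for the brian-polling-manager service; only the
# /update endpoint name comes from this report.
BPM_UPDATE_URL = "https://brian-polling-manager.example.org/update"

def refresh_polling_checks(rounds: int = 5, pause_seconds: int = 30) -> None:
    """Call /update several times, printing each response.

    During the incident this was done by hand until all polling checks had
    been re-added to the Sensu cluster.
    """
    for round_no in range(1, rounds + 1):
        resp = requests.post(BPM_UPDATE_URL, timeout=120)
        resp.raise_for_status()
        print(f"round {round_no}: HTTP {resp.status_code} {resp.text[:200]}")
        time.sleep(pause_seconds)

if __name__ == "__main__":
    refresh_polling_checks()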

At this point the cluster was restored to full functionality and interfaces were being polled again.

Proposed Solution

  • TBD