Incident Description

The BRIAN Sensu cluster, which schedules the SNMP polling checks, was out of service for approximately 1 hour in total, spanning Sunday/Monday. During this time no counters were fetched from routers or saved in InfluxDB.


The reason for degradation:

  • A network outage caused loss of connectivity between all Sensu cluster nodes (prod-poller-sensu-agent(01|02|03).geant.org).
  • This resulted in a complete loss of clustering, causing Sensu to unschedule all checks.


The impact of this service degradation was:

  • No interfaces were polled between approximately 16:00 and 16:50 UTC, resulting in loss of data on the Production BRIAN instance.
  • No interfaces were polled between approximately 10:20 and 10:35 UTC while the degraded Sensu cluster was rebooted, resulting in loss of data on the Production BRIAN instance.
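
As a rough cross-check of the "approximately 1 hour" figure, the two polling gaps listed above can be added up. This is a minimal sketch: only the times of day are taken from this report, the names GAPS and gap_length are illustrative, and each gap is assumed to start and end on the same calendar day.

```python
from datetime import datetime, timedelta

# Polling gaps as listed in the impact section (UTC, approximate).
# Assumption: each gap starts and ends on the same calendar day.
GAPS = [("16:00", "16:50"), ("10:20", "10:35")]


def gap_length(start, end):
    """Return the length of one gap given 'HH:MM' start and end times."""
    fmt = "%H:%M"
    return datetime.strptime(end, fmt) - datetime.strptime(start, fmt)


# Sum the individual gaps, starting from a zero-length timedelta.
total = sum((gap_length(s, e) for s, e in GAPS), timedelta())
print(total)  # 1:05:00
```

The sum is 65 minutes of missing data, which is consistent with the outage length stated in the description.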


Incident severity: CRITICAL (temporary service outage)

Data loss: YES

Total duration of incident: ~18 hours


Timeline

All times are in UTC

Date | Time (UTC) | Description

 

30 May 2022 12:52:37

The first evidence of this incident appeared in the logs of prod-poller-processor.geant.org. The task remove_spikes_interface_rates, named in the second log entry, is one of several stream functions in the data processing pipeline that are required for the data displayed in BRIAN.

May 30 12:52:37 prod-poller-processor kapacitord[124994]: ts=2022-05-30T12:52:37.802Z lvl=error msg="failed to write points to InfluxDB" service=kapacitor task_master=main task=remove_spikes_gwsd_rates node=influxdb_out3 err=timeout

May 30 12:52:38 prod-poller-processor kapacitord[124994]: ts=2022-05-30T12:52:38.069Z lvl=error msg="encountered error" service=kapacitor task_master=main task=remove_spikes_interface_rates node=remove_spikes2 err="keepalive timedout, last keepalive received was: 2022-05-30 12:52:28.069298439 +0000 UTC"

31 May 2022 11:56

Keith Slater informed APMs - BRIAN is back to normal operation.
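
When reviewing an incident like this one, it can help to pull the failing task and error out of the Kapacitor log entries quoted in the timeline. The following is a minimal sketch, assuming the entries keep the syslog prefix followed by key=value pairs as shown above; parse_kapacitor_line is a hypothetical helper and not part of any BRIAN or Kapacitor tooling.

```python
import shlex


def parse_kapacitor_line(line):
    """Split one syslog-wrapped Kapacitor entry into a dict of its key=value fields."""
    # Drop the syslog prefix ("May 30 12:52:37 host kapacitord[pid]: ").
    _, _, payload = line.partition(": ")
    fields = {}
    # shlex.split keeps double-quoted values (msg="...", err="...") together.
    for token in shlex.split(payload):
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


# First log entry from the timeline, copied verbatim.
SAMPLE = (
    'May 30 12:52:37 prod-poller-processor kapacitord[124994]: '
    'ts=2022-05-30T12:52:37.802Z lvl=error '
    'msg="failed to write points to InfluxDB" service=kapacitor '
    'task_master=main task=remove_spikes_gwsd_rates node=influxdb_out3 err=timeout'
)

fields = parse_kapacitor_line(SAMPLE)
print(fields["ts"], fields["task"], fields["err"])
```

Grouping the parsed entries by task and err would show which stream functions were affected first, although that analysis is not part of this report.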

Proposed Solution

  • TBD