
Incident Description

BRIAN bitrate traffic was not computed or saved for approximately 21 hours.


The reason for the degradation was:

  • Failure to connect or write data to InfluxDB
  • cf. IT incident: 30052022


The impact of this service degradation was:

  • Data points between 12:50 and 09:30 UTC were temporarily not visible in the BRIAN GUI


Incident severity: CRITICAL (intermittent service outage)

Data loss: NO

Total duration of incident: 21 hours, ongoing (as of 22:22 UTC)


Timeline

All times are in UTC

Date/Time (UTC) | Description

12:52:37

The first evidence of this incident appeared in the logs of prod-poller-processor.geant.org:


May 30 12:52:37 prod-poller-processor kapacitord[124994]: ts=2022-05-30T12:52:37.802Z lvl=error msg="failed to write points to InfluxDB" service=kapacitor task_master=main task=remove_spikes_gwsd_rates node=influxdb_out3 err=timeout

May 30 12:52:38 prod-poller-processor kapacitord[124994]: ts=2022-05-30T12:52:38.069Z lvl=error msg="encountered error" service=kapacitor task_master=main task=remove_spikes_interface_rates node=remove_spikes2 err="keepalive timedout, last keepalive received was: 2022-05-30 12:52:28.069298439 +0000 UTC"

 

12:53
Continuous failures writing to InfluxDB, or resolving its hostname, for example:


May 31 00:49:08 prod-poller-processor kapacitord[54933]: ts=2022-05-31T00:49:08.133Z lvl=error msg="failed to write points to InfluxDB" service=kapacitor task_master=main task=interface_rates node=influxdb_out12 err=timeout

May 31 01:26:44 prod-poller-processor kapacitord[54933]: ts=2022-05-31T01:26:44.163Z lvl=error msg="failed to connect to InfluxDB, retrying..." service=influxdb cluster=read err="Get https://influx-cluster.service.ha.geant.org:8086/ping: dial tcp: lookup influx-cluster.service.ha.geant.org on 83.97.93.200:53: no such host"
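The second log line shows that the failure was not only an InfluxDB write timeout but also a DNS lookup failure for influx-cluster.service.ha.geant.org against the resolver at 83.97.93.200. A minimal pre-write health check along these lines could have surfaced the resolution failure earlier. This is an illustrative sketch, not part of BRIAN or kapacitor; the hostname defaults to localhost here so the snippet is self-contained, and `INFLUX_HOST` would be set to the real cluster name in production:

```shell
# Hypothetical DNS health check (illustration only, not taken from the
# production host). In production INFLUX_HOST would be
# influx-cluster.service.ha.geant.org; localhost is used as a
# self-contained default.
HOST="${INFLUX_HOST:-localhost}"
if getent hosts "$HOST" > /dev/null 2>&1; then
    echo "DNS OK: $HOST"
else
    echo "DNS FAILURE: $HOST" >&2
    exit 1
fi
```

A check like this, run by monitoring on the poller host, would distinguish "InfluxDB slow/unreachable" from "hostname does not resolve", which are the two distinct error modes seen in the log excerpts above.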


 

08:12


Kernel detected IO errors while flushing file data:


May 31 07:02:47 prod-poller-processor kernel: [1770549.826770] JBD2: Detected IO errors while flushing file data on dm-4-8

 

08:12

System was stopped

 

08:26:55

System rebooted and started up

 

08:26:55

haproxy failed to start because it could not resolve prod-inventory-provider0x.geant.org:


May 31 08:26:55 prod-poller-processor haproxy[976]: [ALERT] 150/082655 (976) : parsing [/etc/haproxy/haproxy.cfg:30] : 'server prod-inventory-provider01.geant.org' : could not resolve address 'prod-inventory-provider01.geant.org'.

May 31 08:26:55 prod-poller-processor haproxy[976]: [ALERT] 150/082655 (976) : parsing [/etc/haproxy/haproxy.cfg:31] : 'server prod-inventory-provider02.geant.org' : could not resolve address 'prod-inventory-provider02.geant.org'.
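haproxy aborted at config-parse time because the server addresses could not be resolved during boot. One possible hardening, sketched below, is haproxy's `init-addr` server option together with a runtime `resolvers` section, which lets haproxy start with the server marked DOWN and retry resolution afterwards. This is a sketch only: it assumes haproxy >= 1.8, and the backend name and port are illustrative (the real haproxy.cfg is not shown in this report; only the server hostnames and the nameserver 83.97.93.200 appear in the logs):

```
# Sketch of a possible haproxy.cfg hardening (assumptions: haproxy >= 1.8,
# backend name and port :443 are illustrative).
# "init-addr last,libc,none" lets haproxy finish parsing and start the
# server without an address when DNS is down at boot; the "resolvers"
# section then retries resolution at runtime.
resolvers local_dns
    nameserver ns1 83.97.93.200:53
    resolve_retries 30
    timeout retry 1s

backend inventory_provider
    server prod-inventory-provider01.geant.org prod-inventory-provider01.geant.org:443 resolvers local_dns init-addr last,libc,none check
    server prod-inventory-provider02.geant.org prod-inventory-provider02.geant.org:443 resolvers local_dns init-addr last,libc,none check
```

With this in place, a transient DNS outage at boot would degrade the backend instead of preventing haproxy from starting at all.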

 

20:50

Kapacitor tasks failed to run because the haproxy service wasn't running, for example:

May 31 08:27:07 prod-poller-processor kapacitord[839]: ts=2022-05-31T08:27:07.962Z lvl=info msg="UDF log" service=kapacitor task_master=main task=service_enrichment node=inventory_enrichment2 text="urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=8080): Max retries exceeded with url: /poller/interfaces (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f749f4a2978>: Failed to establish a new connection: [Errno 111] Connection refused',))"
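The kapacitor UDF fails here because it proxies inventory-provider requests through haproxy on localhost:8080, and haproxy never started. Since both daemons appear to run under systemd on this host, one mitigation worth considering is an ordering dependency so kapacitor only starts once haproxy is up. This drop-in is a hypothetical sketch, not taken from the host's actual unit files:

```
# Hypothetical systemd drop-in (illustration only):
# /etc/systemd/system/kapacitor.service.d/after-haproxy.conf
[Unit]
# Start kapacitor after haproxy, and pull haproxy in when kapacitor starts.
After=haproxy.service
Wants=haproxy.service
```

Note this only fixes ordering; if haproxy fails outright (as in this incident), kapacitor tasks would still need haproxy itself to tolerate the DNS failure.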




Proposed Solution

  • The core issue appears to be related to VMware; IT needs to provide a solution