Incident Description

New BRIAN bitrate traffic data was not available for approximately 21 hours.


The reason for the degradation:

  • A VMware storage device failure caused disk/filesystem corruption and network DNS failures, which prevented Kapacitor from processing network counters and writing the results to InfluxDB.


The impact of this service degradation was:

  • Data points between 12:50 UTC on 30 May and 09:30 UTC on 31 May were temporarily not visible in the BRIAN GUI


Incident severity: CRITICAL (temporary service outage)

Data loss: NO

Total duration of incident: 21 hours


Timeline

All times are in UTC

Date/Time (UTC) / Description

30 May 2022, 12:52:37

The first evidence of this incident appeared in the logs of prod-poller-processor.geant.org. remove_spikes_interface_rates is one of several stream functions in the data processing pipeline required for the data displayed in BRIAN.

May 30 12:52:37 prod-poller-processor kapacitord[124994]: ts=2022-05-30T12:52:37.802Z lvl=error msg="failed to write points to InfluxDB" service=kapacitor task_master=main task=remove_spikes_gwsd_rates node=influxdb_out3 err=timeout

May 30 12:52:38 prod-poller-processor kapacitord[124994]: ts=2022-05-30T12:52:38.069Z lvl=error msg="encountered error" service=kapacitor task_master=main task=remove_spikes_interface_rates node=remove_spikes2 err="keepalive timedout, last keepalive received was: 2022-05-30 12:52:28.069298439 +0000 UTC"

 

30 May 2022, afternoon

Several performance issues started being reported across the network:

  • EMS was failing to resolve the influx cluster hostnames
  • Puppet was failing, or taking a very long time to complete, on many VMs

 

30 May 2022, 19:08

Keith Slater (and others) alerted on the #brian channel that data was missing in the BRIAN GUI.

 

30 May 2022, 20:30

Bjarke Madsen replied that it seemed related to the service problems seen earlier in the day.

 

30 May 2022, 21:12

Massimiliano Adamo replied on #swd-private that we had raised an issue with VMware regarding a storage device failure.

 

30 May 2022, 23:28

Linda Ness sent a mail to gn4-3-all@lists.geant.org indicating that several services were down.

 

30 May 2022, from 12:53 onwards

For the duration of this event, Kapacitor continuously logged failures when writing to or communicating with InfluxDB, as below:


May 31 00:49:08 prod-poller-processor kapacitord[54933]: ts=2022-05-31T00:49:08.133Z lvl=error msg="failed to write points to InfluxDB" service=kapacitor task_master=main task=interface_rates node=influxdb_out12 err=timeout

May 31 01:26:44 prod-poller-processor kapacitord[54933]: ts=2022-05-31T01:26:44.163Z lvl=error msg="failed to connect to InfluxDB, retrying..." service=influxdb cluster=read err="Get https://influx-cluster.service.ha.geant.org:8086/ping: dial tcp: lookup influx-cluster.service.ha.geant.org on 83.97.93.200:53: no such host"


This means that while Kapacitor was receiving live network counters in real time, the results of the rate calculations weren't being saved to InfluxDB.
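
The sketch below illustrates, in simplified form, the kind of per-interface rate calculation and InfluxDB write that these Kapacitor tasks perform. It is an illustration only, not the production implementation: the real logic runs inside Kapacitor stream tasks and UDFs, and the database and field names used here are assumptions.

# Illustrative sketch only: the production pipeline does this inside Kapacitor
# stream tasks, not a standalone script. Database and field names are assumptions.
from influxdb import InfluxDBClient  # InfluxDB 1.x Python client


def counter_to_rate(prev_value, curr_value, interval_seconds):
    """Convert two successive counter samples into a per-second rate."""
    delta = curr_value - prev_value
    if delta < 0:  # counter wrapped or was reset; skip this sample
        return None
    return delta / interval_seconds


def write_rate(client, interface, timestamp, rate_bps):
    point = {
        "measurement": "interface_rates",  # one of the measurements listed later in this report
        "tags": {"interface": interface},
        "time": timestamp,
        "fields": {"bits_per_second": float(rate_bps)},
    }
    try:
        client.write_points([point])
    except Exception as exc:
        # This is the failure mode during the incident: the write to InfluxDB
        # timed out, so the computed rate point was never stored, even though
        # new counter samples kept arriving.
        print(f"failed to write points to InfluxDB: {exc}")


client = InfluxDBClient(host="influx-cluster.service.ha.geant.org", port=8086,
                        ssl=True, database="brian")  # database name is a placeholder
rate = counter_to_rate(prev_value=1_000_000, curr_value=1_600_000, interval_seconds=300)
if rate is not None:
    write_rate(client, "example-interface", "2022-05-31T00:49:08Z", rate)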

 


31 May 2022, 02:34-08:11

Many disk I/O failures were logged over the duration of the event, indicating filesystem/disk corruption. For example:

May 31 02:34:59 prod-poller-processor kernel: [1754481.561423] sd 0:0:0:0: [sda] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=1084s

May 31 02:34:59 prod-poller-processor kernel: [1754481.561427] sd 0:0:0:0: [sda] CDB: Write(10) 2a 00 02 05 63 f0 00 00 08 00

May 31 02:34:59 prod-poller-processor kernel: [1754481.561442] EXT4-fs warning (device dm-0): ext4_end_bio:302: I/O error -5 writing to inode 149650 (offset 0 size 0 starting block 702078)

May 31 02:34:59 prod-poller-processor kernel: [1754481.561494] sd 0:0:0:0: [sda] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=1084s

May 31 02:34:59 prod-poller-processor kernel: [1754481.561497] sd 0:0:0:0: [sda] CDB: Write(10) 2a 00 01 ef d3 18 00 00 10 00

May 31 02:34:59 prod-poller-processor kernel: [1754481.561529] sd 0:0:0:0: [sda] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=1084s

May 31 02:34:59 prod-poller-processor kernel: [1754481.561532] sd 0:0:0:0: [sda] CDB: Write(10) 2a 00 01 af dd b0 00 00 10 00

May 31 02:34:59 prod-poller-processor kernel: [1754481.561550] sd 0:0:0:0: [sda] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=1084s

May 31 02:34:59 prod-poller-processor kernel: [1754481.561553] sd 0:0:0:0: [sda] CDB: Write(10) 2a 00 01 f3 b3 d0 00 00 08 00

May 31 02:34:59 prod-poller-processor kernel: [1754481.561567] sd 0:0:0:0: [sda] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=1086s

May 31 02:34:59 prod-poller-processor kernel: [1754481.561569] sd 0:0:0:0: [sda] CDB: Write(10) 2a 00 01 aa 75 90 00 00 18 00

May 31 02:34:59 prod-poller-processor kernel: [1754481.561579] EXT4-fs warning (device dm-4): ext4_end_bio:302: I/O error -5 writing to inode 73 (offset 4927488 size 8192 starting block 1005746 )

May 31 02:34:59 prod-poller-processor kernel: [1754481.561632] sd 0:0:0:0: [sda] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=1084s

May 31 02:34:59 prod-poller-processor kernel: [1754481.561635] sd 0:0:0:0: [sda] CDB: Write(10) 2a 00 01 ef b1 90 00 00 08 00

May 31 02:34:59 prod-poller-processor kernel: [1754481.561675] sd 0:0:0:0: [sda] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=1084s

May 31 02:34:59 prod-poller-processor kernel: [1754481.561678] sd 0:0:0:0: [sda] CDB: Write(10) 2a 00 01 ef d3 48 00 00 10 00

May 31 02:34:59 prod-poller-processor kernel: [1754481.561696] sd 0:0:0:0: [sda] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=1084s

May 31 02:34:59 prod-poller-processor kernel: [1754481.561699] sd 0:0:0:0: [sda] CDB: Write(10) 2a 00 01 ef d3 68 00 00 a0 00

May 31 02:34:59 prod-poller-processor kernel: [1754481.561758] sd 0:0:0:0: [sda] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=1084s

May 31 02:34:59 prod-poller-processor kernel: [1754481.561761] sd 0:0:0:0: [sda] CDB: Write(10) 2a 00 01 af c6 68 00 00 20 00

May 31 02:34:59 prod-poller-processor kernel: [1754481.561789] sd 0:0:0:0: [sda] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=1084s

May 31 02:34:59 prod-poller-processor kernel: [1754481.561792] sd 0:0:0:0: [sda] CDB: Write(10) 2a 00 01 ef b4 f0 00 00 10 00

May 31 02:34:59 prod-poller-processor kernel: [1754481.561814] EXT4-fs warning (device dm-0): ext4_end_bio:302: I/O error -5 writing to inode 131114 (offset 0 size 0 starting block 1029894)

May 31 02:34:59 prod-poller-processor kernel: [1754481.561852] EXT4-fs warning (device dm-0): ext4_end_bio:302: I/O error -5 writing to inode 131105 (offset 14843904 size 12288 starting block 261672)

May 31 02:34:59 prod-poller-processor kernel: [1754481.626924] JBD2: Detected IO errors while flushing file data on dm-0-8

 

31 May 2022, 07:34

Keith Slater took ownership of informing APMs.

 

31 May 2022, 08:12

Pete Pedersen stopped the system and fixed the corrupt partition.

 

31 May 2022, 08:26:55

The system was rebooted.

 

31 May 2022, 08:26:55

There was a network DNS failure during the boot process, and haproxy failed to start because prod-inventory-provider01.geant.org and prod-inventory-provider02.geant.org could not be resolved:


May 31 08:26:55 prod-poller-processor haproxy[976]: [ALERT] 150/082655 (976) : parsing [/etc/haproxy/haproxy.cfg:30] : 'server prod-inventory-provider01.geant.org' : could not resolve address 'prod-inventory-provider01.geant.org'.
May 31 08:26:55 prod-poller-processor haproxy[976]: [ALERT] 150/082655 (976) : parsing [/etc/haproxy/haproxy.cfg:31] : 'server prod-inventory-provider02.geant.org' : could not resolve address 'prod-inventory-provider02.geant.org'.

 

31 May 2022, 08:27:07

Kapacitor tasks failed to run because the haproxy service wasn't running, for example:

May 31 08:27:07 prod-poller-processor kapacitord[839]: ts=2022-05-31T08:27:07.962Z lvl=info msg="UDF log" service=kapacitor task_master=main task=service_enrichment node=inventory_enrichment2 text="urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=8080): Max retries exceeded with url: /poller/interfaces (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f749f4a2978>: Failed to establish a new connection: [Errno 111] Connection refused',))"


Since the Kapacitor tasks weren't running, network counters were still not being processed or saved to InfluxDB.
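
A minimal sketch of the enrichment lookup implied by the log above is shown here: the UDF fetches interface metadata through the local haproxy at localhost:8080. The retry count, delay, and function name are illustrative assumptions, not the actual UDF code.

# Illustrative sketch: interface metadata lookup via the local haproxy.
# Retry count, delay and function name are assumptions for illustration.
import time

import requests


def fetch_poller_interfaces(retries=3, delay_seconds=2):
    # haproxy proxies this to prod-inventory-provider01/02.geant.org
    url = "http://localhost:8080/poller/interfaces"
    last_error = None
    for attempt in range(retries):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response.json()
        except requests.exceptions.RequestException as exc:
            # With haproxy down, every attempt fails with "connection refused",
            # matching the MaxRetryError in the Kapacitor UDF log above.
            last_error = exc
            time.sleep(delay_seconds)
    raise RuntimeError(f"inventory provider unreachable: {last_error}")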

 

31 May 2022, 08:41:11

Puppet ran automatically and restarted haproxy.

By this time DNS resolution was back to normal, and haproxy started successfully.

However, the Kapacitor tasks were still in a non-executing state, so data was still not being processed.

 

31 May 2022, 09:27:10

Kapacitor was manually restarted. Normal BRIAN processing of real-time data was restored.

 

31 May 2022, 10:39

Sam Roberts copied the data points lost during the incident from UAT to production for the following measurements (a sketch of this kind of copy is shown after the list):

  • interface_rates
  • dscp32_rates
  • gwsd_rates
  • multicast_rates
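
The sketch below shows one way such a copy can be done with the InfluxDB 1.x Python client. It is a hypothetical illustration: the UAT hostname, database name, and time window are placeholders, not the commands actually used during the recovery.

# Hypothetical sketch of copying the missing points from UAT to production.
# The UAT hostname, database name and time window are placeholders.
from influxdb import InfluxDBClient

MEASUREMENTS = ["interface_rates", "dscp32_rates", "gwsd_rates", "multicast_rates"]
WINDOW = ("2022-05-30T12:50:00Z", "2022-05-31T09:30:00Z")

uat = InfluxDBClient(host="uat-influx.example.org", port=8086, ssl=True, database="brian")
prod = InfluxDBClient(host="influx-cluster.service.ha.geant.org", port=8086, ssl=True, database="brian")

for measurement in MEASUREMENTS:
    query = (
        f'SELECT * FROM "{measurement}" '
        f"WHERE time >= '{WINDOW[0]}' AND time <= '{WINDOW[1]}' GROUP BY *"
    )
    result = uat.query(query, epoch="ns")
    points = []
    # GROUP BY * returns one series per tag set, so tags can be preserved on write
    for (name, tags), rows in result.items():
        for row in rows:
            timestamp = row.pop("time")
            points.append({
                "measurement": measurement,
                "tags": tags,
                "time": timestamp,
                "fields": {k: v for k, v in row.items() if v is not None},
            })
    if points:
        prod.write_points(points, time_precision="n", batch_size=5000)
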
31 May 2022, 11:56

Keith Slater informed APMs that BRIAN was back to normal operation.

Proposed Solution

  • The core issue appears to be related to VMware, and IT needs to provide a solution. S.M.A.R.T. alerts have been found in vCenter, but monitoring has not been configured to detect these alerts.
  • This incident suggests that a previously logged technical debt issue (POL1-529), which has been considered medium/low priority, could be prioritized for development:
    • fixing this issue could generally help with temporary DNS resolution errors; however, the DNS issues were secondary in this incident, and fixing it would not have prevented the overall outage (a sketch of this kind of DNS-tolerant lookup follows this list)
  • While VMware disk corruption and network DNS failures are external events outside the control of SWD, a further investigation into potential improvements in processing resiliency is described in POL1-607.
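
As an illustration only (the exact scope of POL1-529 is not described in this report), the sketch below shows one way a client could tolerate temporary DNS resolution errors: retry the lookup with a short delay and fall back to the last successfully resolved address. All retry counts and helper names are hypothetical.

# Hypothetical sketch of tolerating temporary DNS failures by retrying the
# lookup and falling back to the last successfully resolved address.
import socket
import time

_last_good_address = {}  # hostname -> last successfully resolved IP


def resolve_with_fallback(hostname, retries=3, delay_seconds=2):
    for attempt in range(retries):
        try:
            address = socket.gethostbyname(hostname)
            _last_good_address[hostname] = address
            return address
        except socket.gaierror:
            time.sleep(delay_seconds)
    # DNS is still failing: fall back to the cached address, if available
    if hostname in _last_good_address:
        return _last_good_address[hostname]
    raise RuntimeError(f"could not resolve {hostname} and no cached address is available")


# Example usage with the hostnames that failed to resolve during the incident:
for host in ("prod-inventory-provider01.geant.org", "prod-inventory-provider02.geant.org"):
    try:
        print(host, "->", resolve_with_fallback(host))
    except RuntimeError as exc:
        print(exc)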