Incident Description

New BRIAN bitrate traffic data was not computed or available for approximately 21 hours.

...

The reasons for the degradation were:

  • Local system partition corruption
  • Failure to connect or write data to InfluxDB
  • cf. IT incident: 30052022


The impact of this service degradation was:

...

Incident severity: CRITICAL (temporary service outage)

Data loss: No

Total duration of incident: approximately 21 hours


Timeline

All times are in UTC

Date | Time (UTC) | Description

May 30 12:52:37

The first evidence of this incident appeared in the logs of prod-poller-processor.geant.org. remove_spikes_interface_rates is one of several stream functions in the data processing pipeline required for the data displayed in BRIAN (an illustrative sketch of this kind of calculation follows the log excerpt below).

Code Block
May 30 12:52:37 prod-poller-processor kapacitord[124994]: ts=2022-05-30T12:52:37.802Z lvl=error msg="failed to write points to InfluxDB" service=kapacitor task_master=main task=remove_spikes_gwsd_rates node=influxdb_out3 err=timeout
May 30 12:52:38 prod-poller-processor kapacitord[124994]: ts=2022-05-30T12:52:38.069Z lvl=error msg="encountered error" service=kapacitor task_master=main task=remove_spikes_interface_rates node=remove_spikes2 err="keepalive timedout, last keepalive received was: 2022-05-30 12:52:28.069298439 +0000 UTC"

 

May 30, afternoon

Several performance issues started being reported across the network:

  • EMS was failing to resolve the influx cluster hostnames
  • puppet was failing or taking a very long time to complete on many VMs

 

May 30 19:08

Keith Slater (and others) reported on the #brian channel that data was missing in the BRIAN GUI.

 

May 30 20:30

Bjarke Madsen replied that it seemed related to service problems seen earlier in the day.

 

May 30 21:12

Massimiliano Adamo replied on #swd-private that we had raised an issue with VMWare regarding storage device failure.

 

May 30 23:28

Linda Ness sent a mail to gn4-3-all@lists.geant.org indicating that several services were down.

 

12:53

For the duration of this event, Kapacitor continuously logged failures regarding writing to or communicating with InfluxDB, as below:

Code Block
May 31 00:49:08 prod-poller-processor kapacitord[54933]: ts=2022-05-31T00:49:08.133Z lvl=error msg="failed to write points to InfluxDB" service=kapacitor task_master=main task=interface_rates node=influxdb_out12 err=timeout
May 31 01:26:44 prod-poller-processor kapacitord[54933]: ts=2022-05-31T01:26:44.163Z lvl=error msg="failed to connect to InfluxDB, retrying..." service=influxdb cluster=read err="Get https://influx-cluster.service.ha.geant.org:8086/ping: dial tcp: lookup influx-cluster.service.ha.geant.org on 83.97.93.200:53: no such host"


This means that while Kapacitor was receiving live network counters in real time, the results of the rate calculations weren't being saved to InfluxDB.

 

May 31 02:34-08:11

There were many incidents of disk I/O failure logged over the duration of the event, indicating filesystem/disk corruption. For example:


Code Block
languagetext
May 31 02:34:59 prod-poller-processor kernel: [1754481.561423] sd 0:0:0:0: [sda] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=1084s
May 31 02:34:59 prod-poller-processor kernel: [1754481.561427] sd 0:0:0:0: [sda] CDB: Write(10) 2a 00 02 05 63 f0 00 00 08 00
May 31 02:34:59 prod-poller-processor kernel: [1754481.561442] EXT4-fs warning (device dm-0): ext4_end_bio:302: I/O error -5 writing to inode 149650 (offset 0 size 0 starting block 702078)
May 31 02:34:59 prod-poller-processor kernel: [1754481.561494] sd 0:0:0:0: [sda] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=1084s
May 31 02:34:59 prod-poller-processor kernel: [1754481.561497] sd 0:0:0:0: [sda] CDB: Write(10) 2a 00 01 ef d3 18 00 00 10 00
May 31 02:34:59 prod-poller-processor kernel: [1754481.561529] sd 0:0:0:0: [sda] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=1084s
May 31 02:34:59 prod-poller-processor kernel: [1754481.561532] sd 0:0:0:0: [sda] CDB: Write(10) 2a 00 01 af dd b0 00 00 10 00
May 31 02:34:59 prod-poller-processor kernel: [1754481.561550] sd 0:0:0:0: [sda] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=1084s
May 31 02:34:59 prod-poller-processor kernel: [1754481.561553] sd 0:0:0:0: [sda] CDB: Write(10) 2a 00 01 f3 b3 d0 00 00 08 00
May 31 02:34:59 prod-poller-processor kernel: [1754481.561567] sd 0:0:0:0: [sda] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=1086s
May 31 02:34:59 prod-poller-processor kernel: [1754481.561569] sd 0:0:0:0: [sda] CDB: Write(10) 2a 00 01 aa 75 90 00 00 18 00
May 31 02:34:59 prod-poller-processor kernel: [1754481.561579] EXT4-fs warning (device dm-4): ext4_end_bio:302: I/O error -5 writing to inode 73 (offset 4927488 size 8192 starting block 1005746)
May 31 02:34:59 prod-poller-processor kernel: [1754481.561632] sd 0:0:0:0: [sda] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=1084s
May 31 02:34:59 prod-poller-processor kernel: [1754481.561635] sd 0:0:0:0: [sda] CDB: Write(10) 2a 00 01 ef b1 90 00 00 08 00
May 31 02:34:59 prod-poller-processor kernel: [1754481.561675] sd 0:0:0:0: [sda] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=1084s
May 31 02:34:59 prod-poller-processor kernel: [1754481.561678] sd 0:0:0:0: [sda] CDB: Write(10) 2a 00 01 ef d3 48 00 00 10 00
May 31 02:34:59 prod-poller-processor kernel: [1754481.561696] sd 0:0:0:0: [sda] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=1084s
May 31 02:34:59 prod-poller-processor kernel: [1754481.561699] sd 0:0:0:0: [sda] CDB: Write(10) 2a 00 01 ef d3 68 00 00 a0 00
May 31 02:34:59 prod-poller-processor kernel: [1754481.561758] sd 0:0:0:0: [sda] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=1084s
May 31 02:34:59 prod-poller-processor kernel: [1754481.561761] sd 0:0:0:0: [sda] CDB: Write(10) 2a 00 01 af c6 68 00 00 20 00
May 31 02:34:59 prod-poller-processor kernel: [1754481.561789] sd 0:0:0:0: [sda] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=1084s
May 31 02:34:59 prod-poller-processor kernel: [1754481.561792] sd 0:0:0:0: [sda] CDB: Write(10) 2a 00 01 ef b4 f0 00 00 10 00
May 31 02:34:59 prod-poller-processor kernel: [1754481.561814] EXT4-fs warning (device dm-0): ext4_end_bio:302: I/O error -5 writing to inode 131114 (offset 0 size 0 starting block 1029894)
May 31 02:34:59 prod-poller-processor kernel: [1754481.561852] EXT4-fs warning (device dm-0): ext4_end_bio:302: I/O error -5 writing to inode 131105 (offset 14843904 size 12288 starting block 261672)
May 31 02:34:59 prod-poller-processor kernel: [1754481.626924] JBD2: Detected IO errors while flushing file data on dm-0-8

 

May 31 07:34

Keith Slater took ownership of informing APMs.

 

May 31 08:12

Pete Pedersen stopped the system and fixed the corrupt partition.

 

May 31 08:26:55

The system was rebooted.

 

May 31 08:26:55

There was a network DNS failure during the boot process and haproxy failed to start, because prod-inventory-provider01.geant.org and prod-inventory-provider02.geant.org couldn't be resolved:

Code Block
May 31 08:26:55 prod-poller-processor haproxy[976]: [ALERT] 150/082655 (976) : parsing [/etc/haproxy/haproxy.cfg:30] : 'server : could not resolve address
May 31 08:26:55 prod-poller-processor haproxy[976]: [ALERT] 150/082655 (976) : parsing [/etc/haproxy/haproxy.cfg:31] : 'server : could not resolve address

 

May 31 08:27:07

Kapacitor tasks failed to run because the haproxy service wasn't running, for example:

Code Block
May 31 08:27:07 prod-poller-processor kapacitord[839]: ts=2022-05-31T08:27:07.962Z lvl=info msg="UDF log" service=kapacitor task_master=main task=service_enrichment node=inventory_enrichment2 text="urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=8080): Max retries exceeded with url: /poller/interfaces (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f749f4a2978>: Failed to establish a new connection: [Errno 111] Connection refused',))"


Since the Kapacitor tasks weren't running, network counters were still not being processed or saved to InfluxDB.

 

May 31 08:41:11

puppet ran automatically and restarted haproxy. At this time DNS resolution was back to normal and haproxy successfully started, but Kapacitor tasks were still in a non-executing state, therefore data was still not being processed.

 

May 31 09:27:10

Manual restart of Kapacitor. BRIAN's processing of real-time data was restored.

 

May 31 10:39

Sam Roberts copied the data points lost during the incident from UAT to production for the following measurements (an illustrative sketch of this step follows the list):

  • interface_rates
  • dscp32_rates
  • gwsd_rates
  • multicast_rates
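The sketch below illustrates how such a copy could be done with the InfluxDB 1.x Python client: query each affected measurement on the UAT instance for the outage window and write the returned points into production. The hostnames, database name and time window are assumptions for illustration; the actual recovery procedure may have used different tooling (for example a backup/restore).

Code Block
from influxdb import InfluxDBClient  # influxdb-python, InfluxDB 1.x client

MEASUREMENTS = ["interface_rates", "dscp32_rates", "gwsd_rates", "multicast_rates"]
START, END = "2022-05-30T12:52:00Z", "2022-05-31T09:28:00Z"  # assumed outage window

# hypothetical hostnames and database name
uat = InfluxDBClient(host="uat-influx.example.geant.org", port=8086, database="brian")
prod = InfluxDBClient(host="prod-influx.example.geant.org", port=8086, database="brian")

for measurement in MEASUREMENTS:
    # GROUP BY * keeps tag values in the series key instead of mixing them with fields
    result = uat.query(
        f'SELECT * FROM "{measurement}" WHERE time >= \'{START}\' AND time <= \'{END}\' GROUP BY *'
    )
    for (_, tags), rows in result.items():
        points = [
            {"measurement": measurement, "tags": tags or {}, "time": row.pop("time"), "fields": row}
            for row in rows
        ]
        prod.write_points(points, batch_size=5000)  # batch to keep individual requests small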
May 31 11:56

Keith Slater informed APMs that BRIAN was back to normal operation.

Proposed Solution

  • The core issue seems to be related to VMWare, and IT need to provide a solution. S.M.A.R.T. alerts have been found in the vCenter, but monitoring has not been configured to detect these alerts.
  • A previously-known issue with the Kapacitor tasks stopping due to unchecked errors meant that the services were not executing for longer than necessary (a sketch of a possible health check for this condition follows this list).
  • This incident suggests that a previously logged technical debt issue (POL1-529), which has been considered medium/low priority, could be prioritized for development:
    • fixing this issue could generally help with temporary DNS resolution errors; however, the DNS issues were secondary in this incident and fixing it would not have prevented the overall outage
  • While VMWare disk corruption and network DNS failures are external events outside the control of SWD, a further investigation into potential improvements in processing resiliency is described in POL1-607.
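As a sketch of the kind of check that could catch the stopped-task condition earlier, the script below polls the standard Kapacitor 1.x HTTP API and flags tasks that are enabled but not executing, which is the state the tasks were stuck in until the 09:27 manual restart. The port and the alerting action are placeholders.

Code Block
import requests

# default Kapacitor HTTP API port; adjust for the actual deployment
KAPACITOR_TASKS_URL = "http://localhost:9092/kapacitor/v1/tasks"

def find_stalled_tasks():
    """Return the ids of tasks that are enabled but not currently executing."""
    tasks = requests.get(KAPACITOR_TASKS_URL, timeout=10).json().get("tasks", [])
    return [
        task["id"]
        for task in tasks
        if task.get("status") == "enabled" and not task.get("executing", False)
    ]

if __name__ == "__main__":
    stalled = find_stalled_tasks()
    if stalled:
        # placeholder: raise an alert (mail, Slack, ticket) instead of just printing
        print("Kapacitor tasks enabled but not executing:", stalled)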