Incident Description

New BRIAN bitrate traffic data was not computed or available for approximately 21 hours.

...

The reasons for the degradation were:

  • Local system partition corruption
  • Failure to connect or write data to InfluxDB
  • cf. IT incident: 30052022


The impact of this service degradation was:

...

Incident severity: CRITICAL (temporary service outage)

Data loss: No

Total duration of incident: approximately 21 hours


Timeline

All times are in UTC

Date | Time (UTC) | Description

May 30 12:52:37

The first evidence of this incident appeared in the logs of prod-poller-processor.geant.org. remove_spikes_interface_rates is one of several stream functions in the data processing pipeline required for the data displayed in BRIAN (an illustrative sketch of this kind of calculation follows the log excerpt below).

Code Block
May 30 12:52:37 prod-poller-processor kapacitord[124994]: ts=2022-05-30T12:52:37.802Z lvl=error msg="failed to write points to InfluxDB" service=kapacitor task_master=main task=remove_spikes_gwsd_rates node=influxdb_out3 err=timeout
May 30 12:52:38 prod-poller-processor kapacitord[124994]: ts=2022-05-30T12:52:38.069Z lvl=error msg="encountered error" service=kapacitor task_master=main task=remove_spikes_interface_rates node=remove_spikes2 err="keepalive timedout, last keepalive received was: 2022-05-30 12:52:28.069298439 +0000 UTC"

 

May 30, afternoon

Several performance issues started being reported across the network:

  • EMS was failing to resolve the influx cluster hostnames
  • puppet was failing or taking a very long time to complete on many VMs

 

May 30 19:08

Keith Slater (and others) reported on the #brian channel that data was missing in the BRIAN GUI.

 

May 30 20:30

Bjarke Madsen replied that it seemed related to service problems seen earlier in the day.

 

May 30 21:12

Massimiliano Adamo replied on #swd-private that we had raised an issue with VMWare regarding storage device failure.

 

May 30 23:28

Linda Ness sent a mail to gn4-3-all@lists.geant.org indicating that several services were down.

 

12:53

For the duration of this event, Kapacitor continuously logged failures regarding writing to or communicating with InfluxDB, as below:

Code Block
May 31 00:49:08 prod-poller-processor kapacitord[54933]: ts=2022-05-31T00:49:08.133Z lvl=error msg="failed to write points to InfluxDB" service=kapacitor task_master=main task=interface_rates node=influxdb_out12 err=timeout
May 31 01:26:44 prod-poller-processor kapacitord[54933]: ts=2022-05-31T01:26:44.163Z lvl=error msg="failed to connect to InfluxDB, retrying..." service=influxdb cluster=read err="Get https://influx-cluster.service.ha.geant.org:8086/ping: dial tcp: lookup influx-cluster.service.ha.geant.org on 83.97.93.200:53: no such host"


This means that while Kapacitor was receiving live network counters in real time, the results of the rate calculations weren't being saved to InfluxDB.

 

May 31 02:34-08:11

There were many incidents of disk I/O failure logged over the duration of the event, indicating filesystem/disk corruption. For example:


Code Block
languagetext
May 31 02:34:59 prod-poller-processor kernel: [1754481.561423] sd 0:0:0:0: [sda] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=1084s
May 31 02:34:59 prod-poller-processor kernel: [1754481.561427] sd 0:0:0:0: [sda] CDB: Write(10) 2a 00 02 05 63 f0 00 00 08 00
May 31 02:34:59 prod-poller-processor kernel: [1754481.561442] EXT4-fs warning (device dm-0): ext4_end_bio:302: I/O error -5 writing to inode 149650 (offset 0 size 0 starting block 702078)
May 31 02:34:59 prod-poller-processor kernel: [1754481.561494] sd 0:0:0:0: [sda] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=1084s
May 31 02:34:59 prod-poller-processor kernel: [1754481.561497] sd 0:0:0:0: [sda] CDB: Write(10) 2a 00 01 ef d3 18 00 00 10 00
May 31 02:34:59 prod-poller-processor kernel: [1754481.561529] sd 0:0:0:0: [sda] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=1084s
May 31 02:34:59 prod-poller-processor kernel: [1754481.561532] sd 0:0:0:0: [sda] CDB: Write(10) 2a 00 01 af dd b0 00 00 10 00
May 31 02:34:59 prod-poller-processor kernel: [1754481.561550] sd 0:0:0:0: [sda] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=1084s
May 31 02:34:59 prod-poller-processor kernel: [1754481.561553] sd 0:0:0:0: [sda] CDB: Write(10) 2a 00 01 f3 b3 d0 00 00 08 00
May 31 02:34:59 prod-poller-processor kernel: [1754481.561567] sd 0:0:0:0: [sda] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=1086s
May 31 02:34:59 prod-poller-processor kernel: [1754481.561569] sd 0:0:0:0: [sda] CDB: Write(10) 2a 00 01 aa 75 90 00 00 18 00
May 31 02:34:59 prod-poller-processor kernel: [1754481.561579] EXT4-fs warning (device dm-4): ext4_end_bio:302: I/O error -5 writing to inode 73 (offset 4927488 size 8192 starting block 1005746)
May 31 02:34:59 prod-poller-processor kernel: [1754481.561632] sd 0:0:0:0: [sda] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=1084s
May 31 02:34:59 prod-poller-processor kernel: [1754481.561635] sd 0:0:0:0: [sda] CDB: Write(10) 2a 00 01 ef b1 90 00 00 08 00
May 31 02:34:59 prod-poller-processor kernel: [1754481.561675] sd 0:0:0:0: [sda] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=1084s
May 31 02:34:59 prod-poller-processor kernel: [1754481.561678] sd 0:0:0:0: [sda] CDB: Write(10) 2a 00 01 ef d3 48 00 00 10 00
May 31 02:34:59 prod-poller-processor kernel: [1754481.561696] sd 0:0:0:0: [sda] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=1084s
May 31 02:34:59 prod-poller-processor kernel: [1754481.561699] sd 0:0:0:0: [sda] CDB: Write(10) 2a 00 01 ef d3 68 00 00 a0 00
May 31 02:34:59 prod-poller-processor kernel: [1754481.561758] sd 0:0:0:0: [sda] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=1084s
May 31 02:34:59 prod-poller-processor kernel: [1754481.561761] sd 0:0:0:0: [sda] CDB: Write(10) 2a 00 01 af c6 68 00 00 20 00
May 31 02:34:59 prod-poller-processor kernel: [1754481.561789] sd 0:0:0:0: [sda] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=1084s
May 31 02:34:59 prod-poller-processor kernel: [1754481.561792] sd 0:0:0:0: [sda] CDB: Write(10) 2a 00 01 ef b4 f0 00 00 10 00
May 31 02:34:59 prod-poller-processor kernel: [1754481.561814] EXT4-fs warning (device dm-0): ext4_end_bio:302: I/O error -5 writing to inode 131114 (offset 0 size 0 starting block 1029894)
May 31 02:34:59 prod-poller-processor kernel: [1754481.561852] EXT4-fs warning (device dm-0): ext4_end_bio:302: I/O error -5 writing to inode 131105 (offset 14843904 size 12288 starting block 261672)
May 31 02:34:59 prod-poller-processor kernel: [1754481.626924] JBD2: Detected IO errors while flushing file data on dm-0-8

 

May 31 07:34

Keith Slater took ownership of informing APMs.

 

May 31 08:12

Pete Pedersen stopped the system and fixed the corrupt partition.

 

May 31 08:26:55

The system was rebooted.

 

May 31 08:26:55

There was a network DNS failure during the boot process and haproxy failed to start, because prod-inventory-provider01.geant.org and prod-inventory-provider02.geant.org couldn't be resolved:

Code Block
May 31 08:26:55 prod-poller-processor haproxy[976]: [ALERT] 150/082655 (976) : parsing [/etc/haproxy/haproxy.cfg:30] : 'server : could not resolve address
May 31 08:26:55 prod-poller-processor haproxy[976]: [ALERT] 150/082655 (976) : parsing [/etc/haproxy/haproxy.cfg:31] : 'server : could not resolve address

 

May 31 08:27:07

Kapacitor tasks failed to run because the haproxy service wasn't running, for example:

Code Block
May 31 08:27:07 prod-poller-processor kapacitord[839]: ts=2022-05-31T08:27:07.962Z lvl=info msg="UDF log" service=kapacitor task_master=main task=service_enrichment node=inventory_enrichment2 text="urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=8080): Max retries exceeded with url: /poller/interfaces (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f749f4a2978>: Failed to establish a new connection: [Errno 111] Connection refused',))"


Since the Kapacitor tasks weren't running, network counters were still not being processed or saved to InfluxDB.

 

May 31 08:41:11

puppet ran automatically and restarted haproxy. At this time DNS resolution was back to normal and haproxy successfully started, but Kapacitor tasks were still in a non-executing state, therefore data was still not being processed.

 

May 31 09:27:10

Manual restart of Kapacitor. BRIAN's processing of real-time data was restored.

 

May 31 10:39

Sam Roberts copied the data points lost during the incident from UAT to production for the following measurements (an illustrative sketch of this step follows the list):

  • interface_rates
  • dscp32_rates
  • gwsd_rates
  • multicast_rates
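The sketch below illustrates how such a copy could be done with the InfluxDB 1.x Python client: query each affected measurement on the UAT instance for the outage window and write the returned points into production. The hostnames, database name and time window are assumptions for illustration; the actual recovery procedure may have used different tooling (for example a backup/restore).

Code Block
from influxdb import InfluxDBClient  # influxdb-python, InfluxDB 1.x client

MEASUREMENTS = ["interface_rates", "dscp32_rates", "gwsd_rates", "multicast_rates"]
START, END = "2022-05-30T12:52:00Z", "2022-05-31T09:28:00Z"  # assumed outage window

# hypothetical hostnames and database name
uat = InfluxDBClient(host="uat-influx.example.geant.org", port=8086, database="brian")
prod = InfluxDBClient(host="prod-influx.example.geant.org", port=8086, database="brian")

for measurement in MEASUREMENTS:
    # GROUP BY * keeps tag values in the series key instead of mixing them with fields
    result = uat.query(
        f'SELECT * FROM "{measurement}" WHERE time >= \'{START}\' AND time <= \'{END}\' GROUP BY *'
    )
    for (_, tags), rows in result.items():
        points = [
            {"measurement": measurement, "tags": tags or {}, "time": row.pop("time"), "fields": row}
            for row in rows
        ]
        prod.write_points(points, batch_size=5000)  # batch to keep individual requests small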
May 31 11:56

Keith Slater informed APMs that BRIAN was back to normal operation.

Proposed Solution

  • The core issue seems to be related to VMWare, and IT need to provide a solution. S.M.A.R.T. alerts have been found in the vCenter, but monitoring has not been configured to detect these alerts.
  • A previously-known issue with the Kapacitor tasks stopping due to unchecked errors meant that the services were not executing for longer than necessary (a sketch of a possible health check for this condition follows this list).
  • This incident suggests that a previously logged technical debt issue (POL1-529), which has been considered medium/low priority, could be prioritized for development:
    • fixing this issue could generally help with temporary DNS resolution errors; however, the DNS issues were secondary in this incident and fixing it would not have prevented the overall outage
  • While VMWare disk corruption and network DNS failures are external events outside the control of SWD, a further investigation into potential improvements in processing resiliency is described in POL1-607.
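As a sketch of the kind of check that could catch the stopped-task condition earlier, the script below polls the standard Kapacitor 1.x HTTP API and flags tasks that are enabled but not executing, which is the state the tasks were stuck in until the 09:27 manual restart. The port and the alerting action are placeholders.

Code Block
import requests

# default Kapacitor HTTP API port; adjust for the actual deployment
KAPACITOR_TASKS_URL = "http://localhost:9092/kapacitor/v1/tasks"

def find_stalled_tasks():
    """Return the ids of tasks that are enabled but not currently executing."""
    tasks = requests.get(KAPACITOR_TASKS_URL, timeout=10).json().get("tasks", [])
    return [
        task["id"]
        for task in tasks
        if task.get("status") == "enabled" and not task.get("executing", False)
    ]

if __name__ == "__main__":
    stalled = find_stalled_tasks()
    if stalled:
        # placeholder: raise an alert (mail, Slack, ticket) instead of just printing
        print("Kapacitor tasks enabled but not executing:", stalled)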