Incident Description

EMS (via https://events.geant.org) and other services (GÉANT Connectivity Map, Compendium Database) were unavailable for a few minutes at a time throughout the day. This was due to the PostgreSQL service being unavailable.


The reason for degradation: the PostgreSQL database service was intermittently unavailable, caused by a VMware storage issue affecting the virtual machines that host it.

The impact of this service degradation was that the following services were intermittently unavailable: EMS (https://events.geant.org), the GÉANT Connectivity Map and the Compendium Database.


Incident severity: CRITICAL (intermittent service outage)

Data loss: NO

Total duration of incident: 21 hours


Timeline

All times are in UTC

13:10
First error in indico.log showing PostgreSQL as unavailable:

OperationalError: (psycopg2.OperationalError) could not translate host name "prod-postgres.geant.org" to address: Name or service not known

13:20
First user query about an EMS login problem (Slack #it)

13:24
Service restored and acknowledged by users on Slack #it (the first of many periods in which the service became unavailable and then available again)

13:27
Ian Galpin starts investigating and finds the DNS resolution error: (psycopg2.OperationalError) could not translate host name "prod-postgres.geant.org" to address (see the probe sketch after this timeline)

16:08
IT confirms that there is a VMware storage issue, via Massimiliano Adamo on Slack #swd-private: "it's a storage issue."

20:50
Additional outages occur; IT is still working on the issue with VMware

23:28
Linda Ness sent a mail to gn4-3-all@lists.geant.org indicating that several services are down

07:33 - 07:46 (following day)
Pete Pedersen, Massimiliano Adamo and Allen Kong work on restoring the AD service by shutting down the FRA node and starting it back up

08:20
Ian Galpin can't log into the EMS servers; Pete Pedersen, Massimiliano Adamo and Allen Kong look into it. Allen Kong on #service-issue-30052022: "@massimiliano.adamo I'm looking at prod-event01 and seeing lots of nasty logical block warnings...can you check?"

08:50
Pete Pedersen and Massimiliano Adamo restarted the EMS VMs and ran filesystem checks

09:30
Massimiliano Adamo and Pete Pedersen recover the primary database server by rebooting it in recovery mode and running fsck to repair the data; the same had to be done for the witness node. The secondary PostgreSQL server had shut down because it had no witness. Once the primary server had rebooted, puppet agent -t had to be run to bring it fully up.

09:45
Ian Galpin tested and verified that service was restored for EMS/Map/CompendiumDB. Mandeep Saini updated #it.
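
The 13:10 and 13:27 entries show that the storage problem first surfaced indirectly, as a hostname-resolution error in indico.log, rather than as an explicit "database down" alarm. As an illustration only, a probe along the following lines could report DNS failures and TCP connection failures as separate conditions; the hostname is taken from the log above, the port is simply the PostgreSQL default, and nothing here reflects the existing monitoring setup.

#!/usr/bin/env python3
"""Illustrative probe: separate DNS failures from PostgreSQL connection failures.

Sketch only -- the hostname comes from the indico.log error in the timeline and
5432 is the PostgreSQL default port; both are assumptions to adjust for real use.
"""
import socket

DB_HOST = "prod-postgres.geant.org"   # hostname seen in the indico.log error
DB_PORT = 5432                        # assumed default PostgreSQL port


def probe(host: str, port: int, timeout: float = 5.0) -> str:
    """Return a coarse status: OK, DNS_FAILURE or CONNECT_FAILURE."""
    try:
        # Step 1: name resolution only -- this is the step that failed at 13:10.
        socket.getaddrinfo(host, port)
    except socket.gaierror as exc:
        return f"DNS_FAILURE: {exc}"
    try:
        # Step 2: plain TCP connect (reachability only, not a full PostgreSQL handshake).
        with socket.create_connection((host, port), timeout=timeout):
            return "OK"
    except OSError as exc:
        return f"CONNECT_FAILURE: {exc}"


if __name__ == "__main__":
    print(probe(DB_HOST, DB_PORT))

Reporting the two conditions separately makes it immediately clear whether the name service or the database itself is the failing component.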

Proposed Solutions

  • The core issue appears to be with VMware storage; IT needs to provide a solution for monitoring the health of the VM storage.
  • However, the high availability of the primary services (EMS, Map, etc.) and the core services they depend on (PostgreSQL, etc.) also needs to be improved:
    • Additional monitoring configuration or setup changes are required to identify the root cause of an outage more easily.
      • If a core service goes down (e.g. no PostgreSQL primary set), that alarm will be hard to spot among the hundreds of subsequent alarms (see the monitoring sketch after this list).
    • PostgreSQL fail-over did not work because it requires an active witness, and the witness failed because it is hosted on the same VM cluster as the primary DB server. The proposed solution is therefore to add two more witness servers so that there is one at every site (FRA, PRA, LON) (DEVOPS-27).
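
As an illustration of the monitoring point referenced in the list above, the following check polls each database node and raises one explicit, high-priority alarm when no reachable node reports itself as primary, so that the condition does not have to be inferred from hundreds of follow-on alarms. The node hostnames, database name, user and alerting action are hypothetical placeholders, not the real deployment values; the check relies on PostgreSQL's pg_is_in_recovery(), which returns false on a primary and true on a standby.

#!/usr/bin/env python3
"""Sketch: raise a single explicit alarm when no PostgreSQL primary is reachable.

Hostnames, dbname, user and the alerting action are hypothetical placeholders;
the real node addresses, credentials and alert channel belong to the deployment.
"""
import psycopg2

# Hypothetical node names, one per site (FRA, PRA, LON).
NODES = ["pg-fra.example.org", "pg-pra.example.org", "pg-lon.example.org"]


def is_primary(host: str) -> bool:
    """True if the node is reachable and reports itself as primary."""
    conn = psycopg2.connect(
        host=host, dbname="postgres", user="monitor", connect_timeout=5
    )
    try:
        with conn.cursor() as cur:
            # pg_is_in_recovery() is false on a primary, true on a standby.
            cur.execute("SELECT pg_is_in_recovery()")
            return cur.fetchone()[0] is False
    finally:
        conn.close()


def check_cluster(nodes) -> None:
    primaries = []
    for host in nodes:
        try:
            if is_primary(host):
                primaries.append(host)
        except psycopg2.OperationalError as exc:
            print(f"WARNING: {host} unreachable: {exc}")
    if not primaries:
        # The single, explicit alarm proposed above; printed here for illustration.
        print("CRITICAL: no PostgreSQL primary reachable")
    else:
        print("OK: primary on " + ", ".join(primaries))


if __name__ == "__main__":
    check_cluster(NODES)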