Incident Description

EMS (via https://events.geant.org) and other services (GÉANT Connectivity Map, Compendium Database) were intermittently unavailable for a few minutes at a time throughout the day. This was due to the PostgreSQL service being unavailable.


The reason for degradation:


The impact of this service degradation was that the following services were unavailable:


Incident severity: CRITICAL Intermittent service outage

Data loss: NO

Total duration of incident: 21 hours


Timeline

All times are in UTC

Date | Time | Description

 

13:10:00 

First error in indico.log of PostgreSQL being unavailable:

OperationalError: (psycopg2.OperationalError) could not translate host name "prod-postgres.geant.org" to address: Name or service not known

 

13:20

First user query about EMS login problem (Slack #it)

 

13:24

Service restored and acknowledged by users on Slack #it

(First of many service unavailable then available again periods)

 

13:27

Ian Galpin starts investigating and finds the DNS resolution error: (psycopg2.OperationalError) could not translate host name "prod-postgres.geant.org" to address

 

16:08

IT confirms that there is a VMware storage issue, via Massimiliano Adamo on Slack #swd-private

it's a storage issue.

 

20:50

Additional outages occur; IT is still working on the VMware issue


 

23:28

Linda Ness sent a mail to gn4-3-all@lists.geant.org indicating several services are down

 

07:33 - 07:46

Pete Pedersen, Massimiliano Adamo and Allen Kong work on restoring the AD service by shutting down the FRA node and starting it back up

 

08:20

Ian Galpin can't log into the EMS servers; Pete Pedersen, Massimiliano Adamo and Allen Kong are looking into it

Allen Kong on #service-issue-30052022:

@massimiliano.adamo I'm looking at prod-event01 and seeing lots of nasty logical block warnings...can you check?

 

08:50

Pete Pedersen and Massimiliano Adamo restarted EMS VMs and ran filesystem checks



PETE AND MAX FIXED puppet/postgres PLEASE FILL IN

 

09:45

Ian Galpin tested and verified that service was restored for EMS/Map/CompendiumDB

Mandeep Saini updated #it







Proposed Solutions

  • The core issue seems to be related to VMware storage, and IT needs to provide a solution for monitoring the health of the VM storage
  • The HA of primary services (EMS, Map, etc.) and dependent core services (PostgreSQL etc.) needs to be improved:
    • Additional monitoring configuration or setup changes are required to more easily identify the root cause of an outage
      • If a core service goes down (e.g. no PostgreSQL primary set), that alarm will be hard to spot in the hundreds of subsequent alarms
  • Postgres 
    • FAILURE: fail-over did not work because it required an active witness, and the witness had failed because it was in the same VM cluster as the primary DB server
    • Proposed solution:
      • add two more witness servers so that we have one on every site
      • a second suggestion would be to change the way the Postgres clients connect so that they are cluster-aware (i.e. all nodes are listed in the connection string, with the targetServerType=primary option) and connect directly to Postgres; this would remove the need for the Consul config / intermediary (fewer parts, less to break)
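
The cluster-aware connection suggestion can be sketched for the Python clients as follows. This is a minimal sketch under stated assumptions, not the agreed configuration. targetServerType is the PostgreSQL JDBC driver's parameter name; clients built on libpq (such as psycopg2, which appears in the errors above) use target_session_attrs instead, so the sketch uses target_session_attrs=read-write. The host names, database name, user and the build_multihost_dsn helper are hypothetical placeholders.

```python
from typing import Sequence


def build_multihost_dsn(
    hosts: Sequence[str],
    dbname: str,
    user: str,
    port: int = 5432,
    connect_timeout: int = 5,
) -> str:
    """Build a libpq keyword/value DSN that lists every cluster node.

    libpq tries the hosts in order and, with
    target_session_attrs=read-write, keeps only a connection that
    accepts writes, i.e. the current primary.
    """
    if not hosts:
        raise ValueError("at least one host is required")
    return " ".join(
        [
            "host=" + ",".join(hosts),
            # One port per host, in the same order as the host list.
            "port=" + ",".join(str(port) for _ in hosts),
            f"dbname={dbname}",
            f"user={user}",
            # Timeout applied to each host attempt, in seconds.
            f"connect_timeout={connect_timeout}",
            "target_session_attrs=read-write",
        ]
    )


# Hypothetical node names, one per site (assumption, not the real hosts).
dsn = build_multihost_dsn(
    ["pg-site-a.example.org", "pg-site-b.example.org", "pg-site-c.example.org"],
    dbname="exampledb",
    user="exampleuser",
)
# A client would then connect with: psycopg2.connect(dsn)
```

With a string of this shape the client finds the primary itself after a fail-over, so no intermediate name lookup is needed. It does not replace the fail-over mechanism: a standby still has to be promoted before any listed node accepts writes.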
