
Incident Description

EMS (via https://events.geant.org) and other services were unavailable for a few minutes at a time throughout the day. This was due to the PostgreSQL service being unavailable.


The reason for the degradation:

A VMWare storage issue (confirmed by IT at 16:08) left the PostgreSQL service intermittently unreachable; the affected VMs also logged filesystem errors (logical block warnings on prod-event01).


The impact of this service degradation was that the following services were intermittently unavailable:

  • EMS (https://events.geant.org)
  • Map
  • CompendiumDB

Incident severity: CRITICAL (intermittent service outage)

Data loss: NO

Total duration of incident: 21 hours


Timeline

All times are in UTC

Day 1

13:10

First error in indico.log of PostgreSQL being unavailable:

OperationalError: (psycopg2.OperationalError) could not translate host name "prod-postgres.geant.org" to address: Name or service not known

13:20

First user query about an EMS login problem (Slack #it)

13:24

Service restored and acknowledged by users on Slack #it

(First of many periods in which the service became unavailable and then available again)

13:27

Ian Galpin starts investigating and finds the DNS resolution error: (psycopg2.OperationalError) could not translate host name "prod-postgres.geant.org" to address (a minimal reproduction of this check is sketched after the timeline)

16:08

IT confirms that there is a VMWare storage issue, via Massimiliano Adamo on Slack #swd-private:

"it's a storage issue."

20:50

Additional outages occur; IT still working on the issue with VMWare

23:28

Linda Ness sent a mail to gn4-3-all@lists.geant.org indicating that several services are down

Day 2

07:33 - 07:46

Pete Pedersen, Massimiliano Adamo and Allen Kong restore the AD service by shutting down the FRA node and starting it back up

08:20

Ian Galpin can't log into the EMS servers; Pete Pedersen, Massimiliano Adamo and Allen Kong look into it.

Allen Kong on #service-issue-30052022:

"@massimiliano.adamo I'm looking at prod-event01 and seeing lots of nasty logical block warnings...can you check?"

08:50

Pete Pedersen and Massimiliano Adamo restarted the EMS VMs and ran filesystem checks

PETE AND MAX FIXED puppet/postgres PLEASE FILL IN

09:45

Ian Galpin tested and verified that service was restored for EMS/Map/CompendiumDB

Mandeep Saini updated #it
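
For reference, the DNS failure seen in indico.log can be reproduced outside Indico with a few lines of Python. This is a minimal sketch, not the team's actual diagnostic procedure: the host name comes from the log above, while the port, database name, and credentials are illustrative placeholders.

import socket

import psycopg2  # the same driver that raised the error in indico.log


HOST = "prod-postgres.geant.org"  # host name taken from the indico.log error


def check_postgres(host, port=5432):
    # Step 1: name resolution. This is the step that failed during the
    # incident ("could not translate host name ... Name or service not known").
    try:
        socket.getaddrinfo(host, port)
    except socket.gaierror as exc:
        return "DNS FAILURE: %s" % exc

    # Step 2: a real connection attempt, to separate DNS problems from
    # database problems (credentials here are placeholders).
    try:
        psycopg2.connect(host=host, port=port, dbname="postgres",
                         user="monitor", password="secret",
                         connect_timeout=5).close()
    except psycopg2.OperationalError as exc:
        return "CONNECTION FAILURE: %s" % exc

    return "OK"


if __name__ == "__main__":
    print(check_postgres(HOST))

Run periodically, a check like this distinguishes "the name does not resolve" (the symptom seen here) from "the database refused the connection", which point to different root causes.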

Proposed Solutions

  • The core issue appears to be related to VMWare storage; IT needs to provide a solution for monitoring the health of the VM storage
  • The HA of the primary services (EMS, Map, etc.) and of the dependent core services (PostgreSQL etc.) needs to be improved:
    • Additional monitoring configuration or setup changes are required to more easily identify the root cause of an outage
      • If a core service goes down (e.g. no PostgreSQL primary set), that alarm will be hard to spot among the hundreds of subsequent alarms; a sketch of such root-cause filtering follows this list
  • PETE AND MAX TO PLEASE FILL IN
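
One possible shape for the root-cause filtering mentioned above, as a sketch only: the dependency map is hypothetical (the service names are taken from this report), and a real deployment would live inside the monitoring system rather than a standalone script.

# A minimal sketch of dependency-aware alarm filtering.
DEPENDS_ON = {
    "EMS": ["PostgreSQL"],
    "Map": ["PostgreSQL"],
    "CompendiumDB": ["PostgreSQL"],
    "PostgreSQL": ["VM storage"],
    "VM storage": [],
}


def root_causes(alarming_services):
    """Return only the alarms whose own dependencies are healthy,
    i.e. the likely root causes, hiding the downstream noise."""
    alarming = set(alarming_services)
    return {
        svc for svc in alarming
        if not any(dep in alarming for dep in DEPENDS_ON.get(svc, []))
    }


# Example: during this incident all of these would have alarmed at once;
# the filter surfaces the storage alarm instead of the downstream noise.
print(root_causes(["EMS", "Map", "CompendiumDB", "PostgreSQL", "VM storage"]))
# -> {'VM storage'}

Surfacing only the alarm with no alarming dependency is what would make "no PostgreSQL primary set" visible among hundreds of application-level alerts.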