Incident Description
EMS (via https://events.geant.org) and other services (GÉANT Connectivity Map, Compendium Database) were unavailable for a few minutes at a time throughout the day. This was due to the PostgreSQL service being unavailable.
The reasons for the degradation:
- prod-events01.geant.org and prod-events02.geant.org could not resolve the hostname for the PostgreSQL server (prod-postgres.geant.org)
- The first PostgreSQL server (prod-postgres01.geant.org) and the replication witness (prod-postgres-witness.geant.org) failed to start up correctly due to the VMware storage problems
- Multiple sites in GÉANT's infrastructure experienced service interruption
- The underlying cause seems to be a storage error on VMware. IT are investigating together with VMware support
- cf. IT incident: 30052022
The impact of this service degradation was that the following services were intermittently unavailable: EMS (https://events.geant.org), the GÉANT Connectivity Map and the Compendium Database.
Incident severity: Intermittent service outage
Data loss: 
Total duration of incident: 21 hours
Timeline
All times are in UTC
| Date | Time | Description |
|---|---|---|
| 30 May 2022 | 13:10 | First error in indico.log of PostgreSQL being unavailable: OperationalError: (psycopg2.OperationalError) could not translate host name "prod-postgres.geant.org" to address: Name or service not known |
| 30 May 2022 | 13:20 | First user query about EMS login problems (Slack #it) |
| 30 May 2022 | 13:24 | Service restored and acknowledged by users on Slack #it (the first of many periods of the service becoming unavailable and then available again) |
| 30 May 2022 | 13:27 | Ian Galpin starts investigating and finds the DNS resolution error: (psycopg2.OperationalError) could not translate host name "prod-postgres.geant.org" to address |
| 30 May 2022 | 16:08 | IT confirms, via Massimiliano Adamo on Slack #swd-private, that there is a VMware storage issue |
| 30 May 2022 | 20:50 | Additional outages occur; IT still working on the issue with VMware |
| 30 May 2022 | 23:28 | Linda Ness sent a mail to gn4-3-all@lists.geant.org indicating that several services are down |
| 31 May 2022 | 07:33 - 07:46 | Pete Pedersen, Massimiliano Adamo and Allen Kong work on restoring the AD service by shutting down the FRA node and starting it back up |
| 31 May 2022 | 08:20 | Ian Galpin cannot log into the EMS servers; Pete Pedersen, Massimiliano Adamo and Allen Kong look into it. Allen Kong on #service-issue-30052022: "@massimiliano.adamo I'm looking at prod-event01 and seeing lots of nasty logical block warnings...can you check?" |
| 31 May 2022 | 08:50 | Pete Pedersen and Massimiliano Adamo restarted the EMS VMs and ran filesystem checks |
| 31 May 2022 | (time TBC) | Pete Pedersen and Massimiliano Adamo fixed Puppet/PostgreSQL (details to be filled in) |
| 31 May 2022 | 09:45 | Ian Galpin tested and verified that service was restored for EMS, Connectivity Map and Compendium DB; Mandeep Saini updated #it |
Proposed Solutions
- The core issue appears to be related to VMware storage; IT need to provide a solution for monitoring the health of the VM storage (see the monitoring sketch at the end of this section)
- The HA of primary services (EMS, Map, etc.) and dependent core services (PostgreSQL, etc.) needs to be improved:
  - Additional monitoring configuration or setup changes are required to more easily identify the root cause of an outage
  - If a core service goes down (e.g. no PostgreSQL primary set), that alarm will be hard to spot among the hundreds of subsequent alarms
- Postgres
  - FAILURE: fail-over failed to work because it required the witness to be active, and the witness had itself failed because it runs on the same VM cluster as the primary DB server
  - Proposed solutions:
    - Add two more witness servers so that there is one on every site
    - Change the way the Postgres clients connect so that they are cluster aware (i.e. all nodes are listed in the connection string with the targetServerType=primary option) and connect directly to Postgres. This removes the need for the Consul config / intermediate layer (fewer parts, less to break); see the connection sketch below.
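To illustrate the second suggestion, the sketch below shows a cluster-aware client connection using psycopg2 (the driver already seen in the timeline errors). The host names, database name and credentials are placeholders and are not taken from this report; note that targetServerType=primary is the JDBC parameter, and the libpq/psycopg2 equivalent is target_session_attrs.

```python
# Illustrative sketch only: host names, database name and credentials below
# are placeholders, not values confirmed in this incident report.
import psycopg2

# libpq (used by psycopg2) accepts a comma-separated host list and tries each
# node in turn; target_session_attrs=read-write makes it skip standbys and
# connect only to the current primary, so no Consul/DNS alias is needed.
# (target_session_attrs=primary is also accepted on libpq 14+.)
DSN = (
    "host=prod-postgres01.geant.org,prod-postgres02.geant.org "
    "port=5432,5432 "
    "dbname=indico user=indico password=CHANGE_ME "
    "target_session_attrs=read-write "
    "connect_timeout=5"
)

conn = psycopg2.connect(DSN)
with conn, conn.cursor() as cur:
    cur.execute("SELECT pg_is_in_recovery()")  # returns False on the primary
    print(cur.fetchone())
conn.close()
```

If the primary moves to another listed node after a fail-over, a reconnect with the same DSN will find it without any DNS or Consul change.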
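On the storage-monitoring point above: the EMS VMs were logging "logical block" warnings before anyone could log in, so even a simple host-level check for kernel I/O errors would have surfaced the storage problem earlier. The sketch below is a minimal, hypothetical example; the log source, patterns and exit codes are assumptions, and a real solution would also need to monitor the VMware datastore side, which this script does not cover.

```python
# Minimal sketch of a host-level check for the kind of kernel I/O errors seen
# during this incident (e.g. "logical block" warnings on prod-events01).
import re
import subprocess
import sys

# Patterns are an assumption based on typical Linux block-layer error messages.
PATTERNS = re.compile(
    r"(I/O error|logical block|Buffer I/O error|blk_update_request)", re.I
)

def recent_kernel_errors() -> list[str]:
    # Reading dmesg may require elevated privileges on some hosts.
    out = subprocess.run(
        ["dmesg", "--ctime"], capture_output=True, text=True, check=False
    ).stdout
    return [line for line in out.splitlines() if PATTERNS.search(line)]

if __name__ == "__main__":
    errors = recent_kernel_errors()
    for line in errors[-20:]:
        print(line)
    # Nagios/Icinga-style exit codes so a monitoring agent can raise an alert
    # (0 = OK, 2 = CRITICAL) when storage errors are present.
    sys.exit(2 if errors else 0)
```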