Incident Description
EMS (via https://events.geant.org) and other services (GÉANT Connectivity Map, Compendium Database) were unavailable for a few minutes at a time throughout the day. This was due to the PostgreSQL service being unavailable.
The reasons for the degradation:
- prod-events01.geant.org and prod-events02.geant.org could not resolve the hostname for the PostgreSQL server (prod-postgres.geant.org)
- The first PostgreSQL server (prod-postgres01.geant.org) and the replication witness (prod-postgres-witness.geant.org) failed to start up correctly due to the VMware storage problems
- Multiple sites in GÉANT's infrastructure experienced service interruption
- The underlying cause seems to be a storage error on VMware. IT are investigating together with VMware support
- cf. IT incident: 30052022
The impact of this service degradation was that the following services were intermittently unavailable: EMS (https://events.geant.org), the GÉANT Connectivity Map and the Compendium Database.
Incident severity: Intermittent service outage
Data loss: 
Total duration of incident: 21 hours
Timeline
All times are in UTC
| Date | Time | Description |
|---|---|---|
| 30 May 2022 | 13:10 | First error in indico.log of PostgreSQL being unavailable: OperationalError: (psycopg2.OperationalError) could not translate host name "prod-postgres.geant.org" to address: Name or service not known |
| 30 May 2022 | 13:20 | First user query about EMS login problems (Slack #it) |
| 30 May 2022 | 13:24 | Service restored and acknowledged by users on Slack #it (the first of many periods of the service becoming unavailable and then available again) |
| 30 May 2022 | 13:27 | Ian Galpin starts investigating and finds the DNS resolution error: (psycopg2.OperationalError) could not translate host name "prod-postgres.geant.org" to address |
| 30 May 2022 | 16:08 | IT confirms, via Massimiliano Adamo on Slack #swd-private, that there is a VMware storage issue |
| 30 May 2022 | 20:50 | Additional outages occur; IT still working on the issue with VMware |
| 30 May 2022 | 23:28 | Linda Ness sent a mail to gn4-3-all@lists.geant.org indicating that several services are down |
| 31 May 2022 | 07:33 - 07:46 | Pete Pedersen, Massimiliano Adamo and Allen Kong work on restoring the AD service by shutting down the FRA node and starting it back up |
| 31 May 2022 | 08:20 | Ian Galpin cannot log into the EMS servers; Pete Pedersen, Massimiliano Adamo and Allen Kong look into it. Allen Kong on #service-issue-30052022: "@massimiliano.adamo I'm looking at prod-event01 and seeing lots of nasty logical block warnings...can you check?" |
| 31 May 2022 | 08:50 | Pete Pedersen and Massimiliano Adamo restarted the EMS VMs and ran filesystem checks |
| 31 May 2022 | (time TBC) | Pete Pedersen and Massimiliano Adamo fixed Puppet/PostgreSQL (details to be filled in) |
| 31 May 2022 | 09:45 | Ian Galpin tested and verified that service was restored for EMS, Connectivity Map and Compendium DB; Mandeep Saini updated #it |
Proposed Solutions
- The core issue appears to be related to VMware storage; IT need to provide a solution for monitoring the health of the VM storage (see the monitoring sketch at the end of this section)
- The HA of primary services (EMS, Map, etc.) and dependent core services (PostgreSQL, etc.) needs to be improved:
  - Additional monitoring configuration or setup changes are required to more easily identify the root cause of an outage
  - If a core service goes down (e.g. no PostgreSQL primary set), that alarm will be hard to spot among the hundreds of subsequent alarms
- Postgres
  - FAILURE: fail-over failed to work because it required the witness to be active, and the witness had itself failed because it runs on the same VM cluster as the primary DB server
  - Proposed solutions:
    - Add two more witness servers so that there is one on every site
    - Change the way the Postgres clients connect so that they are cluster aware (i.e. all nodes are listed in the connection string with the targetServerType=primary option) and connect directly to Postgres. This removes the need for the Consul config / intermediate layer (fewer parts, less to break); see the connection sketch below.
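To illustrate the second suggestion, the sketch below shows a cluster-aware client connection using psycopg2 (the driver already seen in the timeline errors). The host names, database name and credentials are placeholders and are not taken from this report; note that targetServerType=primary is the JDBC parameter, and the libpq/psycopg2 equivalent is target_session_attrs.

```python
# Illustrative sketch only: host names, database name and credentials below
# are placeholders, not values confirmed in this incident report.
import psycopg2

# libpq (used by psycopg2) accepts a comma-separated host list and tries each
# node in turn; target_session_attrs=read-write makes it skip standbys and
# connect only to the current primary, so no Consul/DNS alias is needed.
# (target_session_attrs=primary is also accepted on libpq 14+.)
DSN = (
    "host=prod-postgres01.geant.org,prod-postgres02.geant.org "
    "port=5432,5432 "
    "dbname=indico user=indico password=CHANGE_ME "
    "target_session_attrs=read-write "
    "connect_timeout=5"
)

conn = psycopg2.connect(DSN)
with conn, conn.cursor() as cur:
    cur.execute("SELECT pg_is_in_recovery()")  # returns False on the primary
    print(cur.fetchone())
conn.close()
```

If the primary moves to another listed node after a fail-over, a reconnect with the same DSN will find it without any DNS or Consul change.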
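On the storage-monitoring point above: the EMS VMs were logging "logical block" warnings before anyone could log in, so even a simple host-level check for kernel I/O errors would have surfaced the storage problem earlier. The sketch below is a minimal, hypothetical example; the log source, patterns and exit codes are assumptions, and a real solution would also need to monitor the VMware datastore side, which this script does not cover.

```python
# Minimal sketch of a host-level check for the kind of kernel I/O errors seen
# during this incident (e.g. "logical block" warnings on prod-events01).
import re
import subprocess
import sys

# Patterns are an assumption based on typical Linux block-layer error messages.
PATTERNS = re.compile(
    r"(I/O error|logical block|Buffer I/O error|blk_update_request)", re.I
)

def recent_kernel_errors() -> list[str]:
    # Reading dmesg may require elevated privileges on some hosts.
    out = subprocess.run(
        ["dmesg", "--ctime"], capture_output=True, text=True, check=False
    ).stdout
    return [line for line in out.splitlines() if PATTERNS.search(line)]

if __name__ == "__main__":
    errors = recent_kernel_errors()
    for line in errors[-20:]:
        print(line)
    # Nagios/Icinga-style exit codes so a monitoring agent can raise an alert
    # (0 = OK, 2 = CRITICAL) when storage errors are present.
    sys.exit(2 if errors else 0)
```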