EMS (via https://events.geant.org) and other services (GÉANT Connectivity Map, Compendium Database) were unavailable for a few minutes at a time throughout the day. This was due to the PostgreSQL service being unavailable.
The reason for degradation:
The impact of this service degradation was that the following services were unavailable:
Incident severity: Intermittent service outage
Data loss:
Total duration of incident: 21 hours
All times are in UTC
| Date | Time | Description |
|---|---|---|
| | 13:10:00 | First error in indico.log of PostgreSQL being unavailable: OperationalError: (psycopg2.OperationalError) could not translate host name "prod-postgres.geant.org" to address: Name or service not known |
| | 13:20 | First user query about EMS login problem (Slack #it) |
| | 13:24 | Service restored and acknowledged by users on Slack #it (the first of many periods in which the service became unavailable and then available again) |
| | 13:27 | Ian Galpin starts investigating and finds the DNS resolving error: |
| | 16:08 | IT confirms that there is a VMware storage issue, via |
| | 20:50 | Additional outages occur; IT is still working on the issue with VMware |
| | 23:28 | Linda Ness sent a mail to gn4-3-all@lists.geant.org indicating several services are down |
| | 07:33 - 07:46 | Pete Pedersen, Massimiliano Adamo, and Allen Kong work on restoring the AD service by shutting down the FRA node and starting it back up |
| | 08:20 | Ian Galpin can't log into the EMS servers; Pete Pedersen, Massimiliano Adamo, and Allen Kong are looking into it. Allen Kong on #service-issue-30052022: |
| | 08:50 | Pete Pedersen and Massimiliano Adamo restarted the EMS VMs and ran filesystem checks |
| | 09:30 | Massimiliano Adamo and Pete Pedersen recover the primary DB server by rebooting it in recovery mode and running fsck to recover the data; the same had to be done for the witness node. The secondary PostgreSQL server had shut down because it had no witness. |
| | 09:45 | Ian Galpin tested and verified that service was restored for EMS/Map/CompendiumDB. Mandeep Saini updated #it |
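
The first symptom in the timeline was a failed host name lookup for the database server, not a refused connection. As a rough illustration only, the check below uses Python's standard `socket.getaddrinfo` to separate a DNS failure from other problems. The function name `can_resolve`, the `resolver` parameter, and the default port 5432 are assumptions made for this sketch; they are not taken from the EMS or Indico configuration.

```python
import socket


def can_resolve(hostname, port=5432, resolver=socket.getaddrinfo):
    """Return True if `hostname` resolves to at least one address.

    `resolver` defaults to socket.getaddrinfo and is injectable so the
    check can be exercised without real DNS (an assumption of this sketch).
    """
    try:
        # getaddrinfo raises socket.gaierror when the name cannot be
        # translated to an address, which is the condition behind the
        # "could not translate host name" error in the timeline.
        resolver(hostname, port)
        return True
    except socket.gaierror:
        return False
```

Calling `can_resolve("prod-postgres.geant.org")` from an affected application server would show whether the lookup itself is failing at that moment, which may help tell an intermittent DNS problem apart from a PostgreSQL process that is down.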