Incident Description

EMS (accessed via https://events.geant.org) has been intermittently unavailable, for a few minutes at a time, throughout the day.

...

Total duration of incident: 13 hours/Ongoing (as of 22:22 UTC)


Timeline

All times are in UTC

Date   Time   Description

 

13:10:00 

First error in indico.log of PostgreSQL being unavailable:

OperationalError: (psycopg2.OperationalError) could not translate host name "prod-postgres.geant.org" to address: Name or service not known

 

13:20

First user query about EMS login problem (Slack #it)

 

13:24

Service restored and acknowledged by users on Slack #it

 

13:27

Ian Galpin starts investigating and finds the DNS resolution error: (psycopg2.OperationalError) could not translate host name "prod-postgres.geant.org" to address

 

16:08

IT confirms, via Massimiliano Adamo on Slack #swd-private, that there is a VMware storage issue:

"it's a storage issue."

 

20:50

Additional outages occur; IT is still working on the VMware issue.
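
The error logged at 13:10 and found again at 13:27 is a hostname lookup failure rather than a PostgreSQL failure as such. As a rough diagnostic sketch (not part of the original investigation), the same lookup can be repeated from the EMS host with Python's standard library. The hostname is taken from the log line above; the port 5432 is an assumption (PostgreSQL's default), and resolve_host is a hypothetical helper name.

```python
import socket

# Hostname taken from the indico.log error; the port is an assumption
# (PostgreSQL's default), not something stated in this report.
DB_HOST = "prod-postgres.geant.org"
DB_PORT = 5432


def resolve_host(host, port):
    """Return the sorted unique IP addresses for host, or [] if lookup fails."""
    try:
        infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # Name resolution failed, the same class of failure reported as
        # "could not translate host name ... to address".
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address for both IPv4 and IPv6.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    addresses = resolve_host(DB_HOST, DB_PORT)
    if addresses:
        print(f"{DB_HOST} resolves to {', '.join(addresses)}")
    else:
        print(f"{DB_HOST} could not be resolved")
```

Running this repeatedly during an outage window would show whether the lookup itself fails intermittently, which helps separate a DNS problem from a database problem.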





Proposed Solution

  • The core issue appears to be related to VMware, and IT need to provide a solution.