
Incident Description

EMS (via https://events.geant.org) has been intermittently unavailable, for a few minutes at a time, throughout the day.


The reason for degradation:

  • prod-events01.geant.org and prod-events02.geant.org could not resolve the hostname of the PostgreSQL server (prod-postgres.geant.org)
  • Multiple sites in GÉANT's infrastructure experienced service interruptions
  • The underlying cause seems to be a storage error on VMware. IT are investigating along with VMware support
  • TODO: Add link to IT incident page
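
The name-resolution failure described above can be checked independently of the application. The following is a minimal sketch, assuming Python is available on the application hosts. The function name `can_resolve`, its `resolver` parameter, and the port 5432 (the PostgreSQL default) are illustrative assumptions and are not taken from the EMS configuration.

```python
import socket


def can_resolve(hostname, port=5432, resolver=socket.getaddrinfo):
    """Return True if `hostname` resolves to at least one address.

    `resolver` defaults to socket.getaddrinfo and can be swapped out
    for testing. Port 5432 is an assumed default, not a confirmed value.
    """
    try:
        return len(resolver(hostname, port)) > 0
    except socket.gaierror:
        # Raised for lookup failures such as "Name or service not known".
        return False


if __name__ == "__main__":
    host = "prod-postgres.geant.org"
    state = "resolves" if can_resolve(host) else "does NOT resolve"
    print(f"{host} {state}")
```

Run from prod-events01 or prod-events02 during an outage window, a check like this would help separate a DNS lookup failure from a PostgreSQL outage.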


The impact of this service degradation was:

  • Users could not access EMS


Incident severity: CRITICAL (intermittent service outage)

Data loss: NO

Total duration of incident: 13 hours, ongoing (as of 22:22 UTC)


Timeline

All times are in UTC

| Date | Time | Description |
|  | 13:10:00 | First error in indico.log reporting PostgreSQL as unavailable: OperationalError: (psycopg2.OperationalError) could not translate host name "prod-postgres.geant.org" to address: Name or service not known |
|  | 13:20 | First user query about EMS login problem (Slack #it) |
|  | 13:24 | Service restored and acknowledged by users on Slack #it |
|  | 13:27 | Ian Galpin starts investigating and finds the DNS resolution error: (psycopg2.OperationalError) could not translate host name "prod-postgres.geant.org" to address |
|  | 16:08 | IT confirms that there is a VMware storage issue, via Massimiliano Adamo on Slack #swd-private: "it's a storage issue." |
|  | 20:50 | Additional outages occur; IT still working on the issue with VMware |





Proposed Solution

  • The core issue seems to be the VMware storage fault, so IT need to provide a solution