The EMS service was not running on either prod-events01.geant.org or prod-events02.geant.org,
which left EMS unavailable.
The impact of this service degradation was:
Incident severity: Complete service outage
Data loss:
Total duration of incident: 2 days
All times are in UTC
| Date | Time | Description |
|---|---|---|
| | 06:20 | According to the Puppet and systemd logs, Puppet tried to re-install EMS. The pip re-install left partial EMS directories in the Python virtual environment. This behaviour is explained in the following Stack Overflow page: https://stackoverflow.com/questions/55565760/anaconda-python-site-packages-subfolders-with-tilde-in-name-what-are-they |
| | 08:18 | A user enquired on Slack (#general) whether EMS was down |
| | 08:30 - 09:12 | Ian Galpin investigated and resolved the issue |
| | 09:12 | Service was restored and Ian Galpin informed the user on the Slack channel |
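
The partial directories mentioned in the 06:20 entry could be detected with a small check like the one below. This is a minimal sketch, not the tooling used during the incident. It assumes the leftovers follow the tilde-prefixed naming described on the linked Stack Overflow page. The function name `find_partial_installs` is hypothetical, and the `site-packages` path has to be supplied by the caller because the report does not state where the virtual environment lives.

```python
from pathlib import Path


def find_partial_installs(site_packages):
    """Return the sorted names of entries in site_packages that start with '~'.

    Assumption: an interrupted pip re-install leaves its temporary
    directories behind under tilde-prefixed names, as described on the
    Stack Overflow page linked in the timeline.
    """
    root = Path(site_packages)
    # Only the top level of site-packages is inspected; nothing is deleted.
    return sorted(p.name for p in root.iterdir() if p.name.startswith("~"))
```

Calling `find_partial_installs()` with the path to the virtual environment's `site-packages` directory returns an empty list when the environment is clean. A non-empty result suggests an interrupted install that may need to be cleaned up and re-run.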
When EMS is not running on either server (prod-events01.geant.org and prod-events02.geant.org), HAProxy returns a generic 'Maintenance' page