Incident Description

The EMS service was not running on either prod-events01.geant.org or prod-events02.geant.org, so EMS was unavailable.


The impact of this service degradation was:


Incident severity: Complete service outage

Data loss: 

Total duration of incident: 2 days


Timeline

All times are in UTC

Date | Time | Description

 

06:20

According to the Puppet and systemd logs, Puppet tried to re-install EMS on prod-events01.geant.org and prod-events02.geant.org.

The pip re-install of EMS left partial EMS directories (subfolders with a tilde in their name) in the Python virtual environment. This behaviour is explained on the following Stack Overflow page: https://stackoverflow.com/questions/55565760/anaconda-python-site-packages-subfolders-with-tilde-in-name-what-are-they
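
The following is a minimal sketch, not the procedure used during the incident, of how such leftovers could be detected. It assumes the partial directories are the tilde-prefixed subfolders of site-packages described on the Stack Overflow page above. The function name find_partial_installs is hypothetical, and the script is assumed to be run with the interpreter of the virtual environment being inspected.

```python
from pathlib import Path
import sysconfig


def find_partial_installs(site_packages: Path) -> list[Path]:
    """Return directories in site_packages whose names start with '~'.

    Assumption: a tilde-prefixed directory is the remnant of an
    interrupted pip uninstall or re-install.
    """
    return sorted(
        entry
        for entry in site_packages.iterdir()
        if entry.is_dir() and entry.name.startswith("~")
    )


if __name__ == "__main__":
    # Resolve the site-packages directory of the interpreter running
    # this script (i.e. the virtual environment's own interpreter).
    site_packages = Path(sysconfig.get_paths()["purelib"])
    for leftover in find_partial_installs(site_packages):
        print(f"partial install left behind: {leftover}")
```

A check like this could be run after each deployment, so that a broken virtual environment is noticed before users report an outage.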

 

08:18

User enquired on Slack (#general) whether EMS was down

 

08:30 - 09:12

Ian Galpin investigated and resolved the issue

  • Determined that the venv for EMS was broken
  • Disabled puppet
  • Fixed virtualenv
  • Restarted services

 

09:12

Service was restored, and Ian Galpin informed the user on the Slack channel.

Proposed Solution