Incident Management Process (draft)


Establish who the affected users and stakeholders are

  • A starting input for this list can be the list of stakeholders here: Service Catalogue


Communicate information about the incident to the affected users and stakeholders

  • Do this before taking any other action


The relevant team members should look into the issue

  • First priority is to restore service


Create an Incident Report

  • Start with one of the previous Incident Reports as a template: Incidents
  • Save the new Incident Report here as a new child page
  • Basic information:
    • Timeline (how and when the incident was identified, when service was restored, etc.)
    • Other information
    • Optional future mitigations
  • If the issue is taking a long time to resolve, we must update the users every 3-4 hours. Linda Ness can probably help or advise with this.
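
The 3-4 hour update cadence can be tracked with a small helper. The following is only a sketch: the function names are hypothetical, and it assumes the lower bound of the window (3 hours) as the interval so that updates are never sent later than the process allows.

```python
from datetime import datetime, timedelta

# Assumption: use the lower bound of the 3-4 hour window as the interval.
UPDATE_INTERVAL = timedelta(hours=3)


def next_update_due(last_update: datetime,
                    interval: timedelta = UPDATE_INTERVAL) -> datetime:
    """Return the time by which the next user update should be sent."""
    return last_update + interval


def is_update_overdue(last_update: datetime, now: datetime,
                      interval: timedelta = UPDATE_INTERVAL) -> bool:
    """Return True once the next update is due or has been missed."""
    return now >= next_update_due(last_update, interval)
```

For example, if the last update went out at 09:00, `next_update_due` returns 12:00 on the same day, and `is_update_overdue` becomes true from 12:00 onwards.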


Index

Severity

  • CRITICAL Complete service outage
  • MED Partial service degradation
  • LOW Virtually no user impact

Data Loss

  • YES Data has been lost
  • NO No data was lost

| Service    | Start Date | End Date | Severity | Data Loss | Incident Page                                           |
|------------|------------|----------|----------|-----------|---------------------------------------------------------|
| DNS        |            |          | CRITICAL | NO        | DNS Outage 2019-02-27                                   |
| SharePoint |            |          | CRITICAL | NO        | SharePoint Outage 2019-02-07                            |
| SharePoint |            |          | CRITICAL | NO        | SharePoint Outage 2020-01-08                            |
| SharePoint |            |          | MED      | NO        | RSS Feed in Jobs page Geant.org was down - 17/01/2020   |
| BRIAN      |            |          | CRITICAL | YES       | Brian Outage 2020-01-26                                 |
| Cacti      |            |          | CRITICAL | YES       | Cacti production incident - 06-03-2020                  |
| Cacti      |            |          | CRITICAL | YES       | Cacti Production Instance - July 2020                   |
| HAProxy    |            |          | CRITICAL | NO        | Haproxy Outage 2021-03-17                               |
| ProxySQL   |            |          | CRITICAL | YES       | ProxySQL Outage 2021-07-12                              |
| EMS        |            |          | CRITICAL | NO        | EMS - 2022-03-14 - Service Outage                       |
| EMS/DNS    |            |          | MED      | NO        | EMS - 2022-04-20 - Service Degradation                  |
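
Each index entry pairs a service with a severity level, a data-loss flag, and a link to the incident page. As an illustration only, one way to model such an entry is sketched here. The class and field names are hypothetical and not part of any existing tooling; the severity descriptions are taken from the legend above.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import List, Optional


class Severity(Enum):
    """Severity levels, with descriptions copied from the index legend."""
    CRITICAL = "Complete service outage"
    MED = "Partial service degradation"
    LOW = "Virtually no user impact"


@dataclass
class IndexEntry:
    """One row of the incident index (hypothetical model)."""
    service: str
    severity: Severity
    data_loss: bool
    incident_page: str
    start_date: Optional[date] = None  # left empty when not recorded
    end_date: Optional[date] = None

    def as_row(self) -> List[str]:
        """Render the entry in the column order used by the index."""
        return [
            self.service,
            self.start_date.isoformat() if self.start_date else "",
            self.end_date.isoformat() if self.end_date else "",
            self.severity.name,
            "YES" if self.data_loss else "NO",
            self.incident_page,
        ]
```

Keeping severity as an enumeration means an entry cannot carry a level outside CRITICAL, MED, and LOW.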











