Incident Management Process (draft)

Establish who the affected users and stakeholders are

  • Start from the list of stakeholders here: Service Catalogue

Communicate information about the incident to the affected users and stakeholders

  • Do this before taking any other action

The relevant team members should look into the issue

  • First priority is to restore service

Create an Incident Report

  • Start with one of the previous Incident Reports as a template: Incidents
  • Save the new Incident Report here as a new child page
  • Add the incident page to the Index table at the bottom of the page
  • Basic information:
    • Timeline (how and when it was identified, when service was restored, etc.)
    • Other information
    • Optional future mitigations
  • If the issue is taking a long time to resolve, we must update the users every 3-4 hours. Linda Ness can probably help or advise with this.

Index

Severity

  • CRITICAL Complete service outage
  • MED Partial service degradation
  • LOW Virtually no user impact

Data Loss

  • YES Data has been lost
  • NO No data was lost
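
The legend above defines a small fixed vocabulary for the index, so entries can be checked mechanically. The following is a minimal sketch in Python, assuming an index row is reduced to four text fields; the names `IndexEntry` and `parse_entry` are hypothetical and not part of any existing tooling.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    # Labels and descriptions copied from the Severity legend.
    CRITICAL = "Complete service outage"
    MED = "Partial service degradation"
    LOW = "Virtually no user impact"


class DataLoss(Enum):
    # Labels and descriptions copied from the Data Loss legend.
    YES = "Data has been lost"
    NO = "No data was lost"


@dataclass(frozen=True)
class IndexEntry:
    # Hypothetical record for one row of the index table.
    service: str
    severity: Severity
    data_loss: DataLoss
    incident_page: str


def parse_entry(service, severity, data_loss, incident_page):
    """Build an IndexEntry, raising ValueError for labels outside the legend."""
    try:
        sev = Severity[severity]
    except KeyError:
        raise ValueError(f"unknown severity: {severity!r}") from None
    try:
        loss = DataLoss[data_loss]
    except KeyError:
        raise ValueError(f"unknown data loss value: {data_loss!r}") from None
    return IndexEntry(service, sev, loss, incident_page)
```

With this sketch, `parse_entry("DNS", "CRITICAL", "NO", "DNS Outage 2019-02-27")` returns an entry, while a row whose Data Loss cell holds a severity label such as `CRITICAL` is rejected with a `ValueError`.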

| Service | Start Date | End Date | Severity | Data Loss | Incident Page |
|---|---|---|---|---|---|
| WordPress | | | CRITICAL | YES | Production wordpress site outage 2018-02-13 |
| WordPress | | | CRITICAL | NO | Production wordpress site outage 2018-02-22 |
| WordPress | | | CRITICAL | NO | Production wordpress site outage 2018-03-25 |
| Dashboard | | | CRITICAL | NO | Production Dashboard Outage 2018-06-18 |
| Staff IDP | | | CRITICAL | NO | |
| Sympa | | | CRITICAL | NO | Production Sympa Service Outage 2018-08-03 |
| Dashboard | | | CRITICAL | YES | Production Dashboard Outage 2018-07-11 |
| DNS | | | CRITICAL | NO | DNS Outage 2019-02-27 |
| SharePoint | | | CRITICAL | NO | SharePoint Outage 2019-02-07 |
| Dashboard | | | CRITICAL | NO | Production Dashboard Outage 2019-07-16 |
| Dashboard | | | CRITICAL | NO | Production Dashboard Outage 2019-07-27 |
| SharePoint | | | CRITICAL | NO | SharePoint Outage 2020-01-08 |
| SharePoint | | | MED | NO | RSS Feed in Jobs page Geant.org was down - 17/01/2020 |
| BRIAN | | | CRITICAL | YES | Brian Outage 2020-01-26 |
| Cacti | | | CRITICAL | YES | Cacti production incident - 06-03-2020 |
| Cacti | | | CRITICAL | YES | Cacti Production Instance - July 2020 |
| HAProxy | | | CRITICAL | NO | Haproxy Outage 2021-03-17 |
| ProxySQL | | | CRITICAL | CRITICAL | ProxySQL Outage 2021-07-12 |
| EMS | | | CRITICAL | NO | EMS - 2022-03-14 - Service Outage |
| EMS(DNS) | | | MED | NO | EMS - 2022-04-20 - Service Degradation |
| Dashboard | | | CRITICAL | YES | Production Dashboard - 2022-05-15 - Service Outage |
| PostgreSQL(VMWare) | | | CRITICAL | NO | PostgreSQL - 2022-05-30 - Wide-scale Service Outage |
| BRIAN | | | CRITICAL | NO | BRIAN - 2022-05-30/31 - Service Outage |
| BRIAN | | | CRITICAL | YES | BRIAN - 2023-02-26/27 - Service Outage |
| BRIAN | | | MED | NO | BRIAN 2023-11-16/17 Data Collection Outage |

All Incident Documents
