Incident Management Process (draft)

Establish who the affected users and stakeholders are

A good starting point for this list is the list of stakeholders in the Service Catalogue

Communicate information about the incident to the affected users and stakeholders

Do this before taking any other action

The relevant team members should look into the issue

The first priority is to restore service

Create an Incident Report

Start with one of the previous Incident Reports as a template: Incidents
Save the new Incident Report here as a new child page
Add the incident page to the Index table at the bottom of the page
Basic information:
- Timeline (how/when it was identified, when service was restored, etc)
- Other information
- Optional future mitigations
If the issue is taking a long time to resolve, we must update the users every 3-4 hours. Linda Ness can probably help or advise with this.
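The 3-4 hour update rule can be turned into a simple reminder calculation. The sketch below is only an illustration: the function name `next_update_window` is hypothetical, and treating 3 hours as the reminder and 4 hours as the latest acceptable time is an assumption, not something this process prescribes.

```python
from __future__ import annotations

from datetime import datetime, timedelta

# Assumption: the "every 3-4 hours" rule is read as
# "remind after 3 hours, send no later than 4 hours after the last update".
REMIND_AFTER = timedelta(hours=3)
LATEST_AFTER = timedelta(hours=4)


def next_update_window(last_update: datetime) -> tuple[datetime, datetime]:
    """Return (reminder time, latest time) for the next user update."""
    return last_update + REMIND_AFTER, last_update + LATEST_AFTER


if __name__ == "__main__":
    # Hypothetical example: the last update to users went out at 09:00.
    remind_at, latest_at = next_update_window(datetime(2023, 11, 16, 9, 0))
    print("Send next update between", remind_at, "and", latest_at)
```

Calling the function again after each update keeps the schedule relative to the last message actually sent, rather than to the start of the incident.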

Index

Severity

CRITICAL Complete service outage
MED Partial service degradation
LOW Virtually no user impact

Data Loss

YES Data has been lost
NO No data was lost

Service	Start Date	End Date	Severity	Data Loss	Incident Page
WordPress	13 Feb 2018	13 Feb 2018	CRITICAL	YES	Production wordpress site outage 2018-02-13
WordPress	22 Feb 2018	22 Feb 2018	CRITICAL	NO	Production wordpress site outage 2018-02-22
WordPress	25 Mar 2018	25 Mar 2018	CRITICAL	NO	Production wordpress site outage 2018-03-25
Dashboard	18 Jun 2018	20 Jun 2018	CRITICAL	NO	Production Dashboard Outage 2018-06-18
Staff IDP	03 Aug 2018	03 Aug 2018	CRITICAL	NO	Production Staff IDP & BoD Service Outage 2018-08-03
Sympa	03 Aug 2018	06 Aug 2018	CRITICAL	NO	Production Sympa Service Outage 2018-08-03
Dashboard	11 Jul 2018	12 Jul 2018	CRITICAL	YES	Production Dashboard Outage 2018-07-11
DNS	27 Feb 2019	27 Feb 2019	CRITICAL	NO	DNS Outage 2019-02-27
SharePoint	07 Feb 2019	07 Feb 2019	CRITICAL	NO	SharePoint Outage 2019-02-07
Dashboard	16 Jul 2019	17 Jul 2019	CRITICAL	NO	Production Dashboard Outage 2019-07-16
Dashboard	28 Jul 2019	28 Jul 2019	CRITICAL	NO	Production Dashboard Outage 2019-07-27
SharePoint	08 Jan 2020	08 Jan 2020	CRITICAL	NO	SharePoint Outage 2020-01-08
SharePoint	17 Jan 2020	17 Jan 2020	MED	NO	RSS Feed in Jobs page Geant.org was down - 17/01/2020
BRIAN	27 Jan 2020	27 Jan 2020	CRITICAL	YES	Brian Outage 2020-01-26
Cacti	06 Mar 2020	10 Mar 2020	CRITICAL	YES	Cacti production incident - 06-03-2020
Cacti	22 Jul 2020	29 Jul 2020	CRITICAL	YES	Cacti Production Instance - July 2020
HAProxy	17 Apr 2021	17 Apr 2021	CRITICAL	NO	Haproxy Outage 2021-03-17
ProxySQL	10 Jul 2021	12 Jul 2021	CRITICAL	CRITICAL	ProxySQL Outage 2021-07-12
EMS	12 Mar 2022	14 Mar 2022	CRITICAL	NO	EMS - 2022-03-14 - Service Outage
EMS(DNS)	20 Apr 2022	20 Apr 2022	MED	NO	EMS - 2022-04-20 - Service Degradation
Dashboard	15 May 2022	16 May 2022	CRITICAL	YES	Production Dashboard - 2022-05-15 - Service Outage
PostgreSQL(VMWare)	30 May 2022	31 May 2022	CRITICAL	NO	PostgreSQL - 2022-05-30 - Wide-scale Service Outage
BRIAN	30 May 2022	31 May 2022	CRITICAL	NO	BRIAN - 2022-05-30/31 - Service Outage
BRIAN	26 Feb 2023	27 Feb 2023	CRITICAL	YES	BRIAN - 2023-02-26/27 - Service Outage
BRIAN	16 Nov 2023	17 Nov 2023	MED	NO	BRIAN 2023-11-16/17 Data Collection Outage
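Because Index rows are entered by hand, a quick check before saving can catch slips such as an End Date earlier than the Start Date or a value outside the Severity and Data Loss legends. The following is a minimal sketch, assuming rows are tab-separated in the column order shown above. The function name `check_index_row` and the sample rows are hypothetical and are not part of any existing tooling.

```python
from __future__ import annotations

from datetime import datetime

# Allowed values, taken from the Severity and Data Loss legends on this page.
SEVERITIES = {"CRITICAL", "MED", "LOW"}
DATA_LOSS = {"YES", "NO"}
# Dates look like "13 Feb 2018". Note that %b expects English month
# abbreviations only when the process runs in an English or C locale.
DATE_FORMAT = "%d %b %Y"


def check_index_row(row: str) -> list[str]:
    """Return a list of problems found in one tab-separated Index row."""
    fields = [field.strip() for field in row.split("\t")]
    if len(fields) != 6:
        return [f"expected 6 tab-separated fields, got {len(fields)}"]
    _service, start, end, severity, data_loss, _page = fields

    problems = []
    try:
        start_date = datetime.strptime(start, DATE_FORMAT)
        end_date = datetime.strptime(end, DATE_FORMAT)
    except ValueError:
        problems.append("dates must look like '13 Feb 2018'")
    else:
        if end_date < start_date:
            problems.append("End Date is before Start Date")
    if severity not in SEVERITIES:
        problems.append(f"unknown Severity: {severity!r}")
    if data_loss not in DATA_LOSS:
        problems.append(f"unknown Data Loss value: {data_loss!r}")
    return problems


if __name__ == "__main__":
    # Hypothetical rows, for illustration only.
    good = "ExampleService\t01 Jan 2024\t02 Jan 2024\tMED\tNO\tExample page"
    bad = "ExampleService\t02 Jan 2024\t01 Jan 2024\tCRITICAL\tMAYBE\tExample page"
    print(check_index_row(good))
    print(check_index_row(bad))
```

An empty list means the row passed every check. The check does not compare the dates against the incident page title, so that still needs a manual look.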

