Skip to main content

Incident Process

This document describes our incident handling process for network, operations and security issues.

This document assumes the user has a PagerDuty account and is part of the Analytical Platform team. To use the in Slack features the user must authorise the PagerDuty application in Slack.

1. Confirm that the event constitutes an incident

If this incident constitutes a security breach, you must also report it here

We define an incident as an event which:

  • is unplanned, and
  • impacts end users or our direct users (developers and engineers), or
  • degrades user-facing services, or
  • increases risk to production services

If this event does not constitute an incident, the appropriate response is ask a user to raise an support issue or raise a ticket in our GitHub repository, however if the issue is security related, it is raised here.

Once you are confident that you have an incident, declare it as such.

2. Declare the incident

Declaring an incident begins with updating Analytical Platform external status page.

You will be presented with a form for you to complete: Create a new PagerDuty incident

Impacted Service - choose all that apply from the following depending on the type of incident:

  • Cloud Platform
  • Modernisation Platform
  • Analytical Platform Ingestion
  • Analytical Platform Storage
  • Analytical Platform Compute
  • Analytical Platform Networking
  • Analytical Platform Identity

Which Impact to assign?

Impact Description
Major The whole platform is down or unavailable, all user applications are unavailable
Minor Part of the platform is down or unavailable, some user applications are impacted or unavailable
All Good Incident has been resolved, all service is restored

Select Time till Next Update. Note that users should be kept regularly updated especially during major outages, and choose accordingly.

Tick the Send update notifications to subscribers box.

Click Post button once you are happy with the form. This will update the external page. The PagerDuty Status Page slack integration will then automatically post the status in the analytical-platform-support and ask-data-engineering channels.

3. Managing an Incident

Incident communications must be maintained in a single Slack thread in the main analytical-platform team channel for maximum visibility.

3.1 Starting a Thread

Incident Slack thread should have a clear starting message and a thread emoji to ensure consistency. Ex: “Airflow Autoscaling Incident :thread:”

3.2 Capturing notes

Please note, communications in the Slack thread could later be used to reconstruct the timeline for post-incident report (if required) therefore it is critical to keep all comms to the single incident thread.

Regular updates to the external status page should be posted to allow stakeholder visibility.

4. Fix the problem

Please bear in mind that not every incident requires the whole team to be involved (even if they all want to join in).

Log a support ticket if necessary.

If the incident cannot be resolved within the team or if the issue lies with a third-party, log a support ticket with the third-party.

For AWS support, log a call in the AWS account affected.

3rd Party How to log a support ticket Escalation process
AWS Creating a support case On the case, or post in #ext-aws

5. End the incident

The incident is resolved once the user is no longer facing issues. This may be a temporary fix, in which case an issue should be created to put a permanent fix in place.

Update the external PagerDuty status page with the resolution.

This marks the official end of the incident.

6. Post-incident procedure

After the incident is resolved:

  • A new incident report should be created. The notes generated during the incident management in Slack can be used for the incident timeline records.
  • A blameless post mortem meeting should be scheduled to identify any processes that need to be improved
  • A runbook for how to fix this issue should be published
This page was last reviewed on 12 November 2024. It needs to be reviewed again on 12 February 2025 by the page owner #analytical-platform .