# Incident Management

# Defining incidents

Responding to an incident is disruptive to product development, so it's important for us to have a shared understanding of what an incident is.

Anything that blocks our users from completing what they set out to do is an incident. This may range from:

  1. Our users can't click on book viewing as the button has disappeared
  2. Our server is not serving users' requests properly

Therefore, errors that crop up without affecting our users should not be considered incidents. If, for example, there are browser errors that we log, these are not incidents and should be handled differently.

# Monitoring

We need basic principles and processes for monitoring our system so we can tell whether we have an ongoing incident.

# Synthetic monitoring

Nothing beats pretending to be a real user and completing a whole journey in production to know whether our production system is working. We'll do this by implementing SyntheticMonitoring.
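As a rough sketch of what that could look like, here's a hypothetical Playwright script that walks the "book a viewing" journey end to end. The URL, selectors, and steps are placeholders, not our real ones:

```typescript
// synthetic-check.ts — a minimal synthetic "book a viewing" journey sketch.
// The URL and selectors below are placeholders; adjust to the real app.
import { chromium } from "playwright";

async function runBookViewingJourney(): Promise<void> {
  const browser = await chromium.launch();
  try {
    const page = await browser.newPage();

    // Step 1: load the landing page as a real user would.
    await page.goto("https://example.upmo.dev", { waitUntil: "load" });

    // Step 2: the journey fails (and alerts) if the button is missing,
    // which covers incident example 1 above.
    await page.click("text=Book viewing", { timeout: 10_000 });

    // Step 3: confirm the booking form actually rendered.
    await page.waitForSelector("form#book-viewing", { timeout: 10_000 });
  } finally {
    await browser.close();
  }
}

runBookViewingJourney()
  .then(() => {
    console.log("Synthetic journey passed");
  })
  .catch((err) => {
    // A non-zero exit code is what the alerting layer keys off.
    console.error("Synthetic journey failed:", err);
    process.exit(1);
  });
```

A check like this would run on a schedule, from outside our own infrastructure, so it sees roughly what users see.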

# Alerting

When we have an incident in our system, we want to be the first ones to know about it, not our users. We'll find out first via our alerting system. The alerts we configure are incident-only alerts -- alerts that signify an incident (based on the definition above) has occurred. A failing synthetic transaction is therefore the best alert we can get. Alerting systems normally come with different priority levels; for now we'll only use the top priority level and skip warning-level alerts, to avoid alert fatigue in our small team.
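As an illustration only, the sketch below assumes the synthetic check runs as a CloudWatch Synthetics canary and wires a single incident-only alarm to an SNS topic via the AWS CDK. The canary name and e-mail address are placeholders:

```typescript
// A CDK sketch of a single, incident-only alarm on the synthetic check.
import { Stack, StackProps, Duration } from "aws-cdk-lib";
import { Construct } from "constructs";
import * as cloudwatch from "aws-cdk-lib/aws-cloudwatch";
import * as cloudwatchActions from "aws-cdk-lib/aws-cloudwatch-actions";
import * as sns from "aws-cdk-lib/aws-sns";
import * as subscriptions from "aws-cdk-lib/aws-sns-subscriptions";

export class IncidentAlertingStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    // Single top-priority channel: no warning-level topics, to avoid alert fatigue.
    const incidentTopic = new sns.Topic(this, "IncidentTopic");
    incidentTopic.addSubscription(
      new subscriptions.EmailSubscription("alastair@example.com")
    );

    // CloudWatch Synthetics canaries publish SuccessPercent per canary.
    const successPercent = new cloudwatch.Metric({
      namespace: "CloudWatchSynthetics",
      metricName: "SuccessPercent",
      dimensionsMap: { CanaryName: "book-viewing-journey" },
      statistic: "Average",
      period: Duration.minutes(5),
    });

    // Alarm only when the journey actually fails: an incident by our definition.
    const journeyFailed = new cloudwatch.Alarm(this, "JourneyFailedAlarm", {
      metric: successPercent,
      threshold: 100,
      comparisonOperator: cloudwatch.ComparisonOperator.LESS_THAN_THRESHOLD,
      evaluationPeriods: 1,
      treatMissingData: cloudwatch.TreatMissingData.BREACHING,
    });
    journeyFailed.addAlarmAction(new cloudwatchActions.SnsAction(incidentTopic));
  }
}
```

If we end up running the check somewhere other than Synthetics, the same alarm can point at whatever failure metric the check emits instead.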

# Monitoring dashboard

Even though we want to be the first ones to know about incidents, some incidents can't be discovered by an automated system. There is a risk that our synthetic monitoring completes a journey while our users cannot. This is where monitoring the trends of our system's behaviour becomes important.

There are many metrics we could monitor; we'll focus on the Four Golden Signals and leave the rest available without customisation (CloudWatch).
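For illustration, a CDK sketch of such a dashboard is below. It assumes an Application Load Balancer with an ECS service behind it, which may not match our exact stack; the identifiers are placeholders and the metric names are CloudWatch's defaults for those services:

```typescript
// A CDK sketch of a Four Golden Signals dashboard over default CloudWatch metrics.
import { Stack, Duration } from "aws-cdk-lib";
import { Construct } from "constructs";
import * as cloudwatch from "aws-cdk-lib/aws-cloudwatch";

// Placeholder load balancer identifier.
const albDimensions = { LoadBalancer: "app/upmo-prod/0123456789abcdef" };

function albMetric(metricName: string, statistic: string): cloudwatch.Metric {
  return new cloudwatch.Metric({
    namespace: "AWS/ApplicationELB",
    metricName,
    dimensionsMap: albDimensions,
    statistic,
    period: Duration.minutes(5),
  });
}

export class GoldenSignalsDashboard extends Stack {
  constructor(scope: Construct, id: string) {
    super(scope, id);

    new cloudwatch.Dashboard(this, "GoldenSignals", {
      dashboardName: "four-golden-signals",
      widgets: [
        [
          // Latency: how long requests take.
          new cloudwatch.GraphWidget({
            title: "Latency (p95)",
            left: [albMetric("TargetResponseTime", "p95")],
          }),
          // Traffic: how much demand hits the system.
          new cloudwatch.GraphWidget({
            title: "Traffic (requests)",
            left: [albMetric("RequestCount", "Sum")],
          }),
        ],
        [
          // Errors: rate of failed requests.
          new cloudwatch.GraphWidget({
            title: "Errors (5xx)",
            left: [albMetric("HTTPCode_Target_5XX_Count", "Sum")],
          }),
          // Saturation: how close the service is to capacity.
          new cloudwatch.GraphWidget({
            title: "Saturation (CPU)",
            left: [
              new cloudwatch.Metric({
                namespace: "AWS/ECS",
                metricName: "CPUUtilization",
                dimensionsMap: { ClusterName: "upmo-prod", ServiceName: "web" },
                statistic: "Average",
                period: Duration.minutes(5),
              }),
            ],
          }),
        ],
      ],
    });
  }
}
```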

# Support hours and alert escalation

We don't have many users yet, and we only support London offices at the moment. We don't want to waste our time being on-call when the likelihood of an incident is low, which calls for two modes of support hours:

  1. Business as usual: We will support the system during our normal working hours.
  2. Traction moment: We will support the system during extended hours.

In business as usual mode, we don't need to respond to incidents while we're offline. This is the default mode that we'll always be in.

In traction moment mode, we will support the system for extended hours, defined on a case-by-case basis before we leverage a traction channel, for example paying for marketing campaigns or pay-per-click ads.

The alerts in both modes will always go to Alastair. In traction moment mode, Alastair and Wisen will both be standing by outside working hours.
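To keep that routing unambiguous, the policy could be captured as data that the alerting setup reads. A hypothetical sketch, with placeholder contact addresses:

```typescript
// escalation-policy.ts — the two support modes as data, not prose.
type SupportMode = "business-as-usual" | "traction-moment";

interface EscalationPolicy {
  supportHours: string;
  notify: string[];            // always paged during support hours
  outOfHoursStandby: string[]; // paged outside working hours, if anyone
}

const policies: Record<SupportMode, EscalationPolicy> = {
  "business-as-usual": {
    supportHours: "normal working hours (London)",
    notify: ["alastair@example.com"],
    outOfHoursStandby: [], // offline alerts wait until the next working day
  },
  "traction-moment": {
    supportHours: "extended hours, agreed case by case",
    notify: ["alastair@example.com"],
    outOfHoursStandby: ["alastair@example.com", "wisen@example.com"],
  },
};

// Business as usual is the default mode, per the policy above.
export const currentPolicy = policies["business-as-usual"];
```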

# Process

Incidents may induce panic and cause stress, and stress causes human errors. So when an incident is happening, we need a strict list of steps to follow to keep calm.

# Pre-incidents

This is the first step of the process when an incident happens. When you get the alert, the first thing you should do is validate it. If it's a real alert, proceed to the next step; if it's not, investigate why the false alert was raised and fix it.

# During incidents

If the incident is a real incident, keep calm, then:

  1. Start a new thread in Slack's #incidents channel as our Live Incident State Document.
  2. Start a new Zoom meeting as a war room for other people to join.
  3. Write the incident timeline and discoveries in the newly created thread. This is useful for handover and troubleshooting when various people are working together (and for learning purposes later).
  4. If the issue is with our code, is it revertible? Revert if possible!
  5. If the issue is with AWS or a SaaS vendor, start checking whether the problem is on their end.
  6. Check our Runbook for an existing resolution.
  7. Keep calm and troubleshoot!

# Post-incidents

Once an incident is resolved, it's time to learn from it.

  1. Based on the Slack thread, turn the live notes into an Incident State Document.
  2. Start a post-mortem brainstorming session and learn from the incident.
  3. Write up the learnings and publish them to upmo.dev.
  4. Action all of the action items that come out of the post-mortem; they take the highest priority, over product development.