Incident Response & SRE Basics
On-call responsibility, incident lifecycle, postmortems, runbooks, and reliability mindset
On-Call Responsibility
Being on-call means you are accountable for production systems during your rotation. It's a skill, not just "having a phone."
What On-Call Actually Means
- You're the first line of defense for production incidents during your rotation
- You may be paged at any time: 3am, holidays, weekends
- Your goal: restore service as fast as possible, investigate causes after
- You are NOT expected to know everything; you ARE expected to escalate intelligently
Good On-Call Hygiene
Before your rotation:
- Read recent incident reports and know what broke recently
- Know the escalation path (who to call if you're stuck)
- Make sure your laptop, VPN, and access credentials work
- Test that you can actually receive pages
During an incident:
- Acknowledge the alert quickly (stops escalation timers)
- Communicate status early and often (even if it's just "I'm investigating")
- Focus on mitigation first, root cause second
- Don't work alone on a Sev-1; pull in help
After your rotation:
- Write runbooks for things that were hard to debug
- Fix flaky alerts that paged for non-issues
Incident Lifecycle
Detection → Triage → Mitigation → Resolution → Postmortem

1. Detection
How you find out:
- Automated alert from monitoring (best)
- Customer reports (worst: it means you failed to detect it yourself)
- Teammate notice
Alert fires → PagerDuty/OpsGenie → Page on-call → On-call acknowledges

2. Triage
Quickly answer:
- Is this real or a false positive?
- How many users are affected?
- What is the blast radius?
- What severity level is this?
Severity levels (example):
| Sev | Impact | Response time |
|---|---|---|
| Sev-1 | Complete outage, many users | Immediate, all hands |
| Sev-2 | Major feature down, some users | <15 minutes |
| Sev-3 | Minor feature impacted | <1 hour |
| Sev-4 | Degraded performance | Next business day |
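Some teams encode triage rules like these so severity assignment is consistent across responders. A minimal sketch; the function name and thresholds below mirror the example table and are illustrative assumptions, not a standard:

```python
# Illustrative severity triage helper; the thresholds follow the example
# table above and are assumptions, not an industry standard.

def classify_severity(full_outage: bool, feature_down: bool,
                      users_affected_pct: float) -> str:
    """Map rough blast-radius answers to a severity level."""
    if full_outage:
        return "Sev-1"   # complete outage: immediate, all hands
    if feature_down and users_affected_pct >= 10:
        return "Sev-2"   # major feature down for a chunk of users
    if feature_down:
        return "Sev-3"   # minor feature impacted
    return "Sev-4"       # degraded performance only

print(classify_severity(False, True, 30.0))  # → Sev-2, per the table
```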
3. Mitigation (Stop the Bleeding)
Goal: restore service, not find root cause. These are different.
```bash
# Common mitigation options:

# 1. Roll back the recent deployment
kubectl rollout undo deployment/myapp
helm rollback myapp 5

# 2. Scale up to handle load
kubectl scale deployment myapp --replicas=20

# 3. Restart stuck pods
kubectl rollout restart deployment/myapp

# 4. Switch traffic to a backup region
# (update DNS, load balancer weights)

# 5. Enable maintenance mode / feature flag
# Flip a feature flag to disable the broken feature

# 6. Restore a database backup (last resort)
aws rds restore-db-instance-from-db-snapshot --db-snapshot-identifier ...
```

4. Resolution
Service is restored. Now:
- Write a brief status update to stakeholders
- Confirm metrics are back to normal
- Verify the fix held for 15-30 minutes before declaring resolved
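The "verify the fix held" step can be sketched as a soak check against your metrics source. `get_error_rate`, the threshold, and the intervals below are illustrative stand-ins for a real metrics query (Prometheus, Datadog, etc.):

```python
# Soak-check sketch: poll an error-rate metric for a window before
# declaring the incident resolved. get_error_rate is a stand-in for
# your real metrics query; defaults assume 6 checks x 5 minutes = 30 min.
import time

def verify_fix_held(get_error_rate, threshold_pct=1.0,
                    checks=6, interval_s=300):
    """Return True if the error rate stays under threshold for all checks."""
    for _ in range(checks):
        if get_error_rate() > threshold_pct:
            return False  # regression: keep the incident open
        time.sleep(interval_s)
    return True

# Example with a stubbed metric source (no sleep, for illustration):
print(verify_fix_held(lambda: 0.2, checks=3, interval_s=0))  # True
```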
5. Postmortem
Written record of what happened, why, and what we're doing about it.
Postmortems
A postmortem is a blameless analysis of an incident. The goal is to learn, not to punish.
Why Blameless?
If people fear blame, they hide information, don't raise concerns, and learn nothing. A blameless culture means:
- Focus on processes, not people: "What process would have caught this?"
- No names in the root cause: "An operator deleted the database," not "John deleted the database"
- Everyone who was involved contributes to the analysis
Postmortem Template
## Incident Summary
**Date**: 2024-01-15
**Duration**: 1 hour 23 minutes (10:15 - 11:38 UTC)
**Severity**: Sev-2
**Impact**: Payment checkout broken for ~30% of users
**Status**: Resolved
---
## Timeline
| Time (UTC) | Event |
|------------|-------|
| 09:45 | Deployment of v1.2.4 to production |
| 10:15 | Alert fires: checkout error rate > 5% |
| 10:18 | On-call acknowledges, begins investigation |
| 10:32 | Root cause identified: new env var missing |
| 10:40 | Rollback initiated |
| 11:38 | Error rate returns to baseline, incident resolved |
---
## Root Cause Analysis
The v1.2.4 deployment introduced a new environment variable `PAYMENT_PROVIDER_KEY` that was not added to the production Kubernetes Secret. The checkout service failed to initialize the payment client on startup, causing all payment requests to fail with a 500 error.
**Why did the deployment succeed?**
The readiness probe checks `/health`, which does not validate payment client initialization.
**Why wasn't the missing secret caught earlier?**
There is no validation step in CI/CD that checks that all required environment variables are present in the target environment.
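The missing validation step could look roughly like the sketch below, which compares the variables the app declares it needs against the keys actually present in the target environment. `REQUIRED_VARS` and the secret contents are hypothetical; in a real pipeline the deployed keys would come from the Kubernetes Secret (e.g. via `kubectl get secret`):

```python
# Sketch of a CI/CD gate: fail the pipeline when a required env var is
# missing from the target environment. Names here are illustrative.

REQUIRED_VARS = ["DATABASE_URL", "PAYMENT_PROVIDER_KEY"]  # hypothetical list

def missing_env_vars(required, deployed):
    """Return the required variables absent from the target environment."""
    return sorted(set(required) - set(deployed))

# Hypothetical production secret keys (PAYMENT_PROVIDER_KEY absent):
prod_secret_keys = {"DATABASE_URL": "postgres://..."}

missing = missing_env_vars(REQUIRED_VARS, prod_secret_keys)
if missing:
    print(f"Deploy blocked, missing env vars: {missing}")
```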
---
## What Went Well
- Alert fired within 30 seconds of error rate spike
- Rollback took only 3 minutes once root cause was identified
- Status page was updated within 10 minutes of incident start
---
## What Went Poorly
- 17 minutes elapsed between alert and root cause identification
- No staging environment parity check caught the missing secret
- Runbook for payment service outages was outdated
---
## Action Items
| Action | Owner | Due Date |
|--------|-------|----------|
| Add readiness probe that validates payment client init | Backend team | 2024-01-22 |
| Add CI/CD check: validate all env vars exist in target environment | Platform team | 2024-01-29 |
| Update payment service runbook | On-call team | 2024-01-19 |
| Add staging → production environment parity tests | Platform team | 2024-02-05 |

Runbooks
A runbook is a documented procedure for handling a specific alert or situation. Think of it as the "what to do when this breaks" document.
Good Runbook Structure
# High Payment Error Rate
## When This Alert Fires
Alert: `PaymentErrorRate > 5% for 2 minutes`
Severity: Sev-2
## Impact
Users cannot complete purchases. Direct revenue impact.
## Diagnostic Steps
### 1. Check recent deployments
```bash
kubectl rollout history deployment/payment-service
# Did anything deploy in the last 30 minutes?
```

### 2. Check error details
```bash
kubectl logs -l app=payment-service --since=10m | grep "ERROR" | head -20
# Look for: connection errors, auth errors, timeout patterns
```

### 3. Check payment provider status
```bash
curl https://status.stripe.com/api/v2/summary.json | jq '.status'
# If the provider is down, there's nothing we can do; communicate to users
```

### 4. Check database connectivity
```bash
kubectl exec -it $(kubectl get pod -l app=payment-service -o name | head -1) -- \
  python -c "from app.db import db; db.execute('SELECT 1'); print('DB OK')"
```
## Mitigation Options
### Rollback (if a recent deployment caused it)
```bash
kubectl rollout undo deployment/payment-service
# Wait 2-3 minutes, then verify the error rate drops
```

### Scale up (if load-related)
```bash
kubectl scale deployment payment-service --replicas=10
```
## Escalation
If unable to resolve within 30 minutes, escalate to:
- Payment team lead: @payment-lead on Slack
- Engineering manager: listed in PagerDuty escalation policy
## Related Runbooks
- [Database connection failures](link)
- [High latency on payment service](link)

Reliability Mindset
SLI, SLO, SLA
| Term | Meaning | Example |
|---|---|---|
| SLI (Indicator) | What you measure | % of requests with latency < 200ms |
| SLO (Objective) | Target for the SLI | 99.5% of requests < 200ms per month |
| SLA (Agreement) | Contract with consequences | 99% or we give refunds |
Error budget = 100% - SLO target
- SLO = 99.5% → error budget = 0.5% = 3.6 hours/month
- If you've spent your budget, freeze deployments until next month
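The arithmetic above, as a sketch (a 30-day month is assumed):

```python
# Error budget arithmetic: budget = 100% - SLO, converted to allowed
# downtime per window. A 30-day month (720 hours) is assumed.

def error_budget_hours(slo_pct: float, days: int = 30) -> float:
    """Allowed downtime (hours) in a window for a given SLO percentage."""
    budget_fraction = (100.0 - slo_pct) / 100.0
    return budget_fraction * days * 24

print(round(error_budget_hours(99.5), 2))  # 3.6 hours/month
print(round(error_budget_hours(99.9), 2))  # 0.72 hours ≈ 43 minutes
```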
Reliability Patterns
Design for failure:
- What happens when this service goes down? (fault isolation)
- What happens if the database is slow? (timeouts, circuit breakers)
- What happens if one AZ fails? (multi-AZ deployment)
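The timeouts-and-circuit-breakers question above is usually answered with a small state machine around each downstream call. A minimal sketch, with illustrative thresholds and timeouts:

```python
# Minimal circuit-breaker sketch: CLOSED passes requests through, OPEN
# fails fast after repeated errors, HALF_OPEN lets one trial request
# through to test recovery. Thresholds/timeouts are illustrative.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout_s=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def call(self, fn):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.reset_timeout_s:
                self.state = "HALF_OPEN"  # recovering: allow a trial request
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold or self.state == "HALF_OPEN":
                self.state = "OPEN"       # trip: fail fast from now on
                self.opened_at = time.monotonic()
            raise
        self.failures = 0                  # success resets the breaker
        self.state = "CLOSED"
        return result
```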
Circuit breaker:
```
Normal:             requests → service → response
Service slow:       requests → circuit breaker → service (few requests pass, rest fail fast)
Service down:       requests → circuit breaker → immediate failure (don't wait for timeout)
Service recovering: circuit breaker lets through a few requests to test
```

Key reliability metrics:
- MTTR (Mean Time to Recovery): how fast you fix incidents
- MTTD (Mean Time to Detect): how fast you find incidents
- Change failure rate: % of deployments that cause incidents
- Deployment frequency: how often you deploy (more frequent = lower risk per deploy)
MTTR, change failure rate, and deployment frequency are three of the four DORA metrics (the fourth is lead time for changes), an industry-standard measure of DevOps maturity.
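As a sketch, MTTR and change failure rate can be computed directly from incident and deploy records; the record shapes and numbers below are illustrative:

```python
# Computing two of these metrics from incident/deploy records.
# The record shapes are illustrative, not from any particular tool.
from datetime import datetime, timedelta

incidents = [  # (detected, resolved) timestamps
    (datetime(2024, 1, 15, 10, 15), datetime(2024, 1, 15, 11, 38)),
    (datetime(2024, 1, 20, 3, 0), datetime(2024, 1, 20, 3, 37)),
]
deploys_this_month = 40
incident_causing_deploys = 2

mttr = sum(((end - start) for start, end in incidents), timedelta()) / len(incidents)
change_failure_rate = incident_causing_deploys / deploys_this_month

print(f"MTTR: {mttr}")                                     # 1:00:00
print(f"Change failure rate: {change_failure_rate:.0%}")   # 5%
```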