
Incident Response & SRE Basics

On-call responsibility, incident lifecycle, postmortems, runbooks, and reliability mindset


On-Call Responsibility

Being on-call means you are accountable for production systems during your rotation. It's a skill — not just "having a phone."

What On-Call Actually Means

  • You're the first line of defense for production incidents during your rotation
  • You may be paged at any time — 3am, holidays, weekends
  • Your goal: restore service as fast as possible, investigate causes after
  • You are NOT expected to know everything — you ARE expected to escalate intelligently

Good On-Call Hygiene

Before your rotation:

  • Read recent incident reports and know what broke recently
  • Know the escalation path (who to call if you're stuck)
  • Make sure your laptop, VPN, and access credentials work
  • Test that you can actually receive pages

During an incident:

  • Acknowledge the alert quickly (stops escalation timers)
  • Communicate status early and often (even if it's "I'm investigating")
  • Focus on mitigation first, root cause second
  • Don't work alone on a Sev-1 — pull in help

After your rotation:

  • Write runbooks for things that were hard to debug
  • Fix flaky alerts that paged for non-issues

Incident Lifecycle

Detection → Triage → Mitigation → Resolution → Postmortem

1. Detection

How you find out:

  • Automated alert from monitoring (best)
  • Customer reports (worst — means you failed to detect it yourself)
  • A teammate notices something odd

Typical flow: alert fires → PagerDuty/OpsGenie → page on-call → on-call acknowledges
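The detection chain usually begins with an alerting rule in your monitoring system. A hedged sketch in Prometheus-style YAML (the metric name `http_requests_total` and the 5% threshold are illustrative assumptions; routing to PagerDuty/OpsGenie happens in Alertmanager, not here):

```yaml
groups:
  - name: availability
    rules:
      - alert: HighErrorRate
        # Fire when more than 5% of requests over the last 5 minutes
        # returned a 5xx. Metric names are illustrative placeholders.
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 2m
        labels:
          severity: page   # Alertmanager routes severity=page to the on-call pager
        annotations:
          summary: "Error rate above 5% for 2 minutes"
```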

2. Triage

Quickly answer:

  • Is this real or a false positive?
  • How many users are affected?
  • What is the blast radius?
  • What severity level is this?

Severity levels (example):

| Sev | Impact | Response time |
|-----|--------|---------------|
| Sev-1 | Complete outage, many users | Immediate, all hands |
| Sev-2 | Major feature down, some users | <15 minutes |
| Sev-3 | Minor feature impacted | <1 hour |
| Sev-4 | Degraded performance | Next business day |

3. Mitigation (Stop the Bleeding)

Goal: restore service, not find root cause. These are different.

# Common mitigation options:
# 1. Rollback the recent deployment
kubectl rollout undo deployment/myapp
helm rollback myapp 5
# 2. Scale up to handle load
kubectl scale deployment myapp --replicas=20
# 3. Restart stuck pods
kubectl rollout restart deployment/myapp
# 4. Switch traffic to backup region
# (update DNS, load balancer weights)
# 5. Enable maintenance mode / feature flag
# Flip a feature flag to disable broken feature
# 6. Restore database backup (last resort)
aws rds restore-db-instance-from-db-snapshot --db-snapshot-identifier ...

4. Resolution

Service is restored. Now:

  • Write a brief status update to stakeholders
  • Confirm metrics are back to normal
  • Verify the fix held for 15-30 minutes before declaring resolved
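The "verify the fix held" step can be scripted as a sustained-window check. A minimal Python sketch; `fetch_error_rate` is a hypothetical callable standing in for whatever metrics API you query:

```python
import time

def fix_has_held(fetch_error_rate, threshold=0.01, window_s=900, interval_s=60):
    """Return True only if the error rate stays below `threshold` for a
    full `window_s` seconds (default 15 minutes).

    `fetch_error_rate` is a hypothetical callable returning the current
    error rate as a fraction (e.g. 0.02 for 2%).
    """
    deadline = time.monotonic() + window_s
    while time.monotonic() < deadline:
        if fetch_error_rate() >= threshold:
            return False  # regression: the fix did not hold
        time.sleep(interval_s)
    return True  # stayed healthy for the whole window
```

In practice you would run this against your real metrics backend before declaring the incident resolved.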

5. Postmortem

Written record of what happened, why, and what we're doing about it.


Postmortems

A postmortem is a blameless analysis of an incident. The goal is to learn, not to punish.

Why Blameless?

If people fear blame, they hide information, don't raise concerns, and learn nothing. A blameless culture means:

  • Focus on processes, not people — ask "What process would have caught this?"
  • No names in the root cause — "An operator deleted the database," not "John deleted the database"
  • Everyone who was involved contributes to the analysis

Postmortem Template

## Incident Summary
**Date**: 2024-01-15
**Duration**: 1 hour 23 minutes (10:15 - 11:38 UTC)
**Severity**: Sev-2
**Impact**: Payment checkout broken for ~30% of users
**Status**: Resolved
---
## Timeline
| Time (UTC) | Event |
|------------|-------|
| 09:45 | Deployment of v1.2.4 to production |
| 10:15 | Alert fires: checkout error rate > 5% |
| 10:18 | On-call acknowledges, begins investigation |
| 10:32 | Root cause identified: new env var missing |
| 10:40 | Rollback initiated |
| 11:38 | Error rate returns to baseline, incident resolved |
---
## Root Cause Analysis
The v1.2.4 deployment introduced a new environment variable `PAYMENT_PROVIDER_KEY`
that was not added to the production Kubernetes Secret. The checkout service failed
to initialize the payment client on startup, causing all payment requests to fail
with a 500 error.
**Why did the deployment succeed?**
The readiness probe checks `/health` which does not validate payment client initialization.
**Why wasn't the missing secret caught earlier?**
There is no validation step in CI/CD that checks all required environment variables
are present in the target environment.
---
## What Went Well
- Alert fired within 30 seconds of error rate spike
- Rollback took only 3 minutes once root cause was identified
- Status page was updated within 10 minutes of incident start
---
## What Went Poorly
- 23 minutes elapsed between alert and root cause identification
- No staging environment parity check caught the missing secret
- Runbook for payment service outages was outdated
---
## Action Items
| Action | Owner | Due Date |
|--------|-------|---------|
| Add readiness probe that validates payment client init | Backend team | 2024-01-22 |
| Add CI/CD check: validate all env vars exist in target environment | Platform team | 2024-01-29 |
| Update payment service runbook | On-call team | 2024-01-19 |
| Add staging → production environment parity tests | Platform team | 2024-02-05 |

Runbooks

A runbook is a documented procedure for handling a specific alert or situation. Think of it as the "what to do when this breaks" document.

Good Runbook Structure

# High Payment Error Rate
## When This Alert Fires
Alert: `PaymentErrorRate > 5% for 2 minutes`
Severity: Sev-2
## Impact
Users cannot complete purchases. Direct revenue impact.
## Diagnostic Steps
### 1. Check recent deployments
kubectl rollout history deployment/payment-service
# Did anything deploy in the last 30 minutes?
### 2. Check error details
kubectl logs -l app=payment-service --since=10m | grep "ERROR" | head -20
# Look for: connection errors, auth errors, timeout patterns
### 3. Check payment provider status
curl https://status.stripe.com/api/v2/summary.json | jq '.status'
# If provider is down → nothing we can do, communicate to users
### 4. Check database connectivity
kubectl exec -it $(kubectl get pod -l app=payment-service -o name | head -1) -- \
python -c "from app.db import db; db.execute('SELECT 1'); print('DB OK')"
## Mitigation Options
### Rollback (if recent deployment caused it)
kubectl rollout undo deployment/payment-service
# Wait 2-3 minutes, then verify error rate drops
### Scale up (if load-related)
kubectl scale deployment payment-service --replicas=10
## Escalation
If unable to resolve within 30 minutes, escalate to:
- Payment team lead: @payment-lead on Slack
- Engineering manager: listed in PagerDuty escalation policy
## Related Runbooks
- [Database connection failures](link)
- [High latency on payment service](link)

Reliability Mindset

SLI, SLO, SLA

| Term | Meaning | Example |
|------|---------|---------|
| SLI (Indicator) | What you measure | % of requests with latency < 200ms |
| SLO (Objective) | Target for the SLI | 99.5% of requests < 200ms per month |
| SLA (Agreement) | Contract with consequences | 99% or we give refunds |
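To make the SLI row concrete: a latency SLI is just the fraction of observed requests under the threshold. A small Python sketch with made-up sample latencies:

```python
def latency_sli(latencies_ms, threshold_ms=200):
    """SLI: fraction of requests faster than threshold_ms."""
    good = sum(1 for t in latencies_ms if t < threshold_ms)
    return good / len(latencies_ms)

# Illustrative sample: 8 of 10 requests are under 200 ms, so SLI = 0.8
samples = [120, 90, 340, 150, 180, 60, 500, 110, 95, 170]
```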

Error budget = 100% - SLO target

  • SLO = 99.5% → error budget = 0.5% = 3.6 hours per 30-day month
  • If you've spent your budget, freeze deployments until next month
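The budget arithmetic is worth making explicit. A one-function Python sketch (assumes a 30-day, 720-hour month, matching the 3.6-hour figure above):

```python
def error_budget_hours(slo_percent, period_hours=30 * 24):
    """Allowed downtime per period for a given SLO target.

    Assumes a 30-day (720-hour) month unless period_hours is given.
    """
    return (100 - slo_percent) / 100 * period_hours

# 99.5% SLO over a 30-day month: 0.5% of 720 h = 3.6 hours of budget
```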

Reliability Patterns

Design for failure:

  • What happens when this service goes down? (fault isolation)
  • What happens if the database is slow? (timeouts, circuit breakers)
  • What happens if one AZ fails? (multi-AZ deployment)

Circuit breaker:

Normal: requests → service → response
Service getting slow: requests → circuit breaker → service (few requests, rest fail-fast)
Service down: requests → circuit breaker → immediate failure (don't wait for timeout)
Service recovering: circuit breaker → lets through a few requests to test
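Those states can be sketched as a small state machine. The following Python is an illustrative minimum, not a production implementation; the failure threshold and reset timeout are assumptions:

```python
import time

class CircuitBreaker:
    """Minimal sketch: CLOSED until max_failures consecutive failures,
    then OPEN (fail fast); after reset_timeout_s, a HALF_OPEN trial
    request is let through to test whether the service recovered."""

    def __init__(self, max_failures=3, reset_timeout_s=30.0):
        self.max_failures = max_failures
        self.reset_timeout_s = reset_timeout_s
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout_s:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # HALF_OPEN: allow one trial request
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip to OPEN
            raise
        self.failures = 0  # a success closes the circuit again
        return result
```

The key property: when the downstream service is dead, callers fail in microseconds instead of each waiting out a full timeout.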

Key reliability metrics:

  • MTTR (Mean Time to Recovery) — how fast you fix incidents
  • MTTD (Mean Time to Detect) — how fast you find incidents
  • Change failure rate — % of deployments that cause incidents
  • Deployment frequency — how often you deploy (frequent, small deploys carry less risk each)

MTTR, change failure rate, and deployment frequency are three of the four DORA metrics — the industry-standard measure of DevOps maturity (the fourth is lead time for changes; MTTD is a common companion measure).
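These numbers fall out of simple incident and deployment records. A Python sketch (the record shape, with started/detected/resolved timestamps in minutes, is an assumption for illustration):

```python
def mean(xs):
    xs = list(xs)
    return sum(xs) / len(xs)

def dora_summary(incidents, deployments):
    """Compute MTTD, MTTR, and change failure rate from simple records.

    `incidents`: dicts with "started", "detected", "resolved" timestamps
    (minutes since some epoch). MTTR is measured from detection to
    resolution here, which is one common convention.
    `deployments`: dict with "total" deploy count and "failed" counting
    deploys that caused an incident.
    """
    return {
        "mttd_min": mean(i["detected"] - i["started"] for i in incidents),
        "mttr_min": mean(i["resolved"] - i["detected"] for i in incidents),
        "change_failure_rate": deployments["failed"] / deployments["total"],
    }
```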