Incident Response & SRE Basics
On-call responsibility, incident lifecycle, postmortems, runbooks, and reliability mindset
On-Call Responsibility
Being on-call means you are accountable for production systems during your rotation. It's a skill, not just "having a phone."
What On-Call Actually Means
- You're the first line of defense for production incidents during your rotation
- You may be paged at any time: 3am, holidays, weekends
- Your goal: restore service as fast as possible, investigate causes after
- You are NOT expected to know everything; you ARE expected to escalate intelligently
Good On-Call Hygiene
Before your rotation:
- Read recent incident reports and know what broke recently
- Know the escalation path (who to call if you're stuck)
- Make sure your laptop, VPN, and access credentials work
- Test that you can actually receive pages
During an incident:
- Acknowledge the alert quickly (stops escalation timers)
- Communicate status early and often (even if it's just "I'm investigating")
- Focus on mitigation first, root cause second
- Don't work alone on a Sev-1; pull in help
After your rotation:
- Write runbooks for things that were hard to debug
- Fix flaky alerts that paged for non-issues
Incident Lifecycle
Detection → Triage → Mitigation → Resolution → Postmortem

1. Detection
How you find out:
- Automated alert from monitoring (best)
- Customer reports (worst: it means you failed to detect it yourself)
- Teammate notice
Alert fires → PagerDuty/OpsGenie → Page on-call → On-call acknowledges

2. Triage
Quickly answer:
- Is this real or a false positive?
- How many users are affected?
- What is the blast radius?
- What severity level is this?
Severity levels (example):
| Sev | Impact | Response time |
|---|---|---|
| Sev-1 | Complete outage, many users | Immediate, all hands |
| Sev-2 | Major feature down, some users | <15 minutes |
| Sev-3 | Minor feature impacted | <1 hour |
| Sev-4 | Degraded performance | Next business day |
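Some teams encode triage rules like these so severity assignment is consistent across responders. A minimal sketch; the function name and thresholds below mirror the example table and are illustrative assumptions, not a standard:

```python
# Illustrative severity triage helper; the thresholds follow the example
# table above and are assumptions, not an industry standard.

def classify_severity(full_outage: bool, feature_down: bool,
                      users_affected_pct: float) -> str:
    """Map rough blast-radius answers to a severity level."""
    if full_outage:
        return "Sev-1"   # complete outage: immediate, all hands
    if feature_down and users_affected_pct >= 10:
        return "Sev-2"   # major feature down for a chunk of users
    if feature_down:
        return "Sev-3"   # minor feature impacted
    return "Sev-4"       # degraded performance only

print(classify_severity(False, True, 30.0))  # → Sev-2, per the table
```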
3. Mitigation (Stop the Bleeding)
Goal: restore service, not find root cause. These are different.
```bash
# Common mitigation options:

# 1. Roll back the recent deployment
kubectl rollout undo deployment/myapp
helm rollback myapp 5

# 2. Scale up to handle load
kubectl scale deployment myapp --replicas=20

# 3. Restart stuck pods
kubectl rollout restart deployment/myapp

# 4. Switch traffic to a backup region
# (update DNS, load balancer weights)

# 5. Enable maintenance mode / feature flag
# Flip a feature flag to disable the broken feature

# 6. Restore a database backup (last resort)
aws rds restore-db-instance-from-db-snapshot --db-snapshot-identifier ...
```

4. Resolution
Service is restored. Now:
- Write a brief status update to stakeholders
- Confirm metrics are back to normal
- Verify the fix held for 15-30 minutes before declaring resolved
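The "verify the fix held" step can be sketched as a soak check against your metrics source. `get_error_rate`, the threshold, and the intervals below are illustrative stand-ins for a real metrics query (Prometheus, Datadog, etc.):

```python
# Soak-check sketch: poll an error-rate metric for a window before
# declaring the incident resolved. get_error_rate is a stand-in for
# your real metrics query; defaults assume 6 checks x 5 minutes = 30 min.
import time

def verify_fix_held(get_error_rate, threshold_pct=1.0,
                    checks=6, interval_s=300):
    """Return True if the error rate stays under threshold for all checks."""
    for _ in range(checks):
        if get_error_rate() > threshold_pct:
            return False  # regression: keep the incident open
        time.sleep(interval_s)
    return True

# Example with a stubbed metric source (no sleep, for illustration):
print(verify_fix_held(lambda: 0.2, checks=3, interval_s=0))  # True
```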
5. Postmortem
Written record of what happened, why, and what we're doing about it.
Postmortems
A postmortem is a blameless analysis of an incident. The goal is to learn, not to punish.
Why Blameless?
If people fear blame, they hide information, don't raise concerns, and learn nothing. A blameless culture means:
- Focus on processes, not people: "What process would have caught this?"
- No names in the root cause: "An operator deleted the database," not "John deleted the database"
- Everyone who was involved contributes to the analysis
Postmortem Template
## Incident Summary
**Date**: 2024-01-15
**Duration**: 1 hour 23 minutes (10:15 - 11:38 UTC)
**Severity**: Sev-2
**Impact**: Payment checkout broken for ~30% of users
**Status**: Resolved
---
## Timeline
| Time (UTC) | Event |
|------------|-------|
| 09:45 | Deployment of v1.2.4 to production |
| 10:15 | Alert fires: checkout error rate > 5% |
| 10:18 | On-call acknowledges, begins investigation |
| 10:32 | Root cause identified: new env var missing |
| 10:40 | Rollback initiated |
| 11:38 | Error rate returns to baseline, incident resolved |
---
## Root Cause Analysis
The v1.2.4 deployment introduced a new environment variable `PAYMENT_PROVIDER_KEY` that was not added to the production Kubernetes Secret. The checkout service failed to initialize the payment client on startup, causing all payment requests to fail with a 500 error.
**Why did the deployment succeed?**
The readiness probe checks `/health`, which does not validate payment client initialization.
**Why wasn't the missing secret caught earlier?**
There is no validation step in CI/CD that checks that all required environment variables are present in the target environment.
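The missing validation step could look roughly like the sketch below, which compares the variables the app declares it needs against the keys actually present in the target environment. `REQUIRED_VARS` and the secret contents are hypothetical; in a real pipeline the deployed keys would come from the Kubernetes Secret (e.g. via `kubectl get secret`):

```python
# Sketch of a CI/CD gate: fail the pipeline when a required env var is
# missing from the target environment. Names here are illustrative.

REQUIRED_VARS = ["DATABASE_URL", "PAYMENT_PROVIDER_KEY"]  # hypothetical list

def missing_env_vars(required, deployed):
    """Return the required variables absent from the target environment."""
    return sorted(set(required) - set(deployed))

# Hypothetical production secret keys (PAYMENT_PROVIDER_KEY absent):
prod_secret_keys = {"DATABASE_URL": "postgres://..."}

missing = missing_env_vars(REQUIRED_VARS, prod_secret_keys)
if missing:
    print(f"Deploy blocked, missing env vars: {missing}")
```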
---
## What Went Well
- Alert fired within 30 seconds of error rate spike
- Rollback took only 3 minutes once root cause was identified
- Status page was updated within 10 minutes of incident start
---
## What Went Poorly
- 17 minutes elapsed between alert and root cause identification
- No staging environment parity check caught the missing secret
- Runbook for payment service outages was outdated
---
## Action Items
| Action | Owner | Due Date |
|--------|-------|----------|
| Add readiness probe that validates payment client init | Backend team | 2024-01-22 |
| Add CI/CD check: validate all env vars exist in target environment | Platform team | 2024-01-29 |
| Update payment service runbook | On-call team | 2024-01-19 |
| Add staging → production environment parity tests | Platform team | 2024-02-05 |

Runbooks
A runbook is a documented procedure for handling a specific alert or situation. Think of it as the "what to do when this breaks" document.
Good Runbook Structure
# High Payment Error Rate
## When This Alert Fires
Alert: `PaymentErrorRate > 5% for 2 minutes`
Severity: Sev-2
## Impact
Users cannot complete purchases. Direct revenue impact.
## Diagnostic Steps
### 1. Check recent deployments
```bash
kubectl rollout history deployment/payment-service
# Did anything deploy in the last 30 minutes?
```

### 2. Check error details
```bash
kubectl logs -l app=payment-service --since=10m | grep "ERROR" | head -20
# Look for: connection errors, auth errors, timeout patterns
```

### 3. Check payment provider status
```bash
curl https://status.stripe.com/api/v2/summary.json | jq '.status'
# If the provider is down, there's nothing we can do; communicate to users
```

### 4. Check database connectivity
```bash
kubectl exec -it $(kubectl get pod -l app=payment-service -o name | head -1) -- \
  python -c "from app.db import db; db.execute('SELECT 1'); print('DB OK')"
```
## Mitigation Options
### Rollback (if a recent deployment caused it)
```bash
kubectl rollout undo deployment/payment-service
# Wait 2-3 minutes, then verify the error rate drops
```

### Scale up (if load-related)
```bash
kubectl scale deployment payment-service --replicas=10
```
## Escalation
If unable to resolve within 30 minutes, escalate to:
- Payment team lead: @payment-lead on Slack
- Engineering manager: listed in PagerDuty escalation policy
## Related Runbooks
- [Database connection failures](link)
- [High latency on payment service](link)

Reliability Mindset
SLI, SLO, SLA
| Term | Meaning | Example |
|---|---|---|
| SLI (Indicator) | What you measure | % of requests with latency < 200ms |
| SLO (Objective) | Target for the SLI | 99.5% of requests < 200ms per month |
| SLA (Agreement) | Contract with consequences | 99% or we give refunds |
Error budget = 100% - SLO target
- SLO = 99.5% → error budget = 0.5% = 3.6 hours/month
- If you've spent your budget, freeze deployments until next month
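The arithmetic above, as a sketch (a 30-day month is assumed):

```python
# Error budget arithmetic: budget = 100% - SLO, converted to allowed
# downtime per window. A 30-day month (720 hours) is assumed.

def error_budget_hours(slo_pct: float, days: int = 30) -> float:
    """Allowed downtime (hours) in a window for a given SLO percentage."""
    budget_fraction = (100.0 - slo_pct) / 100.0
    return budget_fraction * days * 24

print(round(error_budget_hours(99.5), 2))  # 3.6 hours/month
print(round(error_budget_hours(99.9), 2))  # 0.72 hours ≈ 43 minutes
```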
Reliability Patterns
Design for failure:
- What happens when this service goes down? (fault isolation)
- What happens if the database is slow? (timeouts, circuit breakers)
- What happens if one AZ fails? (multi-AZ deployment)
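The timeouts-and-circuit-breakers question above is usually answered with a small state machine around each downstream call. A minimal sketch, with illustrative thresholds and timeouts:

```python
# Minimal circuit-breaker sketch: CLOSED passes requests through, OPEN
# fails fast after repeated errors, HALF_OPEN lets one trial request
# through to test recovery. Thresholds/timeouts are illustrative.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout_s=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def call(self, fn):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.reset_timeout_s:
                self.state = "HALF_OPEN"  # recovering: allow a trial request
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold or self.state == "HALF_OPEN":
                self.state = "OPEN"       # trip: fail fast from now on
                self.opened_at = time.monotonic()
            raise
        self.failures = 0                  # success resets the breaker
        self.state = "CLOSED"
        return result
```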
Circuit breaker:
```
Normal:             requests → service → response
Service slow:       requests → circuit breaker → service (few requests pass, rest fail fast)
Service down:       requests → circuit breaker → immediate failure (don't wait for timeout)
Service recovering: circuit breaker lets through a few requests to test
```

Key reliability metrics:
- MTTR (Mean Time to Recovery): how fast you fix incidents
- MTTD (Mean Time to Detect): how fast you find incidents
- Change failure rate: % of deployments that cause incidents
- Deployment frequency: how often you deploy (more frequent = lower risk per deploy)
MTTR, change failure rate, and deployment frequency are three of the four DORA metrics (the fourth is lead time for changes), an industry-standard measure of DevOps maturity.
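As a sketch, MTTR and change failure rate can be computed directly from incident and deploy records; the record shapes and numbers below are illustrative:

```python
# Computing two of these metrics from incident/deploy records.
# The record shapes are illustrative, not from any particular tool.
from datetime import datetime, timedelta

incidents = [  # (detected, resolved) timestamps
    (datetime(2024, 1, 15, 10, 15), datetime(2024, 1, 15, 11, 38)),
    (datetime(2024, 1, 20, 3, 0), datetime(2024, 1, 20, 3, 37)),
]
deploys_this_month = 40
incident_causing_deploys = 2

mttr = sum(((end - start) for start, end in incidents), timedelta()) / len(incidents)
change_failure_rate = incident_causing_deploys / deploys_this_month

print(f"MTTR: {mttr}")                                     # 1:00:00
print(f"Change failure rate: {change_failure_rate:.0%}")   # 5%
```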