Observability
Metrics, logs, traces, RED/USE signals, alerting, Prometheus, Grafana, and OpenTelemetry
Metrics vs Logs vs Traces
These are the three pillars of observability. Together they answer: "What is wrong, and why?"
| Pillar | What it is | Answers |
|---|---|---|
| Metrics | Numeric measurements over time | "Is something wrong?" |
| Logs | Timestamped text events | "What happened?" |
| Traces | Request journey across services | "Where is it slow/broken?" |
The order of debugging:
- Metrics alert you that something is wrong
- Logs give you detail on what's happening
- Traces show you exactly where in a distributed request the problem is
RED & USE Signals
Two frameworks for knowing what to measure.
RED (for Services)
| Signal | What to measure |
|---|---|
| Rate | Requests per second |
| Errors | Error rate (% or count) |
| Duration | Request latency (p50, p95, p99) |
Use RED for every service: HTTP APIs, gRPC endpoints, queues.
```promql
# Rate - requests per second (last 5 min)
rate(http_requests_total[5m])

# Error rate
rate(http_requests_total{status=~"5.."}[5m])
  /
rate(http_requests_total[5m])

# Latency p99
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
```
USE (for Resources)
| Signal | What to measure |
|---|---|
| Utilization | % time resource is busy |
| Saturation | Work waiting (queue depth) |
| Errors | Error count |
Use USE for infrastructure: CPUs, memory, disk, network.
```promql
# CPU utilization
100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory saturation (swap usage)
(node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) / node_memory_SwapTotal_bytes

# Disk utilization (share of time spent doing I/O)
rate(node_disk_io_time_seconds_total[5m])
```
Alerting Philosophy
Good alerts wake people up for things they can act on. Bad alerts train people to ignore pages.
Principles
- Alert on symptoms, not causes - "error rate > 5%", not "CPU > 80%"
- Alert on user impact - things users experience, not internal metrics
- Use SLOs as alert thresholds - alert when you're burning your error budget (a burn-rate sketch follows this list)
- Have runbooks - every alert should link to a document explaining what to do
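As a sketch of what an SLO-based threshold can look like: with a 99.9% availability SLO the error budget is 0.1%, and a common "fast burn" rule pages when the error ratio over the last hour exceeds 14.4x that budget (roughly 2% of a 30-day budget burned in one hour). Metric names follow the http_requests_total examples above; the SLO and multiplier are assumptions to adjust for your service.

```promql
# Fast-burn SLO alert expression (sketch): 99.9% SLO -> 0.001 error budget
(
  sum(rate(http_requests_total{status=~"5.."}[1h]))
    /
  sum(rate(http_requests_total[1h]))
) > (14.4 * 0.001)
```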
Alert Severity Levels
| Level | Meaning | Response |
|---|---|---|
| Critical/Page | User impact right now | Wake someone up, act immediately |
| Warning | Will become critical if not fixed | Fix during business hours |
| Info | Awareness, no action needed | Dashboard/ticket only, no page |
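These levels only mean something if they route differently. A minimal Alertmanager routing sketch, assuming a PagerDuty receiver for pages and a Slack receiver for warnings (receiver names, keys, and channel are placeholders):

```yaml
# alertmanager.yaml (sketch) - route on the severity label set by alert rules
route:
  receiver: slack-warnings          # default receiver
  routes:
    - matchers:
        - severity = "critical"
      receiver: pagerduty-oncall    # pages a human
    - matchers:
        - severity = "warning"
      receiver: slack-warnings      # business-hours follow-up

receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: <pagerduty-integration-key>
  - name: slack-warnings
    slack_configs:
      - api_url: <slack-webhook-url>
        channel: "#alerts"
```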
```yaml
# Example Prometheus alerting rule
groups:
  - name: myapp.rules
    rules:
      - alert: HighErrorRate
        expr: |
          rate(http_requests_total{status=~"5.."}[5m])
            /
          rate(http_requests_total[5m]) > 0.05
        for: 2m   # must be true for 2 minutes before firing
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }}"
          runbook: "https://wiki.example.com/runbooks/high-error-rate"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p99 latency above 2 seconds"
```
Prometheus
Prometheus is a time-series database that scrapes metrics from your services.
Architecture
```text
Apps expose /metrics endpoint
        │
        ▼
Prometheus scrapes (pulls) metrics every 15s
        │
        ├─ Stores in time-series DB
        ├─ Evaluates alerting rules → Alertmanager → PagerDuty/Slack
        └─ Grafana queries via PromQL
```
Metric Types
| Type | Description | Example |
|---|---|---|
| Counter | Monotonically increasing | http_requests_total |
| Gauge | Goes up and down | memory_usage_bytes |
| Histogram | Bucketed observations (for latency) | http_request_duration_seconds_bucket |
| Summary | Pre-calculated quantiles | rpc_latency_seconds |
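As a rough sketch of how a counter and a histogram are registered and exposed on /metrics from application code, here is a Node.js example using the prom-client and express packages (the libraries, port, and handler are assumptions for illustration, not something the stack above mandates):

```javascript
// metrics.js (sketch) - define metrics and expose them for Prometheus to scrape
const express = require('express');
const client = require('prom-client');

client.collectDefaultMetrics(); // built-in process gauges (memory, CPU, event loop lag)

// Counter: only ever goes up; read it with rate() in PromQL
const httpRequests = new client.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'status'],
});

// Histogram: bucketed observations, queried with histogram_quantile()
const httpDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Request latency in seconds',
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2, 5],
});

const app = express();

app.get('/hello', (req, res) => {
  const stopTimer = httpDuration.startTimer(); // records the duration when called
  res.send('ok');
  httpRequests.inc({ method: 'GET', status: '200' });
  stopTimer();
});

// The /metrics endpoint Prometheus scrapes
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});

app.listen(8080);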
Kubernetes Setup (kube-prometheus-stack)
```bash
# Install the full monitoring stack
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set grafana.adminPassword=mysecretpassword \
  --values monitoring-values.yaml
```
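With this stack, Prometheus discovers scrape targets through ServiceMonitor objects rather than hand-edited scrape configs. A minimal sketch, assuming your app's Service carries the label app: myapp and exposes a port named http (all names here are illustrative):

```yaml
# servicemonitor.yaml (sketch) - tells the Prometheus Operator to scrape myapp
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp
  namespace: monitoring
  labels:
    release: monitoring        # matches the Helm release so this Prometheus picks it up
spec:
  selector:
    matchLabels:
      app: myapp               # selects the Kubernetes Service to scrape
  namespaceSelector:
    matchNames:
      - production
  endpoints:
    - port: http               # named port on the Service
      path: /metrics
      interval: 15s
```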
PromQL Basics
```promql
# Instant vector - current value
up                       # 1 = up, 0 = down
http_requests_total     # all series

# Filter by label
http_requests_total{method="GET", status="200"}
http_requests_total{status=~"5.."}    # regex: 5xx errors

# Range vector - values over time window
http_requests_total[5m]    # last 5 minutes of samples

# Functions
rate(http_requests_total[5m])        # per-second rate of increase
irate(http_requests_total[5m])       # instant rate (last 2 samples)
increase(http_requests_total[1h])    # total increase over 1 hour

# Aggregation
avg(rate(http_requests_total[5m]))
sum by (status) (rate(http_requests_total[5m]))

# Aggregation over time
avg_over_time(response_time[24h])

# Quantile from histogram
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
```
Grafana Dashboards
Grafana visualizes metrics from Prometheus (and other sources).
```bash
# Access Grafana (port-forward)
kubectl port-forward svc/monitoring-grafana 3000:80 -n monitoring

# Default credentials
# admin / mysecretpassword (set during install)
```
Dashboard Best Practices
- Start with RED/USE - dedicate one row per service to Rate, Errors, Duration
- Show p50, p95, p99 latency - not just the average (averages hide outliers)
- Add SLO panels - show error budget burn rate
- Annotations for deployments - mark when deploys happened
- Link to runbooks - panel descriptions can link to the wiki
```json
// Example panel query (in dashboard JSON)
{
  "expr": "histogram_quantile(0.99, sum by(le) (rate(http_request_duration_seconds_bucket[5m])))",
  "legendFormat": "p99 latency"
}
```
Logging Patterns
Structured logging - always log JSON in production:
```json
{
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "error",
  "service": "payment-service",
  "trace_id": "abc123",
  "span_id": "def456",
  "user_id": "usr_789",
  "message": "Payment failed",
  "error": "connection timeout",
  "amount": 99.99,
  "currency": "USD"
}
```
Why structured logging matters:
- Machine-parseable - easy to query in Loki/Elasticsearch/CloudWatch
- Consistent fields - can alert on log patterns
- Trace ID - links logs to traces (a logger sketch follows this list)
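A rough sketch of attaching the trace_id and span_id fields automatically, assuming the pino logger and the OpenTelemetry API used later on this page (logger choice and field names are illustrative):

```javascript
// logger.js (sketch) - structured JSON logs that carry the active trace/span IDs
const pino = require('pino');
const { trace } = require('@opentelemetry/api');

const logger = pino({
  // mixin() merges extra fields into every log line
  mixin() {
    const span = trace.getActiveSpan();
    if (!span) return {};
    const { traceId, spanId } = span.spanContext();
    return { trace_id: traceId, span_id: spanId };
  },
});

// Usage inside a traced request handler:
logger.error({ user_id: 'usr_789', error: 'connection timeout' }, 'Payment failed');
```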
Log Aggregation Stack
```text
Apps → stdout/stderr
        │
        ▼   (collected by DaemonSet)
Promtail (on each K8s node)
        │
        ▼
Loki (log storage)
        │
        ▼
Grafana (query with LogQL)
```

```bash
# Install Loki stack
helm install loki grafana/loki-stack \
  --namespace monitoring \
  --set promtail.enabled=true \
  --set grafana.enabled=false   # already have grafana
```

```logql
# LogQL queries (Grafana / Loki)
{namespace="production", app="myapp"} |= "ERROR"
{app="myapp"} | json | level="error" | line_format "{{.message}}"
```
OpenTelemetry
Why It Exists
Before OpenTelemetry, every vendor had its own SDK (Datadog, New Relic, Jaeger, Zipkin). Switching vendors meant rewriting all your instrumentation.
OpenTelemetry (OTel) is the vendor-neutral standard for collecting traces, metrics, and logs. You instrument once, export to any backend.
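Which backend the data goes to is decided in the Collector's configuration, not in application code. A minimal sketch, assuming an OTLP-speaking backend such as Tempo reachable at tempo:4317 (endpoint and pipeline are illustrative):

```yaml
# otel-collector-config.yaml (sketch) - receive OTLP, batch, export to one backend
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:

exporters:
  otlp:
    endpoint: tempo:4317   # swap this exporter to change vendors, no code changes
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```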
How Traces Flow
```text
User request hits Service A
  → Service A creates a "span" (start time, attributes)
  → Service A calls Service B (propagates trace context via HTTP header)
  → Service B creates a child span
  → Service B queries DB (another child span)
  → Service B responds
  → Service A gets response, closes its span
  → Both services send spans to OTel Collector
  → Collector exports to Jaeger/Tempo/Datadog/etc.
```

```text
Trace: user request
├── Span: Service A /api/order (100ms total)
│   ├── Span: Service B /checkout (60ms)
│   │   ├── Span: DB query SELECT * FROM... (45ms)   ← bottleneck!
│   │   └── Span: Redis cache check (2ms)
│   └── Span: Send confirmation email (5ms)
```
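The trace context mentioned above normally travels in the W3C traceparent HTTP header: a version, the trace ID shared by every span in the request, the calling span's ID, and a sampled flag. The values below are only an illustration:

```text
# W3C Trace Context header (example values)
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
#            version - trace-id (32 hex) - parent span-id (16 hex) - trace-flags (01 = sampled)
```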
Basic Instrumentation (Node.js)
```javascript
// tracing.js - initialize before anything else
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces',
  }),
  instrumentations: [getNodeAutoInstrumentations()], // auto-instruments HTTP, Express, pg, redis, etc.
});

sdk.start();
```

```javascript
// Manual spans for custom operations
const { trace, SpanStatusCode } = require('@opentelemetry/api');
const tracer = trace.getTracer('myapp');

async function processPayment(orderId, amount) {
  return tracer.startActiveSpan('process-payment', async (span) => {
    try {
      span.setAttribute('order.id', orderId);
      span.setAttribute('payment.amount', amount);

      const result = await chargeCard(amount);
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (error) {
      span.recordException(error);
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
      throw error;
    } finally {
      span.end();
    }
  });
}
```
Debugging with Traces
```text
# Scenario: "Payment API is slow for some users"

# Step 1: Metrics alert fires - p99 latency > 2s
# Step 2: Find slow traces in Jaeger/Tempo
#         Filter: service=payment-api, duration > 2s

# Step 3: Open a slow trace - see span waterfall:

Trace (2.3s):
├── POST /api/payments (2.3s)
│   ├── Validate request (1ms)
│   ├── Fetch user (45ms)
│   ├── Check fraud score (1.8s)               ← 78% of time here!
│   │   ├── HTTP call to fraud-service (1.8s)
│   │   └── DB query in fraud-service (1.7s)   ← missing index!
│   └── Process charge (250ms)

# Step 4: Fix - add index to fraud-service DB query
# Step 5: Verify with traces that p99 improved
```
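If the traces land in Grafana Tempo, step 2 can also be expressed as a TraceQL search instead of UI filters; a sketch, assuming the service reports itself as payment-api:

```
{ resource.service.name = "payment-api" && duration > 2s }
```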