
Observability

Metrics, logs, traces, RED/USE signals, alerting, Prometheus, Grafana, and OpenTelemetry


Metrics vs Logs vs Traces

These are the three pillars of observability. Together they answer: β€œWhat is wrong, and why?”

| Pillar  | What it is                      | Answers                    |
|---------|---------------------------------|----------------------------|
| Metrics | Numeric measurements over time  | “Is something wrong?”      |
| Logs    | Timestamped text events         | “What happened?”           |
| Traces  | Request journey across services | “Where is it slow/broken?” |

The order of debugging:

  1. Metrics alert you that something is wrong
  2. Logs give you detail on what’s happening
  3. Traces show you exactly where in a distributed request the problem is

RED & USE Signals

Two frameworks for knowing what to measure.

RED (for Services)

| Signal   | What to measure                 |
|----------|---------------------------------|
| Rate     | Requests per second             |
| Errors   | Error rate (% or count)         |
| Duration | Request latency (p50, p95, p99) |

Use RED for every service: HTTP APIs, gRPC endpoints, queues.

# Rate β€” requests per second (last 5 min)
rate(http_requests_total[5m])
# Error rate
rate(http_requests_total{status=~"5.."}[5m])
/
rate(http_requests_total[5m])
# Latency p99
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

USE (for Resources)

| Signal      | What to measure                |
|-------------|--------------------------------|
| Utilization | % of time the resource is busy |
| Saturation  | Work waiting (queue depth)     |
| Errors      | Error count                    |

Use USE for infrastructure: CPUs, memory, disk, network.

# CPU utilization
100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory saturation (swap in use)
(node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) / node_memory_SwapTotal_bytes
# Disk saturation β€” fraction of time the device was busy with I/O
rate(node_disk_io_time_seconds_total[5m])

Alerting Philosophy

Good alerts wake people up for things they can act on. Bad alerts train people to ignore pages.

Principles

  1. Alert on symptoms, not causes β€” β€œerror rate > 5%” not β€œCPU > 80%”
  2. Alert on user impact β€” things users experience, not internal metrics
  3. Use SLOs as alert thresholds β€” alert when you’re burning your error budget (see the burn-rate sketch after this list)
  4. Have runbooks β€” every alert should link to a document explaining what to do
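
Principle 3 in practice: the multi-window burn-rate pattern from the Google SRE workbook. The sketch below assumes a 99.9% availability SLO (0.1% error budget); a sustained error ratio above 14.4Γ— the budget consumes roughly 2% of a 30-day budget within an hour. The metric name matches the RED examples above, and the thresholds are the conventional fast-burn defaults, not something specific to your setup.

# Rule fragment, same format as the full alerting-rule example below
- alert: ErrorBudgetFastBurn
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[1h]))
        / sum(rate(http_requests_total[1h]))
    ) > (14.4 * 0.001)
    and
    (
      sum(rate(http_requests_total{status=~"5.."}[5m]))
        / sum(rate(http_requests_total[5m]))
    ) > (14.4 * 0.001)
  labels:
    severity: critical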

Alert Severity Levels

| Level         | Meaning                           | Response                         |
|---------------|-----------------------------------|----------------------------------|
| Critical/Page | User impact right now             | Wake someone up, act immediately |
| Warning       | Will become critical if not fixed | Fix during business hours        |
| Info          | Awareness, no action needed       | Dashboard/ticket only, no page   |

# Example Prometheus alerting rule
groups:
  - name: myapp.rules
    rules:
      - alert: HighErrorRate
        expr: |
          rate(http_requests_total{status=~"5.."}[5m])
            / rate(http_requests_total[5m]) > 0.05
        for: 2m  # must be true for 2 minutes before firing
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }}"
          runbook: "https://wiki.example.com/runbooks/high-error-rate"
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p99 latency above 2 seconds"

Prometheus

Prometheus is a monitoring system built around a time-series database: it scrapes (pulls) metrics from your services over HTTP and evaluates alerting rules against them.

Architecture

Apps expose /metrics endpoint
β”‚
β–Ό
Prometheus scrapes (pulls) metrics every 15s
β”‚
β”œβ”€ Stores in time-series DB
β”œβ”€ Evaluates alerting rules β†’ Alertmanager β†’ PagerDuty/Slack
└─ Grafana queries via PromQL

Metric Types

| Type      | Description                         | Example                               |
|-----------|-------------------------------------|---------------------------------------|
| Counter   | Monotonically increasing            | http_requests_total                   |
| Gauge     | Goes up and down                    | memory_usage_bytes                    |
| Histogram | Bucketed observations (for latency) | http_request_duration_seconds_bucket  |
| Summary   | Pre-calculated quantiles            | rpc_latency_seconds                   |
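
A sketch of how these types look in application code, using the Node.js prom-client library (an assumed library choice β€” any official client works the same way; the metric names mirror the table above and the Express app is illustrative):

// metrics.js β€” expose a /metrics endpoint for Prometheus to scrape
const express = require('express');
const client = require('prom-client');

client.collectDefaultMetrics(); // process CPU, memory, event-loop lag, etc.

const httpRequests = new client.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'status'],
});
const httpDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Request latency in seconds',
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2, 5],
});

const app = express();
app.get('/hello', (req, res) => {
  const endTimer = httpDuration.startTimer();          // Histogram: time the request
  res.send('ok');
  httpRequests.inc({ method: 'GET', status: '200' });  // Counter: count it
  endTimer();
});
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});
app.listen(3000);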

Kubernetes Setup (kube-prometheus-stack)

# Install the full monitoring stack
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set grafana.adminPassword=mysecretpassword \
  --values monitoring-values.yaml
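
kube-prometheus-stack discovers scrape targets via ServiceMonitor resources rather than hand-edited scrape configs. A sketch, assuming your app's Service is labelled app: myapp in the production namespace and exposes a named metrics port called http:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp
  namespace: monitoring
  labels:
    release: monitoring      # must match the Helm release so this Prometheus picks it up
spec:
  selector:
    matchLabels:
      app: myapp             # matches the Service's labels
  namespaceSelector:
    matchNames: [production]
  endpoints:
    - port: http             # named port on the Service
      path: /metrics
      interval: 15s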

PromQL Basics

# Instant vector β€” current value
up # 1 = up, 0 = down
http_requests_total # all series
# Filter by label
http_requests_total{method="GET", status="200"}
http_requests_total{status=~"5.."} # regex: 5xx errors
# Range vector β€” values over time window
http_requests_total[5m] # last 5 minutes of samples
# Functions
rate(http_requests_total[5m]) # per-second rate of increase
irate(http_requests_total[5m]) # instant rate (last 2 samples)
increase(http_requests_total[1h]) # total increase over 1 hour
avg(rate(http_requests_total[5m]))
sum by (status)(rate(http_requests_total[5m]))
# Aggregation over time
avg_over_time(response_time[24h])
# Quantile from histogram
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

Grafana Dashboards

Grafana visualizes metrics from Prometheus (and other sources).

# Access Grafana (port-forward)
kubectl port-forward svc/monitoring-grafana 3000:80 -n monitoring
# Default credentials
# admin / mysecretpassword (set during install)

Dashboard Best Practices

  1. Start with RED/USE β€” dedicate one row per service to Rate, Errors, Duration
  2. Show p50, p95, p99 latency β€” not just average (averages hide outliers)
  3. Add SLO panels β€” show error budget burn rate
  4. Annotations for deployments β€” mark when deploys happened (see the curl sketch after the panel example)
  5. Link to runbooks β€” panel descriptions can link to wiki

// Example panel query (in dashboard JSON)
{
  "expr": "histogram_quantile(0.99, sum by(le) (rate(http_request_duration_seconds_bucket[5m])))",
  "legendFormat": "p99 latency"
}
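
A sketch of best practice 4: have CI post to Grafana's annotations API after each deploy so dashboards show a marker at that timestamp. The host assumes the port-forward above; the token, tags, and version are placeholders:

# Mark a deployment on dashboards (time defaults to "now")
curl -s -X POST http://localhost:3000/api/annotations \
  -H "Authorization: Bearer $GRAFANA_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"tags": ["deployment", "myapp"], "text": "Deployed myapp v1.4.2"}'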

Logging Patterns

# Structured logging β€” always log JSON in production
{
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "error",
  "service": "payment-service",
  "trace_id": "abc123",
  "span_id": "def456",
  "user_id": "usr_789",
  "message": "Payment failed",
  "error": "connection timeout",
  "amount": 99.99,
  "currency": "USD"
}

Why structured logging matters (a Node.js sketch follows this list):

  • Machine-parseable β†’ easy to query in Loki/Elasticsearch/CloudWatch
  • Consistent fields β†’ can alert on log patterns
  • Trace ID β†’ link logs to traces
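
A sketch of emitting logs in that shape from Node.js with pino (an assumed library choice; the field names mirror the example above, and the trace/span IDs would in practice come from the active OpenTelemetry context described later):

const pino = require('pino');

// pino writes one JSON object per line to stdout β€” exactly what Promtail/Loki expect
const logger = pino({
  base: { service: 'payment-service' },      // attached to every log line
  timestamp: pino.stdTimeFunctions.isoTime,  // ISO timestamps instead of epoch ms
});

logger.error(
  {
    trace_id: 'abc123',   // placeholder β€” read from the current span in real code
    span_id: 'def456',
    user_id: 'usr_789',
    error: 'connection timeout',
    amount: 99.99,
    currency: 'USD',
  },
  'Payment failed'
);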

Log Aggregation Stack

Apps β†’ stdout/stderr
β”‚
β–Ό (collected by DaemonSet)
Promtail (on each K8s node)
β”‚
β–Ό
Loki (log storage)
β”‚
β–Ό
Grafana (query with LogQL)
# Install the Loki stack
helm repo add grafana https://grafana.github.io/helm-charts
helm install loki grafana/loki-stack \
  --namespace monitoring \
  --set promtail.enabled=true \
  --set grafana.enabled=false   # already have Grafana
# LogQL queries (Grafana / Loki)
{namespace="production", app="myapp"} |= "ERROR"
{app="myapp"} | json | level="error" | line_format "{{.message}}"

OpenTelemetry

Why It Exists

Before OpenTelemetry, every vendor had its own SDK (Datadog, New Relic, Jaeger, Zipkin). Switching vendors meant rewriting all your instrumentation.

OpenTelemetry (OTel) is the vendor-neutral standard for collecting traces, metrics, and logs. You instrument once, export to any backend.

How Traces Flow

User request hits Service A
β†’ Service A creates a "span" (start time, attributes)
β†’ Service A calls Service B (propagates trace context via HTTP header)
β†’ Service B creates a child span
β†’ Service B queries DB (another child span)
β†’ Service B responds
β†’ Service A gets response, closes its span
β†’ Both services send spans to OTel Collector
β†’ Collector exports to Jaeger/Tempo/Datadog/etc.
Trace: user request
β”œβ”€β”€ Span: Service A /api/order (100ms total)
β”‚   β”œβ”€β”€ Span: Service B /checkout (60ms)
β”‚   β”‚   β”œβ”€β”€ Span: DB query SELECT * FROM... (45ms)  ← bottleneck!
β”‚   β”‚   └── Span: Redis cache check (2ms)
β”‚   └── Span: Send confirmation email (5ms)
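
The Collector in that flow is just a config-driven pipeline: receive OTLP, batch, export. A minimal sketch β€” the Tempo endpoint is a placeholder; swap in a Jaeger, Datadog, or other exporter as needed:

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318   # matches the exporter URL in the Node.js example below
processors:
  batch: {}
exporters:
  otlp:
    endpoint: tempo:4317         # placeholder backend
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]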

Basic Instrumentation (Node.js)

// tracing.js β€” initialize before anything else
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces',
  }),
  instrumentations: [getNodeAutoInstrumentations()], // auto-instruments HTTP, Express, pg, redis, etc.
});
sdk.start();

// Manual spans for custom operations
const { trace, SpanStatusCode } = require('@opentelemetry/api');
const tracer = trace.getTracer('myapp');

async function processPayment(orderId, amount) {
  return tracer.startActiveSpan('process-payment', async (span) => {
    try {
      span.setAttribute('order.id', orderId);
      span.setAttribute('payment.amount', amount);
      const result = await chargeCard(amount);
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (error) {
      span.recordException(error);
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
      throw error;
    } finally {
      span.end();
    }
  });
}

Debugging with Traces

# Scenario: "Payment API is slow for some users"
# Step 1: Metrics alert fires β€” p99 latency > 2s
# Step 2: Find slow traces in Jaeger/Tempo
# Filter: service=payment-api, duration > 2s
# Step 3: Open a slow trace β€” see span waterfall:
Trace (2.3s):
β”œβ”€β”€ POST /api/payments (2.3s)
β”‚   β”œβ”€β”€ Validate request (1ms)
β”‚   β”œβ”€β”€ Fetch user (45ms)
β”‚   β”œβ”€β”€ Check fraud score (1.8s)                    ← 78% of time here!
β”‚   β”‚   └── HTTP call to fraud-service (1.8s)
β”‚   β”‚       └── DB query in fraud-service (1.7s)    ← missing index!
β”‚   └── Process charge (250ms)
# Step 4: Fix β€” add index to fraud-service DB query
# Step 5: Verify with traces that p99 improved