
Kubernetes: Operational, Not Theoretical

Cluster architecture, pods, deployments, services, ingress, storage, reliability, and common failures


Why Kubernetes Exists

Docker containers are great, but you need something to:

  • Run containers across many servers (cluster)
  • Restart them when they crash
  • Scale up/down based on load
  • Route traffic to healthy instances
  • Roll out updates without downtime
  • Manage secrets and config

Kubernetes (K8s) does all of this. It's the operating system for your cloud infrastructure.


Current Versions & Upgrade Awareness

Kubernetes releases a new minor version every ~4 months. Each version is supported for about 14 months.

Terminal window
# Check your cluster version (the old --short flag was removed in recent kubectl releases)
kubectl version
# Check node versions
kubectl get nodes -o wide
# Check version skew (client vs server)
# kubectl must be within 1 minor version of the API server

Upgrade path: You must upgrade one minor version at a time (e.g., 1.28 → 1.29 → 1.30, not 1.28 → 1.30 directly).

Terminal window
# Check what's deprecated/removed in a new version
# Deprecated APIs aren't always obvious from kubectl alone - use pluto or kubent
kubent # scan for deprecated API usage before upgrading
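
On a kubeadm-managed cluster (an assumption; EKS, GKE, and AKS each have their own managed upgrade flow), the control-plane side of an upgrade looks roughly like this:

Terminal window
# Preview available versions and what the upgrade would change
kubeadm upgrade plan
# Apply one minor version at a time (replace with the exact patch release)
kubeadm upgrade apply v1.29.x
# Then upgrade kubelet and kubectl on each node, draining and uncordoning as you go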

Cluster Architecture (High-Level)

Control Plane:
├── API Server - all kubectl/API requests go here
├── etcd - cluster state database (key-value)
├── Scheduler - assigns pods to nodes
└── Controller Manager - reconciles desired vs actual state

Worker Nodes:
├── kubelet - runs on each node, manages pods
├── kube-proxy - handles service networking
└── Container Runtime (containerd)
Terminal window
# Control plane components
kubectl get pods -n kube-system
# Node status
kubectl get nodes
kubectl describe node my-node
# Cluster info
kubectl cluster-info
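
Beyond cluster-info, the API server also exposes aggregated health endpoints you can query directly (a quick sketch; /readyz and /livez have been available for several releases):

Terminal window
# Readiness of the API server and its dependencies, one check per line
kubectl get --raw='/readyz?verbose'
# Liveness of the API server process itself
kubectl get --raw='/livez?verbose'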

Pods & Controllers

Pod

The smallest deployable unit in Kubernetes. Usually contains one container (or tightly coupled sidecars).

# Bare pod (don't use in production - no restart if the node fails)
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  containers:
    - name: nginx
      image: nginx:1.25
      ports:
        - containerPort: 80
      resources:
        requests:
          memory: "64Mi"
          cpu: "100m"
        limits:
          memory: "128Mi"
          cpu: "500m"
Terminal window
# Pod operations
kubectl get pods
kubectl get pods -o wide # show node and IP
kubectl describe pod nginx
kubectl logs nginx
kubectl logs nginx -f # follow
kubectl logs nginx --previous # previous container (after crash)
kubectl exec -it nginx -- /bin/sh # shell into container
kubectl delete pod nginx

Deployments vs StatefulSets

                Deployment                    StatefulSet
Pod identity    Random (nginx-abc123)         Stable (nginx-0, nginx-1)
Storage         Shared or stateless           Each pod gets its own PVC
Scaling         Any order                     Ordered (0 first, then 1…)
Use case        Stateless apps (web, API)     Databases, Kafka, ZooKeeper
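
The Deployment manifest is shown next; for the StatefulSet side of the comparison, a minimal sketch (the postgres image, headless Service name, and storage size are illustrative assumptions):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres-headless   # headless Service that gives each pod a stable DNS name
  replicas: 2
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres:16
          ports:
            - containerPort: 5432
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:            # each replica gets its own PVC (data-postgres-0, data-postgres-1, …)
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi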
# Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: myapp
        version: "1.2.3"
    spec:
      containers:
        - name: myapp
          image: myrepo/myapp:1.2.3
          ports:
            - containerPort: 8080
          env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: myapp-secrets
                  key: database-url
          resources:
            requests:
              memory: "256Mi"
              cpu: "100m"
            limits:
              memory: "512Mi"
              cpu: "500m"
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
            failureThreshold: 3
Terminal window
# Deployment operations
kubectl apply -f deployment.yaml
kubectl get deployments
kubectl rollout status deployment/myapp
kubectl rollout history deployment/myapp
kubectl rollout undo deployment/myapp
kubectl scale deployment myapp --replicas=5
kubectl set image deployment/myapp myapp=myrepo/myapp:1.2.4

Services & Networking

Services provide stable DNS names and load balancing across pod replicas.

# ClusterIP (default) - internal access only
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp          # routes to pods with this label
  ports:
    - port: 80          # service port (cluster-internal)
      targetPort: 8080  # container port
  type: ClusterIP
---
# NodePort - exposes on every node's IP at a static port
spec:
  type: NodePort
  ports:
    - port: 80
      targetPort: 8080
      nodePort: 30080   # accessible at node-ip:30080
---
# LoadBalancer - provisions a cloud LB (AWS ELB, GCP LB, etc.)
spec:
  type: LoadBalancer
  ports:
    - port: 443
      targetPort: 8080
Terminal window
# Service operations
kubectl get services
kubectl describe service myapp
# Check if service routes to pods correctly
kubectl get endpoints myapp # should show pod IPs
# Test internal DNS (from inside a pod)
kubectl exec -it debug -- nslookup myapp.production.svc.cluster.local
kubectl exec -it debug -- curl http://myapp.production.svc.cluster.local/health
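
The two exec commands above assume a pod named debug already exists. If it doesn't, a throwaway pod works (the images here are just examples):

Terminal window
# Temporary shell with nslookup; the pod is deleted when you exit
kubectl run debug --rm -it --image=busybox:1.36 --restart=Never -- sh
# Same idea with curl preinstalled
kubectl run debug --rm -it --image=curlimages/curl --restart=Never -- sh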

Ingress

Ingress manages external HTTP/HTTPS routing to services. Requires an ingress controller (nginx, traefik, ALB Ingress Controller, etc.).
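
If no controller is installed yet, the ingress-nginx Helm chart is a common choice (a sketch; the namespace and release name are just conventions):

Terminal window
helm upgrade --install ingress-nginx ingress-nginx \
  --repo https://kubernetes.github.io/ingress-nginx \
  --namespace ingress-nginx --create-namespace
# Confirm the controller pods are running
kubectl get pods -n ingress-nginx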

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp-ingress
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - myapp.example.com
      secretName: myapp-tls
  rules:
    - host: myapp.example.com
      http:
        paths:
          - path: /api
            pathType: Prefix
            backend:
              service:
                name: api-service
                port:
                  number: 80
          - path: /
            pathType: Prefix
            backend:
              service:
                name: frontend-service
                port:
                  number: 80
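
A few checks after applying it (the hostname and TLS secret names come from the manifest above):

Terminal window
kubectl get ingress myapp-ingress # hosts, class, and the LB address once assigned
kubectl describe ingress myapp-ingress # backends and events from the controller
kubectl get secret myapp-tls # cert-manager should create this once the issuer succeeds
curl -I https://myapp.example.com/ # end-to-end check from outside the cluster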

ConfigMaps & Secrets

# ConfigMap - non-sensitive configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: myapp-config
data:
  APP_ENV: "production"
  LOG_LEVEL: "info"
  config.yaml: |
    server:
      port: 8080
      timeout: 30s
---
# Secret - sensitive data (base64 encoded, not encrypted by default)
apiVersion: v1
kind: Secret
metadata:
  name: myapp-secrets
type: Opaque
stringData:   # plain text here; stored base64-encoded
  database-url: "postgres://user:pass@db:5432/mydb"
  api-key: "sk-live-abc123"
Terminal window
# ConfigMap
kubectl create configmap myapp-config --from-env-file=.env
kubectl get configmap myapp-config -o yaml
kubectl describe configmap myapp-config
# Secret
kubectl create secret generic myapp-secrets \
--from-literal=database-url=postgres://user:pass@db:5432/mydb \
--from-file=tls.crt
kubectl get secret myapp-secrets -o jsonpath='{.data.database-url}' | base64 -d
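
To consume these from a pod, the usual patterns are envFrom for environment variables and a volume for files. A sketch using the names above:

spec:
  containers:
    - name: myapp
      image: myrepo/myapp:1.2.3
      envFrom:
        - configMapRef:
            name: myapp-config     # every key becomes an env var
        - secretRef:
            name: myapp-secrets
      volumeMounts:
        - name: config
          mountPath: /etc/myapp    # config.yaml shows up as a file here
  volumes:
    - name: config
      configMap:
        name: myapp-config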

Production secrets: Don't store real secrets in YAML. Use:

  • External Secrets Operator (syncs from AWS Secrets Manager)
  • Sealed Secrets (encrypted in git)
  • Vault Agent Injector
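
As an illustration of the first option, an ExternalSecret that syncs one value from AWS Secrets Manager looks roughly like this (the store name and remote key are assumptions, and the apiVersion depends on the operator version you run):

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: myapp-secrets
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager      # a ClusterSecretStore configured separately
    kind: ClusterSecretStore
  target:
    name: myapp-secrets            # the Kubernetes Secret the operator creates/updates
  data:
    - secretKey: database-url
      remoteRef:
        key: prod/myapp/database-url   # path in AWS Secrets Manager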

Storage: PVCs

# PersistentVolumeClaim - requests storage
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data
spec:
  accessModes:
    - ReadWriteOnce        # single node read-write
  storageClassName: gp3
  resources:
    requests:
      storage: 20Gi
---
# Use in a pod
spec:
  containers:
    - name: postgres
      image: postgres:16
      volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: postgres-data

Access modes:

Mode             Description               Use case
ReadWriteOnce    One node, read-write      Databases
ReadOnlyMany     Many nodes, read-only     Shared config
ReadWriteMany    Many nodes, read-write    Shared file storage (NFS, EFS)
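
The gp3 class referenced earlier has to exist in the cluster. On EKS with the EBS CSI driver it looks roughly like this (the provisioner name assumes that driver; other clouds use different provisioners):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer   # bind only once a pod is scheduled
allowVolumeExpansion: true

Check what your cluster actually offers with kubectl get storageclass.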

Reliability: Probes & Resource Limits

Probes

containers:
  - name: myapp
    livenessProbe:              # is the container alive? (restarted if it fails)
      httpGet:
        path: /health
        port: 8080
      initialDelaySeconds: 30   # wait before the first check
      periodSeconds: 10
      failureThreshold: 3       # fail 3 times before restart
    readinessProbe:             # is the container ready for traffic? (removed from Service endpoints if it fails)
      httpGet:
        path: /ready
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 5
      failureThreshold: 3
    startupProbe:               # has the container started? (other probes are disabled until it succeeds)
      httpGet:
        path: /health
        port: 8080
      failureThreshold: 30      # allow up to 5 minutes to start (30 * 10s)
      periodSeconds: 10

Probe types:

  • httpGet - HTTP request (2xx = healthy)
  • tcpSocket - TCP connection success
  • exec - run a command (exit 0 = healthy)
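
The manifests above only use httpGet; the other two look like this (pg_isready is just an example command for a Postgres container):

livenessProbe:
  tcpSocket:
    port: 5432                  # healthy if the TCP connection succeeds
  periodSeconds: 10
readinessProbe:
  exec:
    command: ["pg_isready", "-U", "postgres"]   # healthy if the command exits 0
  periodSeconds: 10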

Resource Limits

resources:
  requests:          # what the scheduler reserves on a node
    memory: "256Mi"
    cpu: "100m"      # 100 millicores = 0.1 CPU
  limits:            # hard ceiling (OOMKilled if the memory limit is exceeded; CPU is throttled)
    memory: "512Mi"
    cpu: "1000m"     # 1 full CPU

Danger zones:

  • No limits → one pod can starve others on the same node
  • Limits too low → frequent OOMKills and CPU throttling
  • Requests too high → hard to schedule pods
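
To compare requests and limits against real usage (these commands assume metrics-server is installed):

Terminal window
kubectl top pods -n production # current CPU/memory per pod
kubectl top nodes # per-node usage vs capacity
kubectl describe node my-node | grep -A10 "Allocated resources" # requested vs allocatable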

kubectl: Describe, Logs, Exec

Terminal window
# --- describe (most useful debugging command) ---
kubectl describe pod myapp-abc123 # full status, events, conditions
kubectl describe node my-node # node capacity, allocations
kubectl describe service myapp # endpoints, selectors
kubectl describe ingress myapp # routing rules, TLS
# --- logs ---
kubectl logs myapp-abc123
kubectl logs myapp-abc123 -f # follow
kubectl logs myapp-abc123 --previous # before restart
kubectl logs myapp-abc123 -c sidecar # specific container
kubectl logs -l app=myapp --all-containers=true # all pods with label
kubectl logs myapp-abc123 --since=1h # last 1 hour
kubectl logs myapp-abc123 --tail=100 # last 100 lines
# --- exec ---
kubectl exec -it myapp-abc123 -- /bin/sh
kubectl exec myapp-abc123 -- env # print env vars
kubectl exec myapp-abc123 -- cat /etc/hosts
kubectl exec myapp-abc123 -- curl localhost:8080/health # internal health check
# --- debugging without exec ---
# Ephemeral debug container (K8s 1.23+)
kubectl debug -it myapp-abc123 --image=busybox --target=myapp
# Copy files
kubectl cp myapp-abc123:/app/logs/error.log ./error.log
kubectl cp ./config.yaml myapp-abc123:/tmp/config.yaml

Common Failures

CrashLoopBackOff

Pod keeps crashing and K8s keeps restarting it with increasing delays.

Terminal window
# Check logs from the crashing container
kubectl logs myapp-abc123 --previous
# Check events
kubectl describe pod myapp-abc123 | grep -A20 Events
# Check exit code
kubectl get pod myapp-abc123 -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'

Common causes:

  • Application crashes on startup (check logs for errors)
  • Missing environment variable or secret
  • Missing config file (ConfigMap not mounted correctly)
  • Wrong command/args in deployment spec
  • OOMKilled - increase the memory limit
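
For the last bullet, the terminated reason confirms whether the kernel killed the container (same jsonpath idea as above, narrowed to one field):

Terminal window
kubectl get pod myapp-abc123 \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}' # prints OOMKilled
kubectl get pod myapp-abc123 -o jsonpath='{.status.containerStatuses[0].restartCount}'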

Image Pull Errors

Terminal window
kubectl describe pod myapp-abc123 | grep -A5 "Failed to pull"
# Events:
# Warning Failed ImagePullBackOff
# Warning Failed Failed to pull image "myrepo/myapp:1.2.3": rpc error...

Common causes:

  • Image doesn't exist (wrong tag)
  • Registry requires authentication (create imagePullSecret)
  • Network issue from node to registry
  • Rate limit (Docker Hub limits unauthenticated pulls)
Terminal window
# Create image pull secret for private registry
kubectl create secret docker-registry regcred \
--docker-server=123456789012.dkr.ecr.us-east-1.amazonaws.com \
--docker-username=AWS \
--docker-password=$(aws ecr get-login-password --region us-east-1)
# Reference it under the pod template spec of the deployment
spec:
  imagePullSecrets:
    - name: regcred

Misconfigured Service

Terminal window
# Symptoms: requests timing out, 502/503 from ingress
# 1. Check if service has endpoints
kubectl get endpoints myapp
# If the endpoints list is empty, the service selector doesn't match any pod labels
# 2. Check pod labels vs service selector
kubectl get pods --show-labels | grep myapp
kubectl get service myapp -o yaml | grep -A3 selector
# 3. Test directly to pod
kubectl port-forward pod/myapp-abc123 8080:8080
curl localhost:8080/health
# 4. Test service DNS from inside cluster
kubectl run test --rm -it --image=curlimages/curl -- \
curl http://myapp.default.svc.cluster.local/health