Kubernetes: Operational, Not Theoretical
Cluster architecture, pods, deployments, services, ingress, storage, reliability, and common failures
Why Kubernetes Exists
Docker containers are great, but you need something to:
- Run containers across many servers (cluster)
- Restart them when they crash
- Scale up/down based on load
- Route traffic to healthy instances
- Roll out updates without downtime
- Manage secrets and config
Kubernetes (K8s) does all of this. It's the operating system for your cloud infrastructure.
Current Versions & Upgrade Awareness
Kubernetes releases a new minor version every ~4 months. Each version is supported for about 14 months.
```bash
# Check your cluster version
kubectl version   # note: the old --short flag was removed in newer kubectl releases

# Check node versions
kubectl get nodes -o wide

# Check version skew (client vs server)
# kubectl must be within 1 minor version of the API server
```

Upgrade path: You must upgrade one minor version at a time (e.g., 1.28 → 1.29 → 1.30, not 1.28 → 1.30 directly).

```bash
# Check what's deprecated/removed in a new version
# kubectl deprecations aren't always obvious; use pluto or kubent
kubent   # scan for deprecated API usage before upgrading
```

Cluster Architecture (High-Level)
```
Control Plane:
├── API Server          → all kubectl/API requests go here
├── etcd                → cluster state database (key-value)
├── Scheduler           → assigns pods to nodes
└── Controller Manager  → reconciles desired vs actual state

Worker Nodes:
├── kubelet             → runs on each node, manages pods
├── kube-proxy          → handles service networking
└── Container Runtime (containerd)
```

```bash
# Control plane components
kubectl get pods -n kube-system

# Node status
kubectl get nodes
kubectl describe node my-node

# Cluster info
kubectl cluster-info
```

Pods & Controllers
Pod
The smallest deployable unit in Kubernetes. Usually contains one container (or tightly coupled sidecars).
```yaml
# Bare pod (don't use in production: no restart if node fails)
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  containers:
  - name: nginx
    image: nginx:1.25
    ports:
    - containerPort: 80
    resources:
      requests:
        memory: "64Mi"
        cpu: "100m"
      limits:
        memory: "128Mi"
        cpu: "500m"
```

```bash
# Pod operations
kubectl get pods
kubectl get pods -o wide            # show node and IP
kubectl describe pod nginx
kubectl logs nginx
kubectl logs nginx -f               # follow
kubectl logs nginx --previous       # previous container (after crash)
kubectl exec -it nginx -- /bin/sh   # shell into container
kubectl delete pod nginx
```
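The "tightly coupled sidecars" case looks like this: both containers share the pod's network namespace and can share volumes. A minimal sketch; the image names and mount paths are illustrative, not from this guide's app:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app-with-sidecar
spec:
  volumes:
  - name: logs
    emptyDir: {}                # scratch volume shared by both containers
  containers:
  - name: app
    image: myrepo/myapp:1.2.3
    volumeMounts:
    - name: logs
      mountPath: /var/log/app   # app writes logs here
  - name: log-shipper           # sidecar reads what the app writes
    image: fluent/fluent-bit:2.2
    volumeMounts:
    - name: logs
      mountPath: /var/log/app
      readOnly: true
```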
Deployments vs StatefulSets

| | Deployment | StatefulSet |
|---|---|---|
| Pod identity | Random (nginx-abc123) | Stable (nginx-0, nginx-1) |
| Storage | Shared or stateless | Each pod gets own PVC |
| Scaling | Any order | Ordered (0 first, then 1β¦) |
| Use case | Stateless apps (web, API) | Databases, Kafka, ZooKeeper |
```yaml
# Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: myapp
        version: "1.2.3"
    spec:
      containers:
      - name: myapp
        image: myrepo/myapp:1.2.3
        ports:
        - containerPort: 8080
        env:
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: myapp-secrets
              key: database-url
        resources:
          requests:
            memory: "256Mi"
            cpu: "100m"
          limits:
            memory: "512Mi"
            cpu: "500m"
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
          failureThreshold: 3
```

```bash
# Deployment operations
kubectl apply -f deployment.yaml
kubectl get deployments
kubectl rollout status deployment/myapp
kubectl rollout history deployment/myapp
kubectl rollout undo deployment/myapp
kubectl scale deployment myapp --replicas=5
kubectl set image deployment/myapp myapp=myrepo/myapp:1.2.4
```
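For the StatefulSet column of the table above, a minimal sketch; the Postgres naming and the headless Service it references are illustrative assumptions:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres-headless   # must reference an existing headless Service
  replicas: 3
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
      - name: postgres
        image: postgres:16
        ports:
        - containerPort: 5432
  volumeClaimTemplates:   # each replica gets its own PVC (postgres-data-postgres-0, ...)
  - metadata:
      name: postgres-data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 20Gi
```

Pods come up in order (postgres-0 first) and keep their names and PVCs across restarts.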
Services & Networking

Services provide stable DNS names and load balancing across pod replicas.
```yaml
# ClusterIP (default): internal access only
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp          # routes to pods with this label
  ports:
  - port: 80            # service port (cluster-internal)
    targetPort: 8080    # container port
  type: ClusterIP

---
# NodePort: exposes on every node's IP at a static port
spec:
  type: NodePort
  ports:
  - port: 80
    targetPort: 8080
    nodePort: 30080     # accessible at node-ip:30080

---
# LoadBalancer: provisions a cloud LB (AWS ELB, GCP LB, etc.)
spec:
  type: LoadBalancer
  ports:
  - port: 443
    targetPort: 8080
```

```bash
# Service operations
kubectl get services
kubectl describe service myapp

# Check if service routes to pods correctly
kubectl get endpoints myapp   # should show pod IPs

# Test internal DNS (from inside a pod)
kubectl exec -it debug -- nslookup myapp.production.svc.cluster.local
kubectl exec -it debug -- curl http://myapp.production.svc.cluster.local/health
```

Ingress
Ingress manages external HTTP/HTTPS routing to services. Requires an ingress controller (nginx, traefik, ALB Ingress Controller, etc.).
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp-ingress
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - myapp.example.com
    secretName: myapp-tls
  rules:
  - host: myapp.example.com
    http:
      paths:
      - path: /api
        pathType: Prefix
        backend:
          service:
            name: api-service
            port:
              number: 80
      - path: /
        pathType: Prefix
        backend:
          service:
            name: frontend-service
            port:
              number: 80
```

ConfigMaps & Secrets
```yaml
# ConfigMap: non-sensitive configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: myapp-config
data:
  APP_ENV: "production"
  LOG_LEVEL: "info"
  config.yaml: |
    server:
      port: 8080
      timeout: 30s

---
# Secret: sensitive data (base64 encoded, not encrypted by default)
apiVersion: v1
kind: Secret
metadata:
  name: myapp-secrets
type: Opaque
stringData:   # auto-encodes to base64
  database-url: "postgres://user:pass@db:5432/mydb"
  api-key: "sk-live-abc123"
```

```bash
# ConfigMap
kubectl create configmap myapp-config --from-env-file=.env
kubectl get configmap myapp-config -o yaml
kubectl describe configmap myapp-config

# Secret
kubectl create secret generic myapp-secrets \
  --from-literal=database-url=postgres://user:pass@db:5432/mydb \
  --from-file=tls.crt

kubectl get secret myapp-secrets -o jsonpath='{.data.database-url}' | base64 -d
```

Production secrets: Don't store real secrets in YAML. Use:
- External Secrets Operator (syncs from AWS Secrets Manager; see the sketch after this list)
- Sealed Secrets (encrypted in git)
- Vault Agent Injector
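For the first option, a hedged ExternalSecret sketch; it assumes the External Secrets Operator is installed, and the SecretStore name and remote key here are hypothetical:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: myapp-secrets
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager    # assumed, pre-existing SecretStore
    kind: SecretStore
  target:
    name: myapp-secrets          # the K8s Secret the operator creates and syncs
  data:
  - secretKey: database-url      # key inside the resulting K8s Secret
    remoteRef:
      key: prod/myapp/database-url   # assumed name in AWS Secrets Manager
```

The real secret value never lives in git; only this pointer does.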
Storage: PVCs
```yaml
# PersistentVolumeClaim: requests storage
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data
spec:
  accessModes:
  - ReadWriteOnce           # single node read-write
  storageClassName: gp3
  resources:
    requests:
      storage: 20Gi

---
# Use in pod
spec:
  containers:
  - name: postgres
    image: postgres:16
    volumeMounts:
    - name: data
      mountPath: /var/lib/postgresql/data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: postgres-data
```

Access modes:
| Mode | Description | Use case |
|---|---|---|
| ReadWriteOnce | One node, read-write | Databases |
| ReadOnlyMany | Many nodes, read-only | Shared config |
| ReadWriteMany | Many nodes, read-write | Shared file storage (NFS, EFS) |
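Operationally, verify the claim actually binds before blaming the app; a quick check sequence:

```bash
# STATUS should be Bound; Pending usually means no matching StorageClass or PV
kubectl get pvc postgres-data
kubectl describe pvc postgres-data   # events explain why a claim is stuck Pending
kubectl get storageclass             # confirm the requested class (gp3 here) exists
```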
Reliability: Probes & Resource Limits
Probes
```yaml
containers:
- name: myapp
  livenessProbe:              # is the container alive? (restart if fails)
    httpGet:
      path: /health
      port: 8080
    initialDelaySeconds: 30   # wait before first check
    periodSeconds: 10
    failureThreshold: 3       # fail 3 times before restart

  readinessProbe:             # is the container ready for traffic? (remove from LB if fails)
    httpGet:
      path: /ready
      port: 8080
    initialDelaySeconds: 5
    periodSeconds: 5
    failureThreshold: 3

  startupProbe:               # is the container started? (replaces liveness during startup)
    httpGet:
      path: /health
      port: 8080
    failureThreshold: 30      # allow up to 5 minutes to start (30 * 10s)
    periodSeconds: 10
```

Probe types:

- `httpGet` → HTTP request (2xx = healthy)
- `tcpSocket` → TCP connection success
- `exec` → run command (exit 0 = healthy)
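The examples above all use `httpGet`; minimal sketches of the other two types, with an illustrative Postgres port and command:

```yaml
livenessProbe:
  tcpSocket:
    port: 5432        # healthy if the TCP connection succeeds
readinessProbe:
  exec:
    command: ["pg_isready", "-U", "postgres"]   # healthy if exit code is 0
```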
Resource Limits
```yaml
resources:
  requests:           # what K8s reserves for scheduling
    memory: "256Mi"
    cpu: "100m"       # 100 millicores = 0.1 CPU
  limits:             # hard ceiling (OOMKill if the memory limit is exceeded)
    memory: "512Mi"
    cpu: "1000m"      # 1 full CPU
```

Danger zones:

- No limits → one pod can starve others on the same node
- Limits too low → frequent OOMKills and CPU throttling
- Requests too high → pods become hard to schedule
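To see how actual consumption compares to these numbers (`kubectl top` requires the metrics-server addon):

```bash
# Live usage per pod / node
kubectl top pods
kubectl top nodes

# Was the last restart an OOMKill?
kubectl get pod myapp-abc123 \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
```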
kubectl: Describe, Logs, Exec
```bash
# --- describe (most useful debugging command) ---
kubectl describe pod myapp-abc123   # full status, events, conditions
kubectl describe node my-node       # node capacity, allocations
kubectl describe service myapp      # endpoints, selectors
kubectl describe ingress myapp      # routing rules, TLS

# --- logs ---
kubectl logs myapp-abc123
kubectl logs myapp-abc123 -f                      # follow
kubectl logs myapp-abc123 --previous              # before restart
kubectl logs myapp-abc123 -c sidecar              # specific container
kubectl logs -l app=myapp --all-containers=true   # all pods with label
kubectl logs myapp-abc123 --since=1h              # last 1 hour
kubectl logs myapp-abc123 --tail=100              # last 100 lines

# --- exec ---
kubectl exec -it myapp-abc123 -- /bin/sh
kubectl exec myapp-abc123 -- env                          # print env vars
kubectl exec myapp-abc123 -- cat /etc/hosts
kubectl exec myapp-abc123 -- curl localhost:8080/health   # internal health check

# --- debugging without exec ---
# Ephemeral debug container (K8s 1.23+)
kubectl debug -it myapp-abc123 --image=busybox --target=myapp

# Copy files
kubectl cp myapp-abc123:/app/logs/error.log ./error.log
kubectl cp ./config.yaml myapp-abc123:/tmp/config.yaml
```

Common Failures
CrashLoopBackOff
Pod keeps crashing and K8s keeps restarting it with increasing delays.
```bash
# Check logs from the crashing container
kubectl logs myapp-abc123 --previous

# Check events
kubectl describe pod myapp-abc123 | grep -A20 Events

# Check exit code
kubectl get pod myapp-abc123 -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'
```

Common causes:
- Application crashes on startup (check logs for errors)
- Missing environment variable or secret
- Missing config file (ConfigMap not mounted correctly)
- Wrong `command`/`args` in deployment spec
- OOMKill → increase memory limits
Image Pull Errors
```bash
kubectl describe pod myapp-abc123 | grep -A5 "Failed to pull"
# Events:
#   Warning  Failed   ImagePullBackOff
#   Warning  Failed   Failed to pull image "myrepo/myapp:1.2.3": rpc error...
```

Common causes:
- Image doesn't exist (wrong tag)
- Registry requires authentication (create imagePullSecret)
- Network issue from node to registry
- Rate limit (Docker Hub limits unauthenticated pulls)
```bash
# Create image pull secret for private registry
kubectl create secret docker-registry regcred \
  --docker-server=123456789012.dkr.ecr.us-east-1.amazonaws.com \
  --docker-username=AWS \
  --docker-password=$(aws ecr get-login-password --region us-east-1)
```

```yaml
# Reference in deployment
spec:
  imagePullSecrets:
  - name: regcred
```

Misconfigured Service
```bash
# Symptoms: requests timing out, 502/503 from ingress

# 1. Check if service has endpoints
kubectl get endpoints myapp
# If endpoints is empty, labels don't match

# 2. Check pod labels vs service selector
kubectl get pods --show-labels | grep myapp
kubectl get service myapp -o yaml | grep -A3 selector

# 3. Test directly against the pod
kubectl port-forward pod/myapp-abc123 8080:8080
curl localhost:8080/health

# 4. Test service DNS from inside the cluster
kubectl run test --rm -it --image=curlimages/curl -- \
  curl http://myapp.default.svc.cluster.local/health
```