Linux: Thinking Like a System Engineer

File systems, processes, permissions, systemd, SSH, and the failures you'll actually face


Why Linux First?

Almost every server, container, CI runner, and cloud VM runs Linux. If you can’t navigate a broken Linux system at 2 a.m., you’re not ready for on-call. This isn’t about memorizing commands; it’s about building a mental model of how the OS works.


File System Hierarchy

Linux organizes everything under a single root /. Understanding what lives where prevents a lot of confusion.

| Path | Purpose |
|------|---------|
| / | Root of the entire file system |
| /etc | System-wide configuration files |
| /var | Variable data: logs, spool files, databases |
| /tmp | Temporary files, typically cleared on reboot |
| /home | User home directories |
| /usr | Installed programs, libraries, and shared data |
| /proc | Virtual filesystem: live kernel and process data |
| /sys | Virtual filesystem: hardware and driver info |
| /dev | Device files (disks, terminals, null) |

# Browse the filesystem hierarchy
ls /
ls /etc | head -20
ls /var/log
# Check disk usage per directory
du -sh /* 2>/dev/null | sort -h

Every file in Linux has an inode: a data structure storing metadata (permissions, timestamps, owner, size, pointers to data blocks). The inode does not store the name; a filename is just a directory entry that points to an inode.

# See inode number
ls -i file.txt
# Check inode usage on a filesystem
df -i

| | Hard Link | Symbolic Link (symlink) |
|---|---|---|
| Points to | Inode directly | Another path |
| Works across filesystems? | No | Yes |
| Survives original deletion? | Yes | No (dangling link) |
| Works on directories? | No (usually) | Yes |

# Create a hard link
ln original.txt hardlink.txt
# Create a symbolic link
ln -s /etc/nginx/nginx.conf nginx.conf
# Find all broken symlinks
find /etc -type l ! -exec test -e {} \; -print
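
The rows of the comparison table can be checked in a scratch directory. This is a minimal sketch, assuming GNU stat (the -c format flag); the file names are placeholders.

```shell
# Work in a throwaway directory so nothing real is touched
demo_dir=$(mktemp -d)
echo "hello" > "$demo_dir/original.txt"

ln "$demo_dir/original.txt" "$demo_dir/hard.txt"   # second name for the same inode
ln -s original.txt "$demo_dir/soft.txt"            # new file that stores the path "original.txt"

# Both names report the same inode number, and the link count is now 2
inode_original=$(stat -c %i "$demo_dir/original.txt")
inode_hard=$(stat -c %i "$demo_dir/hard.txt")
link_count=$(stat -c %h "$demo_dir/original.txt")

# Remove the original name; the inode survives because hard.txt still references it
rm "$demo_dir/original.txt"
hard_content=$(cat "$demo_dir/hard.txt")

# The symlink itself still exists, but the path it stores no longer resolves
if [ -e "$demo_dir/soft.txt" ]; then symlink_state="ok"; else symlink_state="dangling"; fi

echo "inodes: $inode_original / $inode_hard, link count: $link_count"
echo "hard link reads: $hard_content, symlink is: $symlink_state"
rm -rf "$demo_dir"
```

The data is only freed once the last hard link is removed and no process still has the file open.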

Users, Groups, Permissions, umask

Permission Model

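-rwxr-xr-- 1 alice devs 4096 Jan 1 file.txt
|\_/\_/\_/
| |  |  +-- other: r-- (read only)
| |  +----- group (devs): r-x (read, execute)
| +-------- user (alice): rwx (read, write, execute)
+---------- file type: - regular file, d directory, l symlink

Each of the three sections combines read (r=4), write (w=2), and execute (x=1). The octal digit is the sum of the bits that are set, so rwxr-xr-- is 754.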

# Change permissions
chmod 755 script.sh # rwxr-xr-x
chmod u+x,g-w script.sh # symbolic mode
# Change ownership
chown alice:devs file.txt
chown -R alice:devs /var/www/
# View effective permissions
stat file.txt
namei -l /path/to/file # trace permissions along path
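
To see how octal and symbolic modes line up, the sketch below sets a mode on a scratch file and reads it back both ways. It assumes GNU stat; the file is a temporary placeholder.

```shell
perm_file=$(mktemp)

# 754 = user rwx (4+2+1), group r-x (4+1), other r-- (4)
chmod 754 "$perm_file"
octal_mode=$(stat -c %a "$perm_file")
symbolic_mode=$(stat -c %A "$perm_file")

# Symbolic mode edits individual bits: drop execute for group, read for other
chmod g-x,o-r "$perm_file"
octal_after=$(stat -c %a "$perm_file")

echo "$octal_mode $symbolic_mode -> $octal_after"
rm -f "$perm_file"
```

Octal mode replaces all nine bits at once, while symbolic mode changes only the bits you name, which is safer when you do not know the current mode.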

umask

umask is a bit mask: every permission bit set in the umask is cleared from the default mode of newly created files (0666) and directories (0777).

# Check current umask
umask # e.g. 0022
# New file permissions = 0666 & ~0022 = 0644 (rw-r--r--)
# New dir permissions = 0777 & ~0022 = 0755 (rwxr-xr-x)
# Set umask in /etc/profile or ~/.bashrc
umask 027 # more restrictive: group can read, others get nothing
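
The effect is easy to verify in a subshell, which keeps the changed umask from leaking into your session. A small sketch, assuming GNU stat; the last case shows why the rule is bit masking and not arithmetic subtraction.

```shell
umask_dir=$(mktemp -d)

# Subshell: the umask change ends when the parentheses close
(
  umask 027
  touch "$umask_dir/file"
  mkdir "$umask_dir/dir"
)
file_mode=$(stat -c %a "$umask_dir/file")   # 666 with the 027 bits cleared
dir_mode=$(stat -c %a "$umask_dir/dir")     # 777 with the 027 bits cleared

# With umask 033, subtraction would predict 633; masking gives 644
(umask 033; touch "$umask_dir/odd")
odd_mode=$(stat -c %a "$umask_dir/odd")

echo "file=$file_mode dir=$dir_mode odd=$odd_mode"
rm -rf "$umask_dir"
```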

sudo Internals & Security Implications

sudo doesn’t just run a command as root: it authenticates through PAM, checks /etc/sudoers (and /etc/sudoers.d/) for authorization, and logs each invocation to syslog or the journal.

# Edit sudoers safely (validates syntax before saving)
visudo
# Sudoers entry (a line in the sudoers file, not a shell command):
# allow alice to restart nginx without a password
alice ALL=(ALL) NOPASSWD: /usr/bin/systemctl restart nginx
# Check what you're allowed to run
sudo -l
# Run as a different user (not root)
sudo -u www-data id

Security gotchas:

  • NOPASSWD: ALL is almost always wrong; scope it to specific commands
  • sudo bash or sudo -i gives a full root shell; avoid them in scripts
  • Check /var/log/auth.log (Debian/Ubuntu), /var/log/secure (RHEL family), or journalctl _COMM=sudo for sudo abuse
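
One way to keep a rule scoped is to write it as a separate drop-in and syntax-check it before installing. The sketch below uses a temporary file as a stand-in; the target name /etc/sudoers.d/nginx-restart is a hypothetical example, and the check is skipped if visudo is not installed.

```shell
# Hypothetical drop-in; on a real host this would be installed as /etc/sudoers.d/nginx-restart
rule_file=$(mktemp)
cat > "$rule_file" <<'EOF'
# alice may restart nginx, and nothing else, without a password
alice ALL=(ALL) NOPASSWD: /usr/bin/systemctl restart nginx
EOF

# Check syntax only (-c) against this file (-f) without touching the live sudoers
if command -v visudo >/dev/null 2>&1; then
  visudo -cf "$rule_file" && echo "syntax OK"
fi

rule_count=$(grep -c 'NOPASSWD' "$rule_file")
echo "rules granting passwordless access: $rule_count"
rm -f "$rule_file"
```

Counting NOPASSWD lines is a crude audit, but it makes a broad grant stand out in review.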

Process Lifecycle: fork, exec, signals

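Every process except PID 1 (which the kernel starts) is created by an existing process: the parent calls fork() to clone itself, and the child typically calls exec() to replace its image with a new program.

init/systemd (PID 1)
└─ bash (forked from its parent)
   └─ ls (bash forks a child; exec replaces the child's bash image with ls)

# View process tree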
pstree -p
ps auxf
# Find process by name
pgrep -a nginx
pidof nginx
# Send signals
kill -15 1234 # SIGTERM: graceful shutdown
kill -9 1234 # SIGKILL: force kill (no cleanup)
kill -1 1234 # SIGHUP: reload config (for many daemons)
kill -0 1234 # check if process exists (no signal sent)
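
The difference between fork and exec shows up in the PIDs. A short sketch using only sh: a forked child gets a new PID, while exec keeps the PID and swaps the program.

```shell
# fork: a child shell gets its own PID, distinct from the parent's
parent_pid=$$
child_pid=$(sh -c 'echo $$')

# exec: the PID stays the same while the program image is replaced.
# The outer sh prints its PID, then execs a second sh that prints its own.
pid_pair=$(sh -c 'echo $$; exec sh -c "echo \$\$"')
pid_before_exec=$(printf '%s\n' "$pid_pair" | sed -n 1p)
pid_after_exec=$(printf '%s\n' "$pid_pair" | sed -n 2p)

echo "parent=$parent_pid child=$child_pid"
echo "before exec=$pid_before_exec after exec=$pid_after_exec"
```

This is why a service wrapper script that ends with exec leaves systemd tracking the real application and not a leftover shell.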

Common signals:

| Signal | Number | Meaning |
|---|---|---|
| SIGTERM | 15 | Graceful termination |
| SIGKILL | 9 | Force kill (can’t be caught) |
| SIGHUP | 1 | Reload config |
| SIGINT | 2 | Interrupt (Ctrl+C) |
| SIGSTOP | 19 | Pause (can’t be caught) |
| SIGCONT | 18 | Resume |
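
"Can’t be caught" is worth seeing once. In this sketch a background subshell stands in for a daemon with a SIGTERM handler; the handler runs on SIGTERM and never runs on SIGKILL. It takes a few seconds because the handler only fires between sleep calls.

```shell
sig_log=$(mktemp)

# Stand-in "daemon": handles SIGTERM by cleaning up, then exits 0
(
  trap 'echo "cleaned up" > "$sig_log"; exit 0' TERM
  while :; do sleep 1; done
) &
graceful_pid=$!
sleep 1                                   # let the child install its handler
kill -TERM "$graceful_pid"                # same as kill -15
term_status=0
wait "$graceful_pid" || term_status=$?
term_log=$(cat "$sig_log")

# Same stand-in, but SIGKILL cannot be trapped, so the handler never runs
: > "$sig_log"
(
  trap 'echo "cleaned up" > "$sig_log"; exit 0' TERM
  while :; do sleep 1; done
) &
forced_pid=$!
sleep 1
kill -KILL "$forced_pid"                  # same as kill -9
kill_status=0
wait "$forced_pid" 2>/dev/null || kill_status=$?   # 137 = 128 + 9
kill_log=$(cat "$sig_log")

echo "SIGTERM: exit=$term_status log='$term_log'"
echo "SIGKILL: exit=$kill_status log='$kill_log'"
rm -f "$sig_log"
```

Send SIGTERM first and give the process time to exit; reach for SIGKILL only when it does not respond.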

systemd

systemd is PID 1 on most modern Linux distros. It manages services, mounts, timers, sockets, and boot targets.

Units & Targets

# List all running services
systemctl list-units --type=service --state=running
# Start / stop / restart / reload
systemctl start nginx
systemctl stop nginx
systemctl restart nginx
systemctl reload nginx # graceful config reload
# Enable at boot / disable
systemctl enable nginx
systemctl disable nginx
# Check status with logs
systemctl status nginx -l
# Show dependencies
systemctl list-dependencies nginx

Service Unit File Anatomy

/etc/systemd/system/myapp.service
[Unit]
# Human-readable name in status listings
Description=Python app (example)
# Start after the network is reachable (optional; drop Wants= if you do not need it)
After=network-online.target
Wants=network-online.target
# Uncomment if this app must wait for Postgres
# Requires=postgresql.service
# After=postgresql.service
[Service]
Type=simple
User=myapp
Group=myapp
WorkingDirectory=/opt/myapp
# Ensure logs (or runtime files) exist before the process starts; must exit 0 or the unit fails
ExecStartPre=/bin/mkdir -p /opt/myapp/logs
# Main process: use the venv’s Python so deps match production (adjust script/module as needed)
ExecStart=/opt/myapp/venv/bin/python /opt/myapp/app.py
# Runs once the main command has been invoked (not after the app exits); keep this quick
ExecStartPost=/bin/sh -c 'echo "$(date -Is) myapp started" >> /opt/myapp/logs/boot.log'
Restart=on-failure
RestartSec=5s
Environment=PYTHONUNBUFFERED=1
Environment=PORT=8080
[Install]
WantedBy=multi-user.target

# After editing unit files
systemctl daemon-reload
systemctl restart myapp

Service Failures & Restart Loops

# Check why a service failed
systemctl status myapp
journalctl -u myapp -n 50 --no-pager
# Check restart count
systemctl show myapp --property=NRestarts
# Reset failure counter
systemctl reset-failed myapp

Debugging a restart loop:

  1. systemctl status myapp: look at the exit code
  2. journalctl -u myapp -f: tail live logs
  3. Run the ExecStart command manually as the service user
  4. Check file permissions, missing env vars, port conflicts
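
For step 1, the status= number in systemctl status often points straight at the cause. The helper below is a hypothetical lookup, not part of systemd: the 200-range values are exit statuses systemd itself reports when it cannot set up the process, and 126/127 are the usual shell conventions. Treat the hints as starting points.

```shell
# Map a status= value from `systemctl status` to a likely cause
explain_exit_status() {
  case "$1" in
    200) echo "CHDIR: WorkingDirectory= is missing or not accessible" ;;
    203) echo "EXEC: ExecStart= binary is missing or not executable" ;;
    216) echo "GROUP: the Group= setting could not be applied" ;;
    217) echo "USER: the User= account does not exist" ;;
    126) echo "command found but not executable (check permissions)" ;;
    127) echo "command not found (check PATH and absolute paths)" ;;
    *)   echo "application-defined: read journalctl -u <unit>" ;;
  esac
}

explain_exit_status 203
explain_exit_status 217
explain_exit_status 1
```

A 203 or 217 means the application never started, so its own logs will be empty and the journal entry from systemd is the one to read.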

Logs

journalctl Usage

# All logs, newest first
journalctl -r
# Follow live (like tail -f)
journalctl -f
# Since last boot
journalctl -b
# Specific service
journalctl -u nginx -n 100
# Time range
journalctl --since "2024-01-01 10:00" --until "2024-01-01 11:00"
# Priority levels (err and above)
journalctl -p err -b
# Show kernel messages only
journalctl -k
# Disk usage of journal
journalctl --disk-usage
# Vacuum old logs
journalctl --vacuum-size=500M

App Logs vs System Logs

| Type | Location | Tool |
|---|---|---|
| System/kernel | journald | journalctl |
| Nginx access | /var/log/nginx/access.log | tail, grep, awk |
| App custom | /var/log/myapp/ or stdout | depends on app |
| Auth events | /var/log/auth.log | grep, journalctl |
| Cron | /var/log/cron or journal | journalctl -u cron |

Disk

Mount Points & fstab

# Show all mounted filesystems
df -h
lsblk
mount | column -t
# Check /etc/fstab (filesystems to mount at boot)
cat /etc/fstab
# Mount manually
mount /dev/sdb1 /mnt/data
# Mount with options
mount -o ro,noexec /dev/sdc1 /mnt/backup

fstab entry structure:

DEVICE       MOUNTPOINT  TYPE  OPTIONS          DUMP  PASS
UUID=abc123  /           ext4  defaults         0     1
/dev/sdb1    /data       xfs   defaults,nofail  0     2
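
The six fields are easier to remember once you read them programmatically. This sketch splits an entry with read and spells out what PASS and nofail mean; describe_fstab_entry is a hypothetical helper, and the "emergency mode" wording assumes a systemd-based distro.

```shell
describe_fstab_entry() {
  # read splits the line on whitespace into the six fstab fields
  printf '%s\n' "$1" | {
    read -r device mountpoint fstype options dump pass
    case ",$options," in
      *,nofail,*) on_missing="boot continues" ;;
      *)          on_missing="boot may stop in emergency mode" ;;
    esac
    case "${pass:-0}" in
      0) fsck="never checked" ;;
      1) fsck="checked first (root)" ;;
      *) fsck="checked after root" ;;
    esac
    echo "$mountpoint ($fstype on $device): fsck $fsck; if device is missing, $on_missing"
  }
}

describe_fstab_entry "UUID=abc123 / ext4 defaults 0 1"
describe_fstab_entry "/dev/sdb1 /data xfs defaults,nofail 0 2"
```

For any non-root data disk, nofail is usually what keeps a missing volume from blocking boot.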

Disk Pressure & Inode Exhaustion

# Check disk space
df -h
# Check inode usage: can be full even when space is free!
df -i
# Find what's eating space
du -sh /var/* 2>/dev/null | sort -rh | head -20
ncdu /var # interactive (install with apt/yum)
# Find files larger than 100MB
find / -type f -size +100M 2>/dev/null
# Find directories with most files (inode issue)
find /tmp -type d -exec sh -c 'echo "$(ls -A "$1" | wc -l) $1"' _ {} \; | sort -rn | head

Memory & CPU

# Memory overview
free -h
vmstat 1 5 # 5 samples, 1 second apart
# Detailed memory stats
cat /proc/meminfo
# Top memory consumers
ps aux --sort=-%mem | head -10
# CPU load vs CPU usage
uptime # load average: 1, 5, 15 min
top # interactive
htop # better interactive
# Per-CPU stats
mpstat -P ALL 1 # (from sysstat package)

CPU Load vs CPU Usage:

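  • CPU usage = percentage of time the CPU is busy (0-100% per core)
  • CPU load average = average number of processes running or waiting for CPU; on Linux it also counts processes in uninterruptible sleep (usually waiting on disk I/O)
  • On a 4-core system, a load of 4.0 means the cores are fully occupied; a load of 8.0 means about twice as much runnable work as the cores can handle (overloaded)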
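
Because load only means something relative to core count, it helps to normalize it. A minimal sketch: load_per_core is a hypothetical helper, and the live reading at the end assumes /proc/loadavg and nproc are available.

```shell
# Load average divided by core count; values above 1.00 mean work is queuing
load_per_core() {
  awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

load_per_core 4.0 4    # cores fully occupied
load_per_core 8.0 4    # about twice as much work as the cores can run

# Live value on a Linux host (first field of /proc/loadavg is the 1-minute average)
if [ -r /proc/loadavg ] && command -v nproc >/dev/null 2>&1; then
  read -r load_1min rest < /proc/loadavg
  echo "1-minute load per core: $(load_per_core "$load_1min" "$(nproc)")"
fi
```

Remember that a high value can come from processes stuck on I/O, so check CPU usage before concluding the cores are the problem.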

SSH

# Generate key pair
ssh-keygen -t ed25519 -C "your-email@example.com"
# Copy public key to server
ssh-copy-id -i ~/.ssh/id_ed25519.pub user@server
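# Manual copy (if ssh-copy-id unavailable); sshd ignores keys if these permissions are too open
cat ~/.ssh/id_ed25519.pub | ssh user@server "mkdir -p ~/.ssh && chmod 700 ~/.ssh && cat >> ~/.ssh/authorized_keys && chmod 600 ~/.ssh/authorized_keys"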
# Connect with specific key
ssh -i ~/.ssh/id_ed25519 user@server
# Connect with verbose output (debugging)
ssh -v user@server

Port Forwarding

# Local forwarding: access a remote service locally
# Access server's port 5432 (postgres) as localhost:5432
ssh -L 5432:localhost:5432 user@server
# Remote forwarding: expose a local port on the remote server
# Expose local :8080 as server's :9090
ssh -R 9090:localhost:8080 user@server
# Dynamic forwarding (SOCKS proxy)
ssh -D 1080 user@server
# Jump host / bastion
ssh -J bastion-user@bastion.example.com target-user@private-server

~/.ssh/config for convenience:

Host myserver
    HostName 1.2.3.4
    User ubuntu
    IdentityFile ~/.ssh/id_ed25519
    Port 22

Host private-db
    HostName 10.0.0.5
    User admin
    ProxyJump myserver

Common Failures You’ll Actually Face

Service Not Starting

systemctl status myapp
journalctl -u myapp --since "5 minutes ago"
# Run ExecStart command manually
# Check: ports in use, missing files, wrong user
ss -tlnp | grep :8080

Permission Denied

# Who owns the file?
ls -la /path/to/file
# What's the process running as? (the [m] keeps grep from matching itself)
ps -eo user,pid,args | grep '[m]yapp'
# Trace permission along path
namei -l /path/to/file
# Check SELinux/AppArmor if permissions look fine
getenforce # SELinux
aa-status # AppArmor
ausearch -m avc -ts recent # SELinux audit log

Disk Full

df -h # which filesystem is full?
df -i # check inodes too
# Find large files
du -sh /var/* | sort -rh | head
find /var/log -name "*.log" -size +100M
# Quick fixes
journalctl --vacuum-size=100M
truncate -s 0 /var/log/huge.log # zero out (don't delete if open)
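
The "don't delete if open" advice is easy to demonstrate. In this sketch, file descriptor 9 stands in for a daemon holding its log open in append mode; it assumes Linux /proc plus GNU stat and truncate.

```shell
held_log=$(mktemp)
exec 9>>"$held_log"                 # stand-in for a daemon holding its log open (append mode)
echo "old data" >&9
size_before=$(stat -c %s "$held_log")

# truncate frees the blocks immediately, and the writer keeps working
truncate -s 0 "$held_log"
size_truncated=$(stat -c %s "$held_log")
echo "new" >&9
size_after_write=$(stat -c %s "$held_log")

# rm only removes the name; the inode stays allocated until fd 9 is closed
rm "$held_log"
fd_target=$(readlink /proc/self/fd/9)   # path now ends in "(deleted)"
exec 9>&-                               # closing the descriptor finally frees the space

echo "$size_before -> $size_truncated -> $size_after_write bytes"
echo "after rm, fd 9 pointed at: $fd_target"
```

If df still shows a full disk after you deleted a large file, a process is probably holding it open; lsof +L1 lists such deleted-but-open files, and restarting the process releases the space.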

High Load

uptime # load average
top # interactive; press 1 for per-CPU
ps aux --sort=-%cpu | head -10
# Is it CPU-bound or I/O-bound?
vmstat 1 5
# procs r column = processes running or waiting for CPU
# cpu wa column = percentage of CPU time spent waiting on I/O
# High wa + low us/sy = I/O bottleneck
# High us/sy = CPU bottleneck
iostat -x 1 5 # per-disk I/O stats
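
The vmstat rule of thumb can be written down as a small decision function. This is only a sketch: classify_bottleneck is a hypothetical helper, and the 20% and 80% thresholds are illustrative assumptions, not tuned values.

```shell
# Classify one vmstat sample from its cpu columns (us, sy, wa are whole percentages).
# Thresholds are illustrative assumptions; adjust them for your workload.
classify_bottleneck() {
  us=$1
  sy=$2
  wa=$3
  busy=$((us + sy))
  if [ "$wa" -ge 20 ] && [ "$wa" -gt "$busy" ]; then
    echo "io-bound"
  elif [ "$busy" -ge 80 ]; then
    echo "cpu-bound"
  else
    echo "inconclusive"
  fi
}

classify_bottleneck 5 3 60     # mostly waiting on disk
classify_bottleneck 85 10 1    # cores saturated by computation
classify_bottleneck 20 5 2     # neither pattern is clear
```

An "inconclusive" result with a high load average is a cue to look elsewhere: memory pressure, lock contention, or a burst that the sample missed.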