Table of Contents
- Understanding Performance Metrics: What to Measure
  - Key Metrics: CPU, Memory, Disk I/O, Network
  - Essential Monitoring Tools
- CPU Tuning: Maximizing Processing Efficiency
  - Scheduler Optimization (CFS, Real-Time Scheduling)
  - CPU Affinity and NUMA Awareness
  - Hyper-Threading and Interrupt Handling
- Memory Tuning: Reducing Latency and Waste
  - Page Cache and Swap Management
  - Huge Pages and Transparent Huge Pages (THP)
  - OOM Killer and Memory Overcommitment
- Disk I/O Tuning: Speeding Up Storage Access
  - I/O Schedulers (CFQ, Deadline, NOOP)
  - Filesystem Optimization (Ext4, XFS, Btrfs)
  - Block Device Tuning (Read-Ahead, Queue Depth)
- Network Tuning: Enhancing Throughput and Reliability
  - TCP/IP Stack Optimization
  - Interrupt Coalescing and Offloading
  - Congestion Control Algorithms
- Kernel and System Configuration
  - Kernel Parameters via sysctl
  - Boot-Time Tuning (GRUB)
  - Systemd Resource Management
- Advanced Monitoring and Profiling
  - perf: CPU and Function-Level Profiling
  - bpftrace: eBPF-Powered Tracing
  - Long-Term Monitoring with Prometheus/Grafana
- Case Studies and Best Practices
  - Web Server Tuning (Nginx/Apache)
  - Database Optimization (PostgreSQL/MySQL)
  - High-Performance Computing (HPC) Workloads
1. Understanding Performance Metrics: What to Measure
Before tuning, you must measure—blind optimization is risky and often counterproductive. Focus on metrics that reflect bottlenecks in CPU, memory, disk I/O, or network.
Key Performance Metrics
| Resource | Critical Metrics | What They Indicate |
|---|---|---|
| CPU | Usage (%user, %system, %idle), load average, context switches, runqueue length | CPU saturation, user vs. kernel time, process contention. |
| Memory | Used/available RAM, swap usage, page faults (major/minor), page cache hit ratio | Memory leaks, excessive swapping, inefficient cache usage. |
| Disk I/O | Throughput (MB/s), IOPS, latency (read/write), queue depth, %util | Slow storage, I/O saturation, misconfigured schedulers. |
| Network | Bandwidth (TX/RX), packet loss, latency (RTT), TCP retransmissions, socket backlogs | Network congestion, misconfigured TCP settings, hardware bottlenecks. |
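Load average, from the CPU row above, only means something relative to core count: a load of 8 saturates a 2-core box but leaves a 32-core one mostly idle. A minimal sketch of that normalization using the standard `/proc/loadavg` interface:

```sh
#!/bin/sh
# Read the 1-minute load average and normalize by core count.
# A per-core load consistently above ~1.0 suggests CPU saturation.
read load1 _ < /proc/loadavg
cores=$(nproc)
awk -v l="$load1" -v c="$cores" 'BEGIN { printf "load/core: %.2f\n", l / c }'
```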
Essential Monitoring Tools
- `top`/`htop`: Real-time CPU/memory usage, process activity.
  - Example: `htop` shows per-core CPU usage and a memory breakdown.
- `vmstat`: System-wide metrics (processes, memory, swap, I/O, CPU).
  - Example: `vmstat 5` (refresh every 5 seconds) reveals trends in page faults or I/O waits.
- `iostat`: Disk I/O details (throughput, IOPS, latency).
  - Example: `iostat -x 5` (extended stats) highlights slow disks (%util > 90% = saturation).
- `sar`: Historical performance data (CPU, memory, network, disk).
  - Example: `sar -u 5 10` (CPU usage every 5s for 10 samples).
- `ss`/`netstat`: Network socket stats (connections, backlogs, TCP states).
  - Example: `ss -ti` (TCP sockets with timer info) identifies stuck connections.
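All of these tools ultimately read kernel counters from `/proc`. As a sketch of what `vmstat` computes each interval, the snippet below samples the aggregate `cpu` line of `/proc/stat` twice and derives overall busy% (field layout per proc(5): user, nice, system, idle, iowait, irq, softirq, steal):

```sh
#!/bin/sh
# Compute overall CPU busy% over a 1-second window from /proc/stat.
snap() { awk '/^cpu /{ print $2+$3+$4+$5+$6+$7+$8+$9, $5+$6 }' /proc/stat; }
set -- $(snap); t1=$1 i1=$2        # total jiffies, idle+iowait jiffies
sleep 1
set -- $(snap); t2=$1 i2=$2
dt=$(( t2 - t1 ))
echo "CPU busy: $(( 100 * (dt - (i2 - i1)) / dt ))%"
```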
2. CPU Tuning: Maximizing Processing Efficiency
CPU bottlenecks often manifest as high load averages, long runqueues, or processes stuck in R (running) state. Advanced tuning focuses on scheduler behavior, core allocation, and reducing overhead.
Scheduler Optimization
Linux uses the Completely Fair Scheduler (CFS) by default, which balances CPU time across processes. For latency-sensitive workloads (e.g., real-time apps), consider:
- Real-Time Schedulers: Use `SCHED_FIFO` or `SCHED_RR` for critical tasks (via `chrt`).
  - Example: `chrt -f 99 ./realtime-app` (run `realtime-app` under the FIFO scheduler at priority 99).
- CFS Tuning: Adjust `sched_min_granularity_ns` (minimum time a task runs) or `sched_latency_ns` (target latency for all runnable tasks).
  - Smaller `sched_min_granularity_ns` improves interactivity; larger values reduce context-switch overhead.
CPU Affinity and NUMA Awareness
- CPU Affinity: Pin processes to specific cores to reduce cache misses (e.g., with `taskset`).
  - Example: `taskset -c 0,1 ./database` (run the database on cores 0 and 1).
- NUMA (Non-Uniform Memory Access): On multi-socket systems, memory access is faster from the local NUMA node. Use `numactl` to bind processes to nodes.
  - Example: `numactl --cpunodebind=0 --membind=0 ./app` (run `app` on NUMA node 0 and allocate from its memory).
Hyper-Threading and Interrupt Handling
- Hyper-Threading (HT): Enable HT for workloads with high instruction-level parallelism (e.g., web servers), but consider disabling it for latency-critical apps, since sibling threads share core resources. Check with `lscpu | grep 'Thread(s) per core'`.
- Interrupt Coalescing: Reduce CPU overhead from network/disk interrupts by batching them (e.g., `ethtool -C eth0 rx-usecs 200` to set the coalescing delay).
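Affinity can be exercised on a live shell without root. A quick sketch (assuming util-linux's `taskset` is installed; the kernel confirms the pinning via `Cpus_allowed_list` in `/proc`):

```sh
#!/bin/sh
# Inspect, then narrow, the CPU affinity of the current shell.
taskset -pc $$                            # e.g. "pid 1234's current affinity list: 0-7"
taskset -pc 0 $$                          # pin this shell (and its children) to CPU 0
grep Cpus_allowed_list /proc/self/status  # kernel's view, inherited by children: "0"
```

Children started after the pin inherit the mask, which is how `taskset -c 0,1 ./database` confines a whole process tree.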
3. Memory Tuning: Reducing Latency and Waste
Memory bottlenecks (e.g., swapping, high page faults) cripple performance. Tune to minimize latency and maximize efficient use of RAM.
Page Cache and Swap Management
- Page Cache: Linux caches disk reads/writes in RAM to reduce I/O. Monitor the cache hit ratio:
  - Formula: `1 - (major page faults / total page faults)`. Aim for >99% on read-heavy workloads.
  - Adjust `vm.vfs_cache_pressure` (default 100): lower values (e.g., 50) favor keeping the cache; higher values (e.g., 200) reclaim it more aggressively.
- Swap: Use `vm.swappiness` (0–100) to control the kernel's tendency to swap. For memory-sensitive apps (databases), set `vm.swappiness=10` to minimize swapping; for desktop systems, the default of 60 is fine.
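The fault-ratio formula above can be computed directly from `/proc/vmstat`, whose `pgfault` counter tracks all page faults since boot and `pgmajfault` tracks the subset that required disk I/O:

```sh
#!/bin/sh
# Estimate page-cache effectiveness since boot: 1 - (major / total faults).
total=$(awk '/^pgfault /{ print $2 }' /proc/vmstat)
major=$(awk '/^pgmajfault /{ print $2 }' /proc/vmstat)
awk -v t="$total" -v m="$major" 'BEGIN { printf "fault hit ratio: %.4f\n", 1 - m / t }'
```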
Huge Pages and Transparent Huge Pages (THP)
- Huge Pages: Reduce TLB (Translation Lookaside Buffer) misses by using 2MB/1GB pages instead of 4KB pages. Critical for databases (e.g., PostgreSQL, Oracle).
  - Enable: `echo 1024 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages` (allocate 1024 x 2MB pages = 2GB).
- Transparent Huge Pages (THP): Auto-manages huge pages (enabled by default). Disable for latency-critical apps (e.g., Redis) via `echo never > /sys/kernel/mm/transparent_hugepage/enabled`.
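Before changing THP, check the current policy; the sysfs file lists every supported mode with the active one in brackets:

```sh
#!/bin/sh
# Show the current THP policy, e.g. "always [madvise] never".
cat /sys/kernel/mm/transparent_hugepage/enabled
# To disable (root only):
# echo never > /sys/kernel/mm/transparent_hugepage/enabled
```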
OOM Killer and Memory Overcommitment
- OOM Killer: Kills processes when the system runs out of memory. Prevent unintended kills by setting `oom_score_adj` for critical apps (e.g., `echo -1000 > /proc/<pid>/oom_score_adj`).
- Memory Overcommitment: Control with `vm.overcommit_memory`:
  - `0` (default): Heuristic overcommit.
  - `1`: Always overcommit (risky but useful for some HPC workloads).
  - `2`: Never overcommit (use `vm.overcommit_ratio` to set the allowed overcommit percentage).
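The `oom_score_adj` knob can be tried without root, because an unprivileged process may always make itself a *more* likely OOM victim (raising the value); lowering it, as you would to protect a database, requires `CAP_SYS_RESOURCE`. A sketch (assumes the shell starts at the usual value of 0):

```sh
#!/bin/sh
# Volunteer the current shell as an earlier OOM victim (no root needed).
cat /proc/self/oom_score_adj        # inherited value, typically 0
echo 500 > /proc/self/oom_score_adj # raising is allowed unprivileged
cat /proc/self/oom_score_adj        # now 500; children inherit it
# Protecting a service instead (root): echo -1000 > /proc/<pid>/oom_score_adj
```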
4. Disk I/O Tuning: Speeding Up Storage Access
Disk I/O is often the slowest link. Tune schedulers, filesystems, and block devices to reduce latency.
I/O Schedulers
The kernel uses I/O schedulers to order disk requests. Choose based on workload:
| Scheduler | Best For | How It Works |
|---|---|---|
| `deadline` | Latency-sensitive (databases, real-time apps) | Prioritizes requests by deadline to avoid starvation. |
| `cfq` (default) | General-purpose (desktops, mixed workloads) | Fairly distributes I/O bandwidth across processes. |
| `noop` | SSDs, RAID arrays with hardware controllers | Passes requests through unmodified (no reordering; the hardware handles optimization). |
- Set the scheduler temporarily: `echo deadline > /sys/block/sda/queue/scheduler`.
- Persist with udev rules: create `/etc/udev/rules.d/60-scheduler.rules` containing:
  `ACTION=="add|change", KERNEL=="sda", ATTR{queue/scheduler}="deadline"`
Filesystem Optimization
- Ext4/XFS: For most workloads, XFS handles large files better; Ext4 is more mature for small files.
  - Tune Ext4: Disable journaling for disposable, temporary storage (`mkfs.ext4 -O ^has_journal /dev/sdb1`).
  - Tune XFS: Increase the log buffer size (`mount -o logbsize=256k /dev/sdb1 /mnt`).
- Mount Options: Use `noatime` (never update access times) or `relatime` (update the access time only when it is older than the modification time) to reduce writes: `mount -o relatime /dev/sda1 /`
Block Device Tuning
- Read-Ahead: Preload data into the page cache. Increase for sequential reads (e.g., media servers): `blockdev --setra 8192 /dev/sda` (8192 x 512-byte sectors = 4MB).
- Queue Depth: Adjust `nr_requests` to match the storage's capabilities (SSDs tolerate deeper queues): `echo 256 > /sys/block/sda/queue/nr_requests` (default 128).
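Both knobs live under `/sys/block/<dev>/queue` and can be read without root, which makes it easy to survey current settings before changing anything. A sketch (device names vary by machine; virtual devices such as loop appear too):

```sh
#!/bin/sh
# List the current read-ahead and queue depth for every block device.
for q in /sys/block/*/queue; do
  [ -e "$q" ] || continue                    # no block devices visible
  dev=${q#/sys/block/}; dev=${dev%/queue}
  printf '%s: read_ahead_kb=%s nr_requests=%s\n' "$dev" \
    "$(cat "$q/read_ahead_kb" 2>/dev/null)" "$(cat "$q/nr_requests" 2>/dev/null)"
done
```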
5. Network Tuning: Enhancing Throughput and Reliability
Network bottlenecks often stem from misconfigured TCP stacks or hardware inefficiencies.
TCP/IP Stack Optimization
- TCP Buffers: Increase send/receive buffers to handle high bandwidth-delay products (e.g., long-distance links):

  sysctl -w net.core.rmem_max=16777216               # Max receive buffer (16MB)
  sysctl -w net.core.wmem_max=16777216               # Max send buffer (16MB)
  sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"  # Min/default/max receive
  sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"  # Min/default/max send

- TCP Keepalive: Detect dead connections faster (e.g., for load balancers):

  sysctl -w net.ipv4.tcp_keepalive_time=60     # Send keepalives after 60s idle
  sysctl -w net.ipv4.tcp_keepalive_intvl=10    # Interval between probes
  sysctl -w net.ipv4.tcp_keepalive_probes=3    # Probes before dropping the connection
Congestion Control Algorithms
- BBR (Bottleneck Bandwidth and RTT): Optimizes for high throughput and low latency (ideal for video streaming, cloud). Enable with `sysctl -w net.ipv4.tcp_congestion_control=bbr`.
- CUBIC: The Linux default; balances fairness and throughput for general use.
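Before switching to BBR, confirm the kernel actually offers it; `bbr` only appears in the available list once the `tcp_bbr` module is loaded or built in. A read-only sketch:

```sh
#!/bin/sh
# Which congestion-control algorithms does this kernel offer, and which is active?
cat /proc/sys/net/ipv4/tcp_available_congestion_control   # e.g. "reno cubic bbr"
cat /proc/sys/net/ipv4/tcp_congestion_control             # e.g. "cubic"
# Switch (root): modprobe tcp_bbr && sysctl -w net.ipv4.tcp_congestion_control=bbr
```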
Interrupt Offloading
Offload CPU-intensive tasks (checksum, segmentation) to network hardware:
ethtool -K eth0 tx-checksum-ipv4 on # Enable TX checksum offload
ethtool -K eth0 tso on gso on # Enable TCP segmentation offload
6. Kernel and System Configuration
Fine-tune the kernel and systemd to align with workload demands.
Kernel Parameters via sysctl
Persist sysctl changes in /etc/sysctl.d/99-tuning.conf:
# Increase file descriptors (for high-concurrency apps like Nginx)
fs.file-max=1000000
# TCP tuning
net.ipv4.tcp_congestion_control=bbr
net.core.somaxconn=4096 # Increase socket backlog
# Memory tuning
vm.swappiness=10
vm.min_free_kbytes=65536 # Reserve 64MB for critical kernel paths
Apply with sysctl --system.
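Every sysctl key maps to a file under `/proc/sys` (dots become slashes), so applied settings can be verified even on systems without the `sysctl` binary:

```sh
#!/bin/sh
# vm.swappiness -> /proc/sys/vm/swappiness, net.core.somaxconn -> /proc/sys/net/core/somaxconn, ...
cat /proc/sys/vm/swappiness
cat /proc/sys/net/core/somaxconn
cat /proc/sys/fs/file-max
```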
Boot-Time Tuning (GRUB)
Modify GRUB_CMDLINE_LINUX in /etc/default/grub for kernel boot parameters:
GRUB_CMDLINE_LINUX="intel_idle.max_cstate=1 elevator=deadline transparent_hugepage=never"
- `intel_idle.max_cstate=1`: Limit CPU idle states for lower wakeup latency (at the cost of power).
- `elevator=deadline`: Set the default I/O scheduler.
- `transparent_hugepage=never`: Disable THP at boot.

Update GRUB: `grub2-mkconfig -o /boot/grub2/grub.cfg` (RHEL/CentOS) or `update-grub` (Debian/Ubuntu).
Systemd Resource Management
Limit resources for non-critical services with systemd slices/units:
# /etc/systemd/system/app.service.d/limits.conf
[Service]
CPUQuota=50% # Limit to 50% CPU
MemoryLimit=1G # Max 1GB RAM
7. Advanced Monitoring and Profiling
Basic tools (top, iostat) show what is slow; advanced tools reveal why.
perf: CPU and Function-Level Profiling
- `perf top`: Real-time CPU usage by function (e.g., identify hot code paths in apps).
- `perf record -g -p <pid>`: Record call graphs for a process, then analyze with `perf report`.
- Example: Profile Nginx with `perf record -g -p $(pidof nginx)`, then run `perf report` to see where CPU time is spent.
bpftrace: eBPF-Powered Tracing
eBPF (Extended Berkeley Packet Filter) enables low-overhead kernel tracing. Use bpftrace for custom analysis:
- Trace disk I/O latency. The completion tracepoint carries no latency field, so record a timestamp at issue and build the histogram at completion:
  `bpftrace -e 'tracepoint:block:block_rq_issue { @start[args->dev, args->sector] = nsecs; } tracepoint:block:block_rq_complete /@start[args->dev, args->sector]/ { @usecs = hist((nsecs - @start[args->dev, args->sector]) / 1000); delete(@start[args->dev, args->sector]); }'`
- Find file opens by process:
  `bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s %s\n", comm, str(args->filename)); }'`
Long-Term Monitoring
- Prometheus + Grafana: Collect metrics (via `node_exporter`) and visualize trends (e.g., CPU usage, memory leaks).
- `sar` + `sadf`: Archive historical data: `sar -o /var/log/sa/sa$(date +%d)` writes daily logs; `sadf -d /var/log/sa/sa01` exports them as CSV.
8. Case Studies and Best Practices
Web Server Tuning (Nginx)
- Align worker processes with CPU cores: `worker_processes auto;` (Nginx config).
- Increase file descriptors: `worker_rlimit_nofile 100000;`.
- Tune TCP: Enable `tcp_nopush on;` and `tcp_nodelay on;` for better throughput/latency.
Database Tuning (PostgreSQL)
- Size the buffer cache: set `shared_buffers` to ~25% of RAM (instead of the 128MB default), and back it with huge pages (`huge_pages = try`).
- Set `effective_cache_size` to ~50% of RAM (guides the query planner).
- Use the `deadline` I/O scheduler and disable THP.
Best Practices
- Benchmark First: Use `sysbench` (CPU/memory), `fio` (disk), or `iperf` (network) to establish baselines.
- Tune Incrementally: Change one parameter at a time and re-benchmark.
- Document Everything: Log changes, metrics, and outcomes (e.g., "2024-05-01: vm.swappiness=10 → swap usage dropped 40%").
9. References
- Linux Kernel Documentation
- man pages: `man sysctl`, `man perf`, `man bpftrace`
- Brendan Gregg's Systems Performance (book)
- Red Hat: Performance Tuning Guide
- Nginx: Tuning Guide
- PostgreSQL: Performance Tuning
By mastering these techniques, you’ll transform Linux systems from “good enough” to “optimized for success.” Remember: performance tuning is a journey, not a destination—continuously monitor, adapt, and refine!