Introduction

Performance profiling identifies where your application spends its time and resources: on the CPU, allocating memory, waiting on I/O, or blocked on locks. Without profiling, optimization is guesswork. This article covers four tools: perf for system-level Linux profiling, flamegraphs for visualization, py-spy for profiling Python without code changes, and pprof for Go applications.

Performance Profiling: perf, Flamegraphs, py-spy, pprof

perf (Linux Profiler)

The built-in Linux profiler for CPU, hardware events, and tracepoints:

CPU profiling

perf record -F 99 -g ./myapp # Sample at 99Hz with call graphs

perf record -F 99 -p PID -g -- sleep 30 # Profile running process for 30s

perf report --stdio # Text report

perf report -g graph # Call graph report
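To make the `-F 99` sampling rate concrete, here is a small Go sketch of the arithmetic behind a sampled profile: how many samples one busy CPU should produce, and how a sample count maps back to approximate CPU time. The function names are hypothetical and the numbers are illustrative. A rate of 99 Hz is commonly preferred over 100 Hz so that sampling does not run in lockstep with periodic activity.

```go
package main

import "fmt"

// expectedSamples estimates how many samples a single fully busy CPU
// yields when sampled at freqHz for the given number of seconds.
func expectedSamples(freqHz int, seconds float64) int {
	return int(float64(freqHz) * seconds)
}

// estimatedCPUSeconds converts a sample count back into approximate
// on-CPU time: each sample stands for 1/freqHz seconds.
func estimatedCPUSeconds(samples, freqHz int) float64 {
	return float64(samples) / float64(freqHz)
}

func main() {
	// 30 seconds at 99 Hz on one busy CPU.
	fmt.Println(expectedSamples(99, 30)) // 2970

	// A function that appears in 990 samples was on-CPU for about 10 seconds.
	fmt.Println(estimatedCPUSeconds(990, 99)) // 10
}
```

Because the result is a statistical estimate, functions with only a handful of samples should be read as "rarely on-CPU" rather than as precise timings.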

Common events

perf stat ./myapp # Execution statistics

perf stat -e cache-misses ./myapp # Cache miss analysis

perf stat -e branch-misses ./myapp # Branch prediction

perf stat -e context-switches -p PID # Context switch monitoring

Hardware event sampling

perf record -e cycles -F 99 -a -g -- sleep 10 # System-wide CPU sampling

Tracepoints

perf record -e sched:sched_switch -a -g # Context switch tracing

perf record -e syscalls:sys_enter_write -a # Write syscall tracing

Top-like live view

perf top -p PID

perf top -e cache-misses

Generate flamegraph data

perf script > out.perf

Key metrics: cycles for CPU time, cache-misses for memory bottleneck detection, context-switches for contention issues.
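Raw counter values are usually easier to interpret as ratios. The Go sketch below assumes you have also collected the `instructions` and `cache-references` events (for example with `perf stat -e cycles,instructions,cache-references,cache-misses`) and derives instructions per cycle and the cache miss ratio. The counter values are hypothetical, not measurements.

```go
package main

import "fmt"

// ratio returns num/den, or 0 when the denominator is 0
// (for example, when an event was not counted).
func ratio(num, den float64) float64 {
	if den == 0 {
		return 0
	}
	return num / den
}

func main() {
	// Hypothetical counter values copied from perf stat output.
	cycles, instructions := 4.0e9, 6.0e9
	cacheRefs, cacheMisses := 2.0e8, 3.0e7

	fmt.Printf("instructions per cycle: %.2f\n", ratio(instructions, cycles))
	fmt.Printf("cache miss ratio: %.1f%%\n", 100*ratio(cacheMisses, cacheRefs))
}
```

A falling instructions-per-cycle figure together with a rising miss ratio is a hint, not proof, that the code is waiting on memory.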

Flamegraphs

Brendan Gregg's visualization for profiler output:

Install FlameGraph tools

git clone https://github.com/brendangregg/FlameGraph

Generate flamegraph from perf data

perf script | ./FlameGraph/stackcollapse-perf.pl > out.folded

./FlameGraph/flamegraph.pl out.folded > flamegraph.svg

Generate differential flamegraph (before/after)

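Before optimization, save the folded stacks as before.folded (for example, by renaming out.folded). After optimization: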

perf script | ./FlameGraph/stackcollapse-perf.pl > optimized.folded

./FlameGraph/difffolded.pl before.folded optimized.folded | ./FlameGraph/flamegraph.pl > diff.svg
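Conceptually, a differential flamegraph colors each frame by how its sample count changed between the two profiles. The Go sketch below illustrates that idea on in-memory data; it is a simplification and not a reimplementation of difffolded.pl. The stack names and counts are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// stackDelta returns, for every stack seen in either profile,
// the sample count after the change minus the count before it.
func stackDelta(before, after map[string]int) map[string]int {
	delta := make(map[string]int)
	for stack, n := range after {
		delta[stack] += n
	}
	for stack, n := range before {
		delta[stack] -= n
	}
	return delta
}

func main() {
	// Hypothetical folded stacks: "frame;frame;frame" -> sample count.
	before := map[string]int{
		"main;handle;parse":  60,
		"main;handle;render": 25,
	}
	after := map[string]int{
		"main;handle;parse":  20,
		"main;handle;render": 26,
		"main;handle;cache":  4,
	}

	delta := stackDelta(before, after)
	stacks := make([]string, 0, len(delta))
	for s := range delta {
		stacks = append(stacks, s)
	}
	sort.Strings(stacks) // stable output order

	for _, s := range stacks {
		fmt.Printf("%+4d %s\n", delta[s], s)
	}
}
```

Negative deltas mark stacks that got cheaper, positive deltas mark stacks that got more expensive or are new. Comparing raw counts like this is only meaningful when both profiles were captured with the same frequency and duration.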

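Reading flamegraphs: The x-axis shows the sample population, with frames sorted alphabetically; it does not represent the passage of time. Each rectangle is a stack frame (a function); the wider it is, the more samples included that frame, which for a CPU profile means more CPU time. The y-axis is stack depth. Look for wide rectangles at the top edge of the graph: those are the functions that were running on-CPU when the samples were taken.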
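The "wide top rectangle" rule can be checked directly against the folded file, where each line is a semicolon-separated stack followed by a sample count. The Go sketch below sums samples by leaf frame, which corresponds to the width of the top rectangles. The function name `selfSamples` and the sample data are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// selfSamples parses folded-stack text ("a;b;c 42" per line) and
// returns the number of samples in which each function was the
// leaf frame, that is, the function actually running.
func selfSamples(folded string) map[string]int {
	self := make(map[string]int)
	for _, line := range strings.Split(folded, "\n") {
		line = strings.TrimSpace(line)
		i := strings.LastIndex(line, " ")
		if i < 0 {
			continue // blank or malformed line
		}
		n, err := strconv.Atoi(line[i+1:])
		if err != nil {
			continue // count is not a number
		}
		frames := strings.Split(line[:i], ";")
		self[frames[len(frames)-1]] += n
	}
	return self
}

func main() {
	// Hypothetical contents of out.folded.
	folded := `main;handle;parse 60
main;handle;render 25
main;handle 10
main;gc 5`

	self := selfSamples(folded)
	names := make([]string, 0, len(self))
	total := 0
	for name, n := range self {
		names = append(names, name)
		total += n
	}
	// Hottest leaf functions first.
	sort.Slice(names, func(a, b int) bool { return self[names[a]] > self[names[b]] })

	for _, name := range names {
		pct := 100 * float64(self[name]) / float64(total)
		fmt.Printf("%-8s %3d samples (%.0f%%)\n", name, self[name], pct)
	}
}
```

In this made-up profile, `parse` is the leaf frame in 60 of 100 samples, so it would appear as the widest rectangle at the top of the graph and is the first candidate for optimization.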

For other languages:

JavaScript (Node.js)

perf record -F 99 -g -- node --perf-basic-prof app.js

perf script | ./FlameGraph/stackcollapse-perf.pl > out.folded

Python with py-spy

py-spy record -o profile.svg --pid $PID

Go with pprof

go tool pprof -http=:8080 "http://localhost:6060/debug/pprof/profile?seconds=30"

py-spy

Sampling profiler for Python without modifying code:

Installation

pip install py-spy

Profile a running process, or launch and profile a new one

py-spy record -o profile.svg --pid 12345

py-spy record -o profile.svg -- python myapp.py

Top-like live view

py-spy top --pid 12345

Dump current stack traces

py-spy dump --pid 12345

Profile specific duration

py-spy record -o profile.svg --pid 12345 --duration 30

With subprocesses

py-spy record --subprocesses -o profile.svg -- python myapp.py

Native frames

py-spy record --native -o profile.svg --pid 12345

Save raw data for later analysis

py-spy record -o profile.raw --pid 12345 --format raw

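Key advantages: no code changes required, works with running processes, safe for production (it only reads the target process's memory), and native code frame support via --native. Attaching to an existing process may require elevated privileges, depending on the platform.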

pprof (Go)

Go's built-in profiling tool:

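package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/ handlers on the default mux
)

func main() {
    // Start the pprof HTTP server, bound to localhost only
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()

    // Your application code...
}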

Collect profiles

go tool pprof "http://localhost:6060/debug/pprof/profile?seconds=30" # CPU

go tool pprof http://localhost:6060/debug/pprof/heap # Memory

go tool pprof http://localhost:6060/debug/pprof/goroutine # Goroutines

go tool pprof http://localhost:6060/debug/pprof/block # Blocking

go tool pprof http://localhost:6060/debug/pprof/mutex # Mutex contention
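One caveat for the block and mutex endpoints: the Go runtime leaves both profiles disabled by default, so they return empty data unless the program opts in. A minimal sketch of enabling them is shown below. The helper name `enableContentionProfiles` is hypothetical, and the rates of 1 (record every event) are chosen for illustration; they add overhead and are usually raised in production.

```go
package main

import (
	"fmt"
	"runtime"
)

// enableContentionProfiles turns on the block and mutex profiles.
// blockRateNs: on average one blocking event is sampled per this many
// nanoseconds spent blocked (1 records every event, 0 disables).
// mutexFraction: on average 1/mutexFraction contention events are
// reported (1 records every event, 0 disables).
// It returns the previous mutex profile fraction.
func enableContentionProfiles(blockRateNs, mutexFraction int) int {
	runtime.SetBlockProfileRate(blockRateNs)
	return runtime.SetMutexProfileFraction(mutexFraction)
}

func main() {
	// Call this early in main, before starting the pprof HTTP server.
	prev := enableContentionProfiles(1, 1)
	fmt.Println("previous mutex profile fraction:", prev)
}
```

With this in place, the `/debug/pprof/block` and `/debug/pprof/mutex` endpoints start reporting where goroutines wait.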

Interactive exploration

go tool pprof cpu.pprof

(pprof) top10 # Top 10 functions

(pprof) list myFunc # Source with line-level timing

(pprof) web # Open in browser (requires graphviz)

(pprof) pdf # Generate PDF

(pprof) peek myFunc # Caller/callee view

Web interface

go tool pprof -http=:8080 cpu.pprof
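The cpu.pprof file opened in the commands above is not created by any of the earlier steps. One way to produce it, for programs that do not run an HTTP server, is the standard `runtime/pprof` package. The sketch below is a minimal example under that assumption; `busyWork` and `profileToFile` are hypothetical names, and the workload is a placeholder for your own code.

```go
package main

import (
	"fmt"
	"log"
	"os"
	"runtime/pprof"
)

// busyWork is a placeholder CPU-bound workload.
func busyWork(n int) int {
	sum := 0
	for i := 0; i < n; i++ {
		sum += i % 7
	}
	return sum
}

// profileToFile runs work under the CPU profiler, writes the profile
// to path, and returns the size of the resulting file in bytes.
func profileToFile(path string, work func()) (int64, error) {
	f, err := os.Create(path)
	if err != nil {
		return 0, err
	}
	defer f.Close()

	if err := pprof.StartCPUProfile(f); err != nil {
		return 0, err
	}
	work()
	pprof.StopCPUProfile() // returns once the profile is fully written

	info, err := f.Stat()
	if err != nil {
		return 0, err
	}
	return info.Size(), nil
}

func main() {
	result := 0
	size, err := profileToFile("cpu.pprof", func() { result = busyWork(200000000) })
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("workload result %d, wrote cpu.pprof (%d bytes)\n", result, size)
}
```

After the program exits, `go tool pprof cpu.pprof` opens the file it wrote. Alternatively, a profile fetched from the HTTP endpoint can be saved to disk and opened the same way.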

Allocations profiling

go tool pprof -http=:8080 http://localhost:6060/debug/pprof/allocs

Compare profiles

go tool pprof -http=:8080 -diff_base=before.pprof after.pprof

Profiling Workflow

1. Identify the problem (slow response, high CPU, OOM)

2. Profile before optimizing to establish a baseline

perf record -F 99 -p $(pgrep myapp) -g -- sleep 30

3. Generate flamegraph (keep the folded stacks for the comparison in step 6)

perf script | stackcollapse-perf.pl > before.folded

flamegraph.pl before.folded > before.svg

4. Make the optimization

5. Profile again with the same parameters

perf record -F 99 -p $(pgrep myapp) -g -- sleep 30

perf script | stackcollapse-perf.pl > after.folded

flamegraph.pl after.folded > after.svg

6. Create differential flamegraph

difffolded.pl before.folded after.folded | flamegraph.pl > diff.svg

Comparison

| Tool | Language | Overhead | Best For |
|------|----------|----------|----------|
| perf | Any (system) | Low | CPU, cache misses, syscalls |
| Flamegraphs | Any (visualization) | None | Visual hotspot identification |
| py-spy | Python | Very low | Production Python profiling |
| pprof | Go | Low | Go CPU, memory, goroutines |
| Differential flamegraphs | Any (post-processing) | None | Comparative analysis |

Recommendations

  • Initial investigation: Use perf top to quickly identify CPU hotspots.

  • Detailed analysis: Collect perf data and generate flamegraphs for visual hotspot identification.

  • Python profiling: Use py-spy for production-safe sampling without code changes.

  • Go profiling: Use pprof with its web interface for interactive exploration.

  • Comparison: Use differential flamegraphs to verify optimization impact.

Profiling is an iterative process: identify hotspots, form a hypothesis, make a change, and re-profile to verify improvement. Flamegraphs make this loop faster by providing immediate visual feedback on where time is spent.