Introduction
Performance profiling identifies where your application spends its time — CPU, memory, I/O, or blocking. Without profiling, optimization is guesswork. This article covers four profiling approaches: perf for system-level Linux profiling, flamegraphs for visualization, py-spy for Python without code changes, and pprof for Go applications.

perf (Linux Profiler)
The built-in Linux profiler for CPU, hardware events, and tracepoints:
CPU profiling
perf record -F 99 -g ./myapp # Sample at 99Hz with call graphs
perf record -F 99 -p PID -g -- sleep 30 # Profile a running process for 30s
perf report --stdio # Text report
perf report -g graph # Call graph report
Common events
perf stat ./myapp # Execution statistics
perf stat -e cache-misses ./myapp # Cache miss analysis
perf stat -e branch-misses ./myapp # Branch prediction
perf stat -e context-switches -p PID # Context switch monitoring
Hardware event sampling
perf record -e cycles -F 99 -a -g -- sleep 10 # System-wide CPU sampling for 10s
Tracepoints
perf record -e sched:sched_switch -a -g # Context switch tracing
perf record -e syscalls:sys_enter_write -a # Write syscall tracing
Top-like live view
perf top -p PID
perf top -e cache-misses
Generate flamegraph data
perf script > out.perf
Key metrics: cycles approximates CPU time, cache-misses points to memory-access bottlenecks, and context-switches indicates scheduling or lock contention.
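To see the cache-misses counter respond to an access pattern, it helps to have a workload with a known difference. The sketch below is a hypothetical Go program (the function names, the row/col argument, and the grid size are illustrative assumptions) that sums the same grid in row order or in column order. The column-order walk jumps to a different row on every step, so it would typically report more cache misses under perf stat, though the exact numbers depend on the CPU and its cache sizes.

```go
package main

import (
	"fmt"
	"os"
)

// gridSize is an arbitrary illustrative value: 2048 x 2048 int64 cells is
// 32 MiB, which is larger than most CPU caches.
const gridSize = 2048

// newGrid builds a size x size grid where cell (i, j) holds i + j.
func newGrid(size int) [][]int64 {
	grid := make([][]int64, size)
	for i := range grid {
		grid[i] = make([]int64, size)
		for j := range grid[i] {
			grid[i][j] = int64(i + j)
		}
	}
	return grid
}

// sumRowMajor visits cells in the order they are laid out within each row.
func sumRowMajor(grid [][]int64) int64 {
	var total int64
	for i := range grid {
		for j := range grid[i] {
			total += grid[i][j]
		}
	}
	return total
}

// sumColMajor visits one cell per row before moving to the next column.
// It assumes every row has the same length as the first row.
func sumColMajor(grid [][]int64) int64 {
	if len(grid) == 0 {
		return 0
	}
	var total int64
	for j := 0; j < len(grid[0]); j++ {
		for i := range grid {
			total += grid[i][j]
		}
	}
	return total
}

func main() {
	// Pick the traversal from the first argument: "row" (default) or "col".
	mode := "row"
	if len(os.Args) > 1 {
		mode = os.Args[1]
	}
	grid := newGrid(gridSize)
	if mode == "col" {
		fmt.Println("col-major sum:", sumColMajor(grid))
		return
	}
	fmt.Println("row-major sum:", sumRowMajor(grid))
}
```

Built as a binary named gridsum (a made-up name), the two variants could be compared with perf stat -e cache-misses ./gridsum row and perf stat -e cache-misses ./gridsum col. Both runs compute the same sum, so any difference in the counter comes from the traversal order.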
Flamegraphs
Brendan Gregg's visualization for profiler output:
Install FlameGraph tools
git clone https://github.com/brendangregg/FlameGraph
Generate flamegraph from perf data
perf script | ./FlameGraph/stackcollapse-perf.pl > out.folded
./FlameGraph/flamegraph.pl out.folded > flamegraph.svg
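Generate differential flamegraph (before/after)
Before optimization:
perf script | ./FlameGraph/stackcollapse-perf.pl > before.folded
After optimization (record a fresh profile with perf first):
perf script | ./FlameGraph/stackcollapse-perf.pl > optimized.folded
./FlameGraph/difffolded.pl before.folded optimized.folded | ./FlameGraph/flamegraph.pl > diff.svg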
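Reading flamegraphs: The x-axis shows the sampled stack population, sorted alphabetically; it does not represent the passage of time. Each rectangle is a stack frame (a function), and its width is proportional to the number of samples in which that frame appeared. The y-axis is stack depth. Look for wide rectangles at the top of a stack: those are the functions that were on-CPU most often.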
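The widths described above come straight from the folded-stack file, where each line is a semicolon-separated stack followed by a sample count. As a rough sketch of how the "wide top rectangle" reading maps onto that data, the Go program below sums samples per leaf frame (the last frame of each stack). The function name leafSamples and the sample stacks are made up for illustration.

```go
package main

import (
	"bufio"
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// leafSamples sums sample counts per leaf frame (the last frame of each
// stack) from folded-stack text of the form "frame1;frame2;frame3 count".
func leafSamples(folded string) (map[string]int, error) {
	totals := make(map[string]int)
	sc := bufio.NewScanner(strings.NewReader(folded))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue
		}
		// The count follows the last space; frame names may contain spaces.
		idx := strings.LastIndex(line, " ")
		if idx < 0 {
			return nil, fmt.Errorf("missing sample count: %q", line)
		}
		count, err := strconv.Atoi(line[idx+1:])
		if err != nil {
			return nil, fmt.Errorf("bad sample count in %q: %w", line, err)
		}
		frames := strings.Split(line[:idx], ";")
		totals[frames[len(frames)-1]] += count
	}
	return totals, sc.Err()
}

func main() {
	// Made-up sample data standing in for the contents of out.folded.
	folded := `main;handleRequest;parseJSON 70
main;handleRequest;render 20
main;runtime.gc 10
main;worker;parseJSON 5`

	totals, err := leafSamples(folded)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Print leaf frames from most to least sampled.
	names := make([]string, 0, len(totals))
	for name := range totals {
		names = append(names, name)
	}
	sort.Slice(names, func(i, j int) bool { return totals[names[i]] > totals[names[j]] })
	for _, name := range names {
		fmt.Printf("%-12s %d\n", name, totals[name])
	}
}
```

In this sample, parseJSON appears as the leaf of two different stacks, so it would show up as two separate towers in the flamegraph even though it is the single hottest function overall. Aggregating by leaf frame is one way to catch that.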
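For other languages:
JavaScript (Node.js)
node --perf-basic-prof app.js # Writes /tmp/perf-PID.map so perf can resolve JIT frames
perf record -F 99 -p PID -g -- sleep 30 # Sample the running Node.js process
perf script | ./FlameGraph/stackcollapse-perf.pl > out.folded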
Python with py-spy
py-spy record -o profile.svg --pid $PID
Go with pprof
go tool pprof -http=:8080 "http://localhost:6060/debug/pprof/profile?seconds=30"
py-spy
Sampling profiler for Python without modifying code:
Installation
pip install py-spy
Profile a running process, or launch a new one under py-spy
py-spy record -o profile.svg --pid 12345
py-spy record -o profile.svg -- python myapp.py
Top-like live view
py-spy top --pid 12345
Dump current stack traces
py-spy dump --pid 12345
Profile specific duration
py-spy record -o profile.svg --pid 12345 --duration 30
With subprocesses
py-spy record --subprocesses -o profile.svg -- python myapp.py
Native frames
py-spy record --native -o profile.svg --pid 12345
Save raw data for later analysis
py-spy record -o profile.raw --pid 12345 --format raw
Key advantages: no code changes required, attaches to already-running processes, safe for production because it only reads the target process's memory, and optional native (C extension) frame support.
pprof (Go)
Go's built-in profiling tool:
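package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // Registers the /debug/pprof/ handlers on http.DefaultServeMux
)

func main() {
	// Start the pprof HTTP server in the background
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
	// Your application code...
}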
Collect profiles
go tool pprof "http://localhost:6060/debug/pprof/profile?seconds=30" # CPU
go tool pprof http://localhost:6060/debug/pprof/heap # Memory
go tool pprof http://localhost:6060/debug/pprof/goroutine # Goroutines
go tool pprof http://localhost:6060/debug/pprof/block # Blocking (empty unless runtime.SetBlockProfileRate is called)
go tool pprof http://localhost:6060/debug/pprof/mutex # Mutex contention (empty unless runtime.SetMutexProfileFraction is called)
Interactive exploration
go tool pprof cpu.pprof
(pprof) top10 # Top 10 functions
(pprof) list myFunc # Source with line-level timing
(pprof) web # Open in browser (requires graphviz)
(pprof) pdf # Generate PDF
(pprof) peek myFunc # Caller/callee view
Web interface
go tool pprof -http=:8080 cpu.pprof
Allocations profiling
go tool pprof -http=:8080 http://localhost:6060/debug/pprof/allocs
Compare profiles
go tool pprof -http=:8080 -diff_base=before.pprof after.pprof
Profiling Workflow
1. Identify the problem (slow response, high CPU, OOM)
2. Profile before optimizing to capture a baseline
perf record -F 99 -p $(pgrep myapp) -g -- sleep 30
3. Generate the baseline flamegraph
perf script | stackcollapse-perf.pl > before.folded
flamegraph.pl before.folded > before.svg
4. Make the optimization
5. Profile again with the same parameters
perf record -F 99 -p $(pgrep myapp) -g -- sleep 30
perf script | stackcollapse-perf.pl > after.folded
flamegraph.pl after.folded > after.svg
6. Create a differential flamegraph
difffolded.pl before.folded after.folded | flamegraph.pl > diff.svg
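The comparison in the last step boils down to subtracting per-stack sample counts between two folded files. The Go sketch below illustrates that idea on made-up data; it is not a reimplementation of difffolded.pl, and the names parseFolded and stackDeltas are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// parseFolded reads folded-stack text ("frame1;frame2 count" per line)
// into a map from full stack to sample count.
func parseFolded(text string) (map[string]int, error) {
	counts := make(map[string]int)
	for _, line := range strings.Split(text, "\n") {
		line = strings.TrimSpace(line)
		if line == "" {
			continue
		}
		idx := strings.LastIndex(line, " ")
		if idx < 0 {
			return nil, fmt.Errorf("missing sample count: %q", line)
		}
		n, err := strconv.Atoi(line[idx+1:])
		if err != nil {
			return nil, fmt.Errorf("bad sample count in %q: %w", line, err)
		}
		counts[line[:idx]] += n
	}
	return counts, nil
}

// stackDeltas returns after minus before for every stack seen in either
// profile. Negative values mean fewer samples after the change.
func stackDeltas(before, after map[string]int) map[string]int {
	deltas := make(map[string]int)
	for stack, n := range after {
		deltas[stack] += n
	}
	for stack, n := range before {
		deltas[stack] -= n
	}
	return deltas
}

func main() {
	// Made-up stand-ins for before.folded and after.folded.
	before, err := parseFolded("main;handle;parseJSON 70\nmain;handle;render 20")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	after, err := parseFolded("main;handle;parseJSON 30\nmain;handle;render 22\nmain;handle;cacheLookup 4")
	if err != nil {
		fmt.Println("error:", err)
		return
	}

	deltas := stackDeltas(before, after)
	stacks := make([]string, 0, len(deltas))
	for s := range deltas {
		stacks = append(stacks, s)
	}
	// Largest reductions first; ties are broken by stack name.
	sort.Slice(stacks, func(i, j int) bool {
		if deltas[stacks[i]] != deltas[stacks[j]] {
			return deltas[stacks[i]] < deltas[stacks[j]]
		}
		return stacks[i] < stacks[j]
	})
	for _, s := range stacks {
		fmt.Printf("%+5d  %s\n", deltas[s], s)
	}
}
```

Raw sample deltas are only comparable when both profiles were recorded with the same frequency and duration, which is why step 5 repeats the parameters from step 2.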
Comparison
| Tool | Language | Overhead | Best For |
|------|----------|----------|----------|
| perf | Any (system) | Low | CPU, cache misses, syscalls |
| Flamegraphs | Any (visualization) | None (post-processing) | Visual hotspot identification |
| py-spy | Python | Very low | Production Python profiling |
| pprof | Go | Low | Go CPU, memory, goroutines |
| Differential flamegraphs | Any (post-processing) | None (post-processing) | Before/after comparison |
Recommendations
- Initial investigation: Use perf top to quickly identify CPU hotspots.
- Detailed analysis: Collect perf data and generate flamegraphs for visual hotspot identification.
- Python profiling: Use py-spy for production-safe sampling without code changes.
- Go profiling: Use pprof with its web interface for interactive exploration.
- Comparison: Use differential flamegraphs to verify optimization impact.
Profiling is an iterative process: identify hotspots, form a hypothesis, make a change, and re-profile to verify improvement. Flamegraphs make this loop faster by providing immediate visual feedback on where time is spent.