Performance Optimization
Layer 2: Design Choices
Core Question
What's the bottleneck, and is optimization worth it?
Before optimizing:
Have you measured? (Don't guess)
What's the acceptable performance?
Will optimization add complexity?
Performance Decision → Implementation
| Goal | Design Choice | Implementation |
|------|---------------|----------------|
| Reduce allocations | Pre-allocate, reuse | `with_capacity`, object pools |
| Improve cache | Contiguous data | `Vec`, `SmallVec` |
| Parallelize | Data parallelism | `rayon`, threads |
| Avoid copies | Zero-copy | References, `Cow` |
| Reduce indirection | Inline data | `smallvec`, arrays |
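The "Parallelize" row can be sketched with the standard library alone. The example below assumes a CPU-bound reduction over a slice and splits it across scoped threads (`std::thread::scope`, stable since Rust 1.63), which can borrow the input directly instead of copying it. `parallel_sum_of_squares` and its `workers` parameter are hypothetical names for illustration. With `rayon`, the same reduction is usually written as `data.par_iter().map(|&x| x * x).sum::<u64>()`, and rayon takes care of chunking and work stealing.

```rust
use std::thread;

/// Hypothetical helper: sums the squares of `data` by splitting it into
/// roughly equal chunks, one per worker thread. Scoped threads borrow
/// `data` directly, so the input is never copied. Overflow is not handled.
fn parallel_sum_of_squares(data: &[u64], workers: usize) -> u64 {
    let workers = workers.max(1);
    // Round up so every element lands in a chunk; `chunks(0)` would panic.
    let chunk_len = ((data.len() + workers - 1) / workers).max(1);
    thread::scope(|s| {
        // Spawn one thread per chunk; each computes a partial sum.
        let handles: Vec<_> = data
            .chunks(chunk_len)
            .map(|chunk| s.spawn(move || chunk.iter().map(|&x| x * x).sum::<u64>()))
            .collect();
        // Join all workers and combine their partial sums.
        handles.into_iter().map(|h| h.join().unwrap()).sum::<u64>()
    })
}

fn main() {
    let data: Vec<u64> = (1..=1_000).collect();
    let total = parallel_sum_of_squares(&data, 4);
    // The parallel result must match the sequential one.
    assert_eq!(total, data.iter().map(|&x| x * x).sum::<u64>());
    println!("{total}");
}
```

Parallelism only pays off when each chunk does enough work to amortize thread startup; for tiny inputs the sequential version is usually faster.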
Thinking Prompt
Before optimizing:
1. Have you measured?
   - Profile first → flamegraph, perf
   - Benchmark → criterion, cargo bench
   - Identify actual hotspots
2. What's the priority?
   - Algorithm (10x-1000x improvement)
   - Data structure (2x-10x)
   - Allocation (2x-5x)
   - Cache (1.5x-3x)
3. What's the trade-off?
   - Complexity vs speed
   - Memory vs CPU
   - Latency vs throughput
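Before reaching for criterion, a quick sanity check can be as small as the sketch below. `mean_time` is a hypothetical helper; it assumes a `--release` build and uses `std::hint::black_box` (stable since Rust 1.66) to discourage the optimizer from discarding the measured work. It does no warm-up, outlier rejection, or statistical analysis, so treat its numbers as rough; that is exactly the part criterion does properly.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical helper: runs `f` `iters` times and returns the mean
/// wall-clock time per call. Rough smoke test only, not a benchmark suite.
fn mean_time<T>(iters: u32, mut f: impl FnMut() -> T) -> Duration {
    assert!(iters > 0, "need at least one iteration");
    let start = Instant::now();
    for _ in 0..iters {
        // black_box is a best-effort hint that the result is "used".
        black_box(f());
    }
    start.elapsed() / iters
}

fn main() {
    let data: Vec<u64> = (0..100_000).collect();
    let per_call = mean_time(100, || data.iter().sum::<u64>());
    println!("sum over 100k elements: {per_call:?} per call");
}
```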
Trace Up ↑
To domain constraints (Layer 3):
"How fast does this need to be?"
↑ Ask: What's the performance SLA?
↑ Check: domain- (latency requirements)
↑ Check: Business requirements (acceptable response time)
| Constraint | Trace To | Ask |
|------------|----------|-----|
| Latency requirements | domain- | What's the acceptable response time? |
| Throughput needs | domain- | How many requests per second? |
| Memory constraints | domain- | What's the memory budget? |
Trace Down ↓
To implementation (Layer 1):
"Need to reduce allocations"
↓ m01-ownership: Use references, avoid clone
↓ m02-resource: Pre-allocate with_capacity
"Need to parallelize"
↓ m07-concurrency: Choose rayon or threads
↓ m07-concurrency: Consider async for I/O-bound
"Need cache efficiency"
↓ Data layout: Prefer Vec over HashMap when possible
↓ Access patterns: Sequential over random access
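For "Prefer Vec over HashMap when possible": when keys are small, dense integers, an index-addressed array or `Vec` can replace a map entirely and gives sequential, contiguous storage. The sketch below, with hypothetical function names, counts byte frequencies both ways. The array version assumes keys fit in `0..=255`; for sparse or unbounded keys a `HashMap` is still the right tool, and any speedup should be confirmed by benchmarking.

```rust
use std::collections::HashMap;

/// Byte histogram with a HashMap: every increment hashes the key and
/// probes a table whose entries are not stored in key order.
fn histogram_hashmap(bytes: &[u8]) -> HashMap<u8, u32> {
    let mut counts = HashMap::new();
    for &b in bytes {
        *counts.entry(b).or_insert(0) += 1;
    }
    counts
}

/// Same histogram with a fixed array indexed by byte value: keys are
/// dense (0..=255), so the whole "map" is one contiguous 1 KiB block
/// and each increment is a direct index with no hashing.
fn histogram_array(bytes: &[u8]) -> [u32; 256] {
    let mut counts = [0u32; 256];
    for &b in bytes {
        counts[b as usize] += 1;
    }
    counts
}

fn main() {
    let text = b"hello, cache-friendly world";
    let by_map = histogram_hashmap(text);
    let by_array = histogram_array(text);
    // Both representations agree on every key the map contains.
    for (&byte, &n) in &by_map {
        assert_eq!(by_array[byte as usize], n);
    }
    println!("'l' occurs {} times", by_array[b'l' as usize]);
}
```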
Quick Reference
| Tool | Purpose |
|------|---------|
| `cargo bench` | Micro-benchmarks |
| `criterion` | Statistical benchmarks |
| `perf` / `flamegraph` | CPU profiling |
| `heaptrack` | Allocation tracking |
| `valgrind` / `cachegrind` | Cache analysis |
Optimization Priority
1. Algorithm choice (10x - 1000x)
2. Data structure (2x - 10x)
3. Allocation reduction (2x - 5x)
4. Cache optimization (1.5x - 3x)
5. SIMD/Parallelism (2x - 8x)
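Algorithm choice tops the list because it changes how cost grows with input size, which no amount of allocation or cache tuning can match. As an illustration (function names are hypothetical), here is duplicate detection done with a pairwise O(n^2) scan and with a single pass over a `HashSet`, which is expected O(n) because hash lookups are average-case constant time.

```rust
use std::collections::HashSet;

/// O(n^2): compares each element with every later element.
fn has_duplicate_quadratic(items: &[u64]) -> bool {
    for (i, x) in items.iter().enumerate() {
        if items[i + 1..].contains(x) {
            return true;
        }
    }
    false
}

/// Expected O(n): one pass, remembering seen values in a hash set.
/// `HashSet::insert` returns false when the value was already present.
fn has_duplicate_linear(items: &[u64]) -> bool {
    let mut seen = HashSet::with_capacity(items.len());
    items.iter().any(|&x| !seen.insert(x))
}

fn main() {
    let unique: Vec<u64> = (0..5_000).collect();
    let mut dup = unique.clone();
    dup.push(42);
    assert!(!has_duplicate_quadratic(&unique));
    assert!(!has_duplicate_linear(&unique));
    assert!(has_duplicate_quadratic(&dup));
    assert!(has_duplicate_linear(&dup));
    println!("both versions agree");
}
```

Sorting a copy and checking adjacent elements (O(n log n)) is another option when extra hashing memory is a concern.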
Common Techniques
| Technique | When | How |
|-----------|------|-----|
| Pre-allocation | Known size | `Vec::with_capacity(n)` |
| Avoid cloning | Hot paths | Use references or `Cow` |
| Batch operations | Many small ops | Collect then process |
| SmallVec | Usually small | `smallvec::SmallVec<[T; N]>` |
| Inline buffers | Fixed-size data | Arrays over `Vec` |
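A sketch combining pre-allocation with batching: one `Vec::with_capacity` for an output whose size is known up front, plus a scratch buffer that collects each line's values before processing them and is reused across iterations (`clear` keeps the allocation). `line_medians` is a hypothetical example; it assumes whitespace-separated integers and silently skips tokens that fail to parse.

```rust
/// Returns the median of each line's whitespace-separated integers
/// (the lower median for an even count, `None` for a line with no numbers).
fn line_medians(lines: &[&str]) -> Vec<Option<i64>> {
    // Known size: exactly one result per input line, so allocate once.
    let mut medians = Vec::with_capacity(lines.len());
    // Scratch buffer reused for every line; `clear` empties it but keeps
    // its capacity, so later lines usually need no new allocation.
    let mut scratch: Vec<i64> = Vec::new();
    for line in lines {
        scratch.clear();
        // Collect first (unparsable tokens are skipped), then process.
        scratch.extend(line.split_whitespace().filter_map(|t| t.parse::<i64>().ok()));
        scratch.sort_unstable();
        let median = if scratch.is_empty() {
            None
        } else {
            Some(scratch[(scratch.len() - 1) / 2])
        };
        medians.push(median);
    }
    medians
}

fn main() {
    let lines = ["3 1 2", "10 40 20 30", "", "7"];
    let medians = line_medians(&lines);
    assert_eq!(medians, vec![Some(2), Some(20), None, Some(7)]);
    println!("{medians:?}");
}
```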
Common Mistakes
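| Mistake | Why Wrong | Better |
|---------|-----------|--------|
| Optimize without profiling | Wrong target | Profile first |
| Benchmark in debug mode | Unoptimized code; timings don't reflect release builds | Always use `--release` |
| Use `LinkedList` | Cache unfriendly | `Vec` or `VecDeque` |
| Hidden `.clone()` | Unnecessary allocations | Use references |
| Premature optimization | Wasted effort | Make it work first |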
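For the hidden-`.clone()` row: when a function only sometimes needs to modify its input, `Cow` lets it return the input borrowed and unchanged, allocating only in the modifying case. `expand_tabs` below is a hypothetical example that assumes a four-space tab width.

```rust
use std::borrow::Cow;

/// Replaces tabs with four spaces. Allocates a new String only when the
/// input actually contains a tab; otherwise returns the input borrowed.
fn expand_tabs(input: &str) -> Cow<'_, str> {
    if input.contains('\t') {
        Cow::Owned(input.replace('\t', "    "))
    } else {
        Cow::Borrowed(input)
    }
}

fn main() {
    let clean = expand_tabs("no tabs here");
    let dirty = expand_tabs("a\tb");
    // No allocation on the common path.
    assert!(matches!(clean, Cow::Borrowed(_)));
    assert!(matches!(dirty, Cow::Owned(_)));
    assert_eq!(dirty, "a    b");
    println!("{clean} / {dirty}");
}
```

Callers that only read the result can use it like a `&str` through `Deref`; callers that need ownership can call `into_owned()`, paying for the copy only then.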
Anti-Patterns
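| Anti-Pattern | Why Bad | Better |
|--------------|---------|--------|
| Clone to avoid lifetimes | Performance cost | Proper ownership |
| Box everything | Indirection cost | Stack when possible |
| HashMap for small sets | Overhead | Vec with linear search |
| String concat in loop (e.g. `s = format!("{s}{x}")`) | Copies the accumulated string every iteration: O(n^2) | `String::with_capacity` + `push_str`, or a single `format!` |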
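The quadratic case in the last row comes from rebuilding the accumulator on every iteration, as `format!("{out}{x},")` does. Appending to one owned `String` (`push_str`, `+=`, or `write!`) is amortized linear, and reserving capacity up front avoids most regrowth. A minimal sketch with hypothetical function names, assuming comma-terminated output is acceptable:

```rust
use std::fmt::Write;

/// Quadratic pattern: each `format!` copies everything built so far.
fn join_slow(items: &[u32]) -> String {
    let mut out = String::new();
    for x in items {
        out = format!("{out}{x},");
    }
    out
}

/// Linear pattern: reserve once, then append in place.
fn join_fast(items: &[u32]) -> String {
    // Upper bound: at most 10 digits plus a comma per u32.
    let mut out = String::with_capacity(items.len() * 11);
    for x in items {
        // Writing into a String cannot fail, so unwrap never panics here.
        write!(out, "{x},").unwrap();
    }
    out
}

fn main() {
    let items: Vec<u32> = (0..1_000).collect();
    assert_eq!(join_slow(&items), join_fast(&items));
    println!("{}", join_fast(&[1, 2, 3]));
}
```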