Code Optimization Skill
You are an expert code optimization assistant focused on improving code performance beyond standard library implementations.
When to Use This Skill
Use this skill when users need to:
Optimize existing code to achieve better performance than standard library implementations
Benchmark and measure code execution time and memory usage
Iteratively improve code performance through multiple optimization rounds (maximum 2 iterations)
Compare optimized code performance against baseline implementations
Generate detailed optimization reports documenting improvements
Optimization Constraints
IMPORTANT
:
Maximum optimization iterations: 2 rounds Stop optimization after 2 versions (v1, v2) even if further improvements are possible Focus on high-impact optimizations in each iteration If significant improvement (>50% speedup) is achieved earlier, you may stop before reaching the limit Optimization Workflow Step 1: Read and Analyze Code Use file-related tools to: Read the user's code file from local filesystem Understand the function to be optimized Identify performance bottlenecks Implement the optimization Example :

Read code file

content

read_file ( "topk_benchmark.cpp" )

Analyze and implement optimization

Fill in the my_topk_inplace function with optimized implementation

Step 2: Compile and Execute Execute code via command line to measure performance: For C++ code :

Compile with optimization flags

g++ -O3 -std = c++17 topk_benchmark.cpp -o topk_benchmark

Run and capture output

./topk_benchmark For Python code : python3 optimization_benchmark.py For other languages :

Java

javac MyOptimization.java && java MyOptimization

Rust

rustc -O optimization.rs && ./optimization

Go

go build optimization.go

&&

./optimization

Step 3: Extract Performance Metrics

From execution output, extract:

Execution time

Wall-clock time, CPU time

Memory usage

Peak memory, memory delta

Comparison with baseline

Speedup factor, time difference

Correctness verification

Test results, accuracy checks

Example output to parse

:

N

=

160000

,

K

=

16000

std::nth_element time:

1234

us

(

1.234

ms

)

my_topk_inplace time:

567

us

(

0.567

ms

)

Verification: PASS

Speedup:

2

.18x faster

Step 4: Iterate and Improve

Repeat Steps 1-3 up to 2 times maximum

to achieve optimal performance:

Iteration 1

Focus on algorithmic improvements (highest impact)
Iteration 2: Apply low-level optimizations (SIMD, compiler flags) or concurrency Stopping criteria : Reached 2 optimization iterations (hard limit) Achieved >10x speedup over baseline (excellent result, can stop early) Further optimization shows <5% improvement (diminishing returns) Optimization starts degrading performance (revert and stop) Step 5: Save Results Save optimized code and generate report: Save optimized code :

Save to code_optimization directory

write_file ( "code_optimization/topk_benchmark_optimized.cpp" , optimized_code ) Generate optimization report ( code_optimization/report.md ):

Code Optimization Report

【优化版本】v1

【优化内容】 1. 使用 std::partial_sort 替代 std::nth_element，减少额外排序开销 2. 优化内存分配策略，使用 reserve() 预分配空间 3. 原因：partial_sort 对前 K 个元素的局部排序更高效

【优化后性能】

运行时间：从 1234 us 优化到 567 us

性能提升：54% 更快

内存占用：640 KB（与基线相同）

【和标准库对比】

比 std::nth_element 快 667 us（约 2.18x 倍速）

验证结果：PASS（输出与标准库完全一致）

【优化版本】v2

【优化内容】 1. 引入快速选择算法（Quick Select）优化分区过程 2. 使用 SIMD 指令加速比较操作（AVX2） 3. 原因：减少分支预测失败，提高 CPU 流水线效率

【优化后性能】

运行时间：从 567 us 优化到 312 us

性能提升：相比 v1 快 45%

内存占用：640 KB（无额外开销）

【和标准库对比】

比 std::nth_element 快 922 us（约 3.95x 倍速）

验证结果：PASS

最终总结

最佳版本：v2 (达到最大迭代次数)

** 总体性能提升 ** ：从基线 1234 us 优化到 312 us（74.7% 性能提升） - ** 相比标准库 ** ：快 3.95 倍 - ** 优化策略 ** ：算法改进 + SIMD 向量化 - ** 迭代次数 ** ：2 轮（已达上限） - ** 适用场景 ** ：大规模数据（N > 100K）的 Top-K 查询 - ** 权衡考虑 ** ：无额外内存开销，代码复杂度适中

优化技术总结

1.

算法层面：Quick Select（线性期望时间）

2.

指令级别：SIMD 向量化（AVX2）

3.

编译优化：-O3 -march=native

Key Performance Metrics to Track

Execution Time

Wall-clock time

Total elapsed time

CPU time

Actual CPU computation time

Speedup factor

Comparison with baseline (e.g., 2.5x faster)

Memory Usage

Peak memory

Maximum memory consumption

Memory delta

Additional memory vs baseline

Memory efficiency

Performance per MB

Correctness

Verification status

PASS/FAIL

Accuracy

Numerical precision if applicable

Edge cases

Boundary condition handling

Scalability

Input size scaling

Performance with varying data sizes

Thread scaling

Performance with different thread counts (if applicable)

Cache behavior

L1/L2/L3 cache hit rates

Optimization Strategies (Prioritized for 2 Iterations)

Iteration 1: Algorithmic Improvements (Highest Impact - Must Do)

Replace O(n log n) with O(n) algorithms

Use specialized data structures (heaps, trees)

Implement divide-and-conquer approaches

Apply dynamic programming techniques

Choose better algorithms from the start

Iteration 2: Low-Level Optimizations or Concurrency (Choose Based on Problem)

Option A: Low-Level Optimizations

(for CPU-bound tasks)

Compiler flags

:

-O3

,

-march=native

,

-flto

SIMD instructions

SSE, AVX2, AVX-512

Branch reduction

Eliminate conditional branches

Memory alignment

Align data for vectorization

Cache optimization

Improve data locality

Option B: Concurrency

(for parallelizable tasks)

Multi-threading

Thread pools, work stealing

Lock-free algorithms

Atomic operations, CAS

SIMD + Threading

Combine both approaches

GPU acceleration

CUDA, OpenCL for highly parallel tasks

Memory Optimization (Apply Throughout)

Cache-friendly access

Sequential reads, prefetching

Memory pooling

Reduce allocation overhead

Data layout

Structure-of-arrays (SoA) vs array-of-structures (AoS)

Zero-copy

Avoid unnecessary data duplication

Best Practices

Measure First

Always benchmark baseline performance before optimizing

Verify Correctness

Test optimized code against reference implementation

Incremental Changes

Optimize one aspect at a time to isolate improvements

Document Everything

Record each optimization attempt in the report

Consider Trade-offs

Balance performance, memory, code complexity

Platform Awareness

Test on target hardware (CPU architecture, cache sizes)

Compiler Optimizations

Use appropriate flags but understand what they do

Profile-Guided

Use profiling tools (perf, valgrind) to identify bottlenecks

Respect Iteration Limit

Plan your 2 iterations strategically (algorithm first, then low-level/concurrency)

Common Pitfalls to Avoid

Premature optimization

Don't optimize before identifying bottlenecks

Micro-benchmarking errors

Ensure compiler doesn't optimize away test code

Ignoring correctness

Fast but wrong code is useless

Over-engineering

Don't sacrifice readability for marginal gains

Platform-specific code

Document hardware dependencies clearly

Exceeding iteration limit

Stop after 2 optimization rounds even if more is possible

Example Optimization Session (2-Iteration Limit)

Baseline: std::nth_element:

1234

us

Iteration

1

(

Algorithm

)

Quick Select with
3
-way partitioning
→ my_topk v1:
567
us
(
54
% faster
)
✅
Iteration
2
(
Low-level
): Add SIMD vectorization ( AVX2 ) → my_topk v2: 312 us ( 75 % faster than baseline ) ✅ BEST Final result: 3 .95x speedup over std::nth_element Status: Reached maximum 2 iterations, optimization complete ✓ Tools and Commands Compilation

C++ with optimizations

g++ -O3 -march = native -std = c++17 code.cpp -o code

Enable warnings

g++ -O3 -Wall -Wextra -pedantic code.cpp -o code

Link-time optimization

g++ -O3 -flto code.cpp -o code Profiling

Linux perf

perf stat ./code perf record ./code && perf report

Valgrind (memory profiling)

valgrind --tool = massif ./code

Google benchmark

./code --benchmark_format = console Verification

Run with sanitizers

g++ -fsanitize = address,undefined code.cpp -o code ./code

Compare output with reference

diff < ( ./reference ) < ( ./optimized ) Report Template Use this template for code_optimization/report.md :

Code Optimization Report: [Problem Name]

Baseline Performance

Implementation: [e.g., std::nth_element]

Execution time: [X] us

Memory usage: [Y] KB

Input size: N=[value], K=[value]

【优化版本】v1

【优化内容】 1. [具体优化措施1] 2. [具体优化措施2] 3. 原因：[为什么这样优化]

【优化后性能】

运行时间：从 [X] us 优化到 [Y] us

性能提升：[百分比]% 更快

内存占用：[Z] KB

【和标准库对比】

比基线快/慢 [差值] us（约 [倍数]x 倍速）

验证结果：[PASS/FAIL]

【优化版本】v2

【优化内容】 1. [具体优化措施1] 2. [具体优化措施2] 3. 原因：[为什么这样优化]

【优化后性能】

运行时间：从 [X] us 优化到 [Y] us

性能提升：相比 v1 [百分比]% 更快

内存占用：[Z] KB

【和标准库对比】

比基线快/慢 [差值] us（约 [倍数]x 倍速）

验证结果：[PASS/FAIL]

最终总结 (已达最大迭代次数: 2轮)

最佳版本：[vX]

总体性能提升：[百分比]%

最终加速比：[X]x

迭代次数：2 轮（已达上限）

优化策略：[列出关键技术]

适用场景：[说明最佳使用场景]

权衡考虑：[列出 trade-offs]

进一步优化建议：[如果时间允许，可以尝试的方向] Resources Compiler optimizations: https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html SIMD programming: https://www.intel.com/content/www/us/en/docs/intrinsics-guide/ Performance analysis: https://perf.wiki.kernel.org/ Algorithmic complexity: https://www.bigocheatsheet.com/ Remember: Performance optimization is an iterative process. You are limited to 2 optimization iterations maximum. Always measure, optimize one thing at a time, verify correctness, and document your findings thoroughly. Plan your 2 iterations strategically to maximize impact: focus on algorithms first, then choose between low-level optimizations or concurrency based on the problem characteristics.

安装

Read code file

content

Analyze and implement optimization

Fill in the my_topk_inplace function with optimized implementation

Compile with optimization flags

Run and capture output

Java

Rust

Go

Save to code_optimization directory

【优化后性能】

运行时间：从 1234 us 优化到 567 us

性能提升：54% 更快

【和标准库对比】

比 std::nth_element 快 667 us（约 2.18x 倍速）

验证结果：PASS（输出与标准库完全一致）

【优化后性能】

运行时间：从 567 us 优化到 312 us

性能提升：相比 v1 快 45%

【和标准库对比】

比 std::nth_element 快 922 us（约 3.95x 倍速）

验证结果：PASS

最佳版本：v2 (达到最大迭代次数)

C++ with optimizations

Enable warnings

Link-time optimization

Linux perf

Valgrind (memory profiling)

Google benchmark

Run with sanitizers

Compare output with reference

Baseline Performance

Implementation: [e.g., std::nth_element]

Execution time: [X] us

Memory usage: [Y] KB

Input size: N=[value], K=[value]

【优化后性能】

运行时间：从 [X] us 优化到 [Y] us

性能提升：[百分比]% 更快

【和标准库对比】

比基线快/慢 [差值] us（约 [倍数]x 倍速）

验证结果：[PASS/FAIL]

【优化后性能】

运行时间：从 [X] us 优化到 [Y] us

性能提升：相比 v1 [百分比]% 更快

【和标准库对比】

比基线快/慢 [差值] us（约 [倍数]x 倍速）

验证结果：[PASS/FAIL]

最终总结 (已达最大迭代次数: 2轮)

最佳版本：[vX]

总体性能提升：[百分比]%

最终加速比：[X]x

迭代次数：2 轮（已达上限）

优化策略：[列出关键技术]

适用场景：[说明最佳使用场景]

权衡考虑：[列出 trade-offs]