Python Performance Optimization

Overview

Master performance optimization in Python. Learn to profile code, identify bottlenecks, optimize algorithms, manage memory efficiently, and leverage high-performance libraries for compute-intensive tasks.

Learning Objectives

- Profile Python code to identify bottlenecks
- Optimize algorithms and data structures
- Manage memory efficiently
- Use compiled extensions (Cython, NumPy)
- Implement caching strategies
- Parallelize CPU-bound operations
- Benchmark and measure improvements

Core Topics

1. Profiling & Benchmarking

- timeit module for micro-benchmarks
- cProfile for function-level profiling
- line_profiler for line-by-line analysis
- memory_profiler for memory usage
- py-spy for production profiling
- Flame graphs and visualization

Code Example:

    import timeit
    import cProfile
    import pstats

    # 1. timeit for micro-benchmarks
    def list_comprehension():
        return [x ** 2 for x in range(1000)]

    def map_function():
        return list(map(lambda x: x ** 2, range(1000)))

    # Compare performance
    time_lc = timeit.timeit(list_comprehension, number=10000)
    time_map = timeit.timeit(map_function, number=10000)
    print(f"List comprehension: {time_lc:.4f}s")
    print(f"Map function: {time_map:.4f}s")
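A single timeit.timeit call can be skewed by whatever else the machine is doing at that moment. One common way to steady the numbers, sketched here, is to repeat the measurement and keep the minimum. The helper name best_time and the repeat counts are illustrative assumptions, not part of the example above.

```python
import timeit


def list_comprehension():
    return [x ** 2 for x in range(1000)]


def map_function():
    return list(map(lambda x: x ** 2, range(1000)))


def best_time(func, number=200, repeat=5):
    """Return the fastest total time in seconds over `repeat` runs of `number` calls.

    best_time is a hypothetical helper name chosen for this sketch.
    """
    # timeit.repeat returns one total per run; the minimum is the run that
    # was least disturbed by background activity.
    return min(timeit.repeat(func, number=number, repeat=repeat))


for func in (list_comprehension, map_function):
    print(f"{func.__name__}: {best_time(func):.4f}s")
```

Taking the minimum rather than the mean is a judgment call: it estimates how fast the code can run, not how fast it typically runs under load.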

    # 2. cProfile for function profiling
    def process_data():
        data = []
        for i in range(100000):
            data.append(i ** 2)
        return sum(data)

    profiler = cProfile.Profile()
    profiler.enable()
    result = process_data()
    profiler.disable()

    stats = pstats.Stats(profiler)
    stats.sort_stats('cumulative')
    stats.print_stats(10)  # Print the first 10 entries

    # 3. Line profiling (requires the line_profiler package)
    # Add the @profile decorator manually when running under line_profiler
    def slow_function():
        total = 0
        for i in range(1000000):
            total += i ** 2
        return total

    # Run with: kernprof -l -v script.py

    # 4. Memory profiling (requires the memory_profiler package)
    from memory_profiler import profile

    @profile
    def memory_intensive():
        large_list = [i for i in range(1000000)]
        large_dict = {i: i ** 2 for i in range(1000000)}
        return len(large_list) + len(large_dict)

    # Run with: python -m memory_profiler script.py

2. Algorithm & Data Structure Optimization

- Choosing efficient data structures
- Time complexity analysis
- Generator expressions vs lists
- Set operations for lookups
- Deque for queue operations
- Bisect for sorted lists

Code Example:

    import bisect
    from collections import deque, Counter, defaultdict
    import time

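    # 1. List vs set for membership testing
    # Bad: O(n) lookup (linear search)
    def find_in_list(items, target):
        return target in items

    # Good: O(1) average-case lookup, provided the set is built once and
    # reused. Building it inside the function on every call costs O(n)
    # and cancels the benefit.
    items = list(range(100000))
    items_set = set(items)

    def find_in_set(items_set, target):
        return target in items_set

    # Illustrative timings: list 0.001s, set 0.000001s (about 1000x faster).
    # Exact figures depend on the machine and on where the target sits in the list.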
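To check the list-versus-set claim on your own machine, a short timeit comparison is enough. This is a minimal sketch: the worst-case target (the last element) and the call count of 200 are assumptions chosen to keep the run short.

```python
import timeit

items = list(range(100_000))
items_set = set(items)   # built once, outside the timed code
target = items[-1]       # worst case for the list: the last element

# Each lambda performs one membership test; timeit calls it 200 times.
list_time = timeit.timeit(lambda: target in items, number=200)
set_time = timeit.timeit(lambda: target in items_set, number=200)

print(f"list: {list_time:.6f}s  set: {set_time:.6f}s")
```

If the set had to be rebuilt before every lookup, the conversion cost would dominate and the list would usually win, which is why the set is created outside the timed code.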

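    # 2. Generator expressions for memory efficiency
    # Bad: builds the entire list in memory
    # (about 8 MB for the list's pointer array alone on 64-bit CPython, plus the int objects)
    squares_list = [x ** 2 for x in range(1000000)]

    # Good: produces values on demand
    # (roughly 100-200 bytes, regardless of how many values it will yield)
    squares_gen = (x ** 2 for x in range(1000000))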
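The size difference can be measured directly with sys.getsizeof. The sketch below assumes CPython; getsizeof reports only the container itself, so the list figure excludes the integer objects it points to.

```python
import sys

n = 1_000_000
squares_list = [x ** 2 for x in range(n)]
squares_gen = (x ** 2 for x in range(n))

# The list holds one pointer per element; the generator holds only its
# paused execution state, whatever n is.
list_bytes = sys.getsizeof(squares_list)
gen_bytes = sys.getsizeof(squares_gen)
print(f"list:      {list_bytes:,} bytes")
print(f"generator: {gen_bytes:,} bytes")

# A generator can feed any consumer that iterates once, such as sum().
total = sum(squares_gen)
```

The trade-off is that a generator can be consumed only once and does not support indexing or len(), so a list is still the right choice when the values are needed repeatedly.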

    # 3. Deque for efficient queue operations
    # Bad: O(n) pop from the beginning, because every remaining element shifts
    queue_list = list(range(10000))
    queue_list.pop(0)  # Slow

    # Good: O(1) pop from both ends
    queue_deque = deque(range(10000))
    queue_deque.popleft()  # Fast

    # 4. Bisect for maintaining sorted lists
    # Bad: re-sorts the whole list after every append
    sorted_list = []
    for i in [5, 2, 8, 1, 9]:
        sorted_list.append(i)
        sorted_list.sort()

    # Good: O(log n) search for the insertion point. The insert itself
    # still shifts elements, so it is O(n), but it avoids the full re-sort.
    sorted_list = []
    for i in [5, 2, 8, 1, 9]:
        bisect.insort(sorted_list, i)
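Once a list is kept sorted, the same module gives O(log n) lookups. The sketch below shows one way to do this with bisect.bisect_left; contains_sorted is a hypothetical helper name introduced for illustration.

```python
import bisect


def contains_sorted(sorted_items, target):
    """Binary-search membership test on an already sorted list."""
    # bisect_left returns the leftmost position where target could be inserted.
    index = bisect.bisect_left(sorted_items, target)
    return index < len(sorted_items) and sorted_items[index] == target


sorted_list = []
for value in [5, 2, 8, 1, 9]:
    bisect.insort(sorted_list, value)

print(sorted_list)                      # [1, 2, 5, 8, 9]
print(contains_sorted(sorted_list, 8))  # True
print(contains_sorted(sorted_list, 3))  # False
```

This only pays off when the list is already sorted and stays sorted; for pure membership tests on unordered data, a set is simpler.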

    # 5. Counter for frequency counting
    # Sample input, assumed here for illustration
    words = ["apple", "banana", "apple", "cherry", "banana", "apple"]

    # Bad: manual counting
    word_count = {}
    for word in words:
        if word in word_count:
            word_count[word] += 1
        else:
            word_count[word] = 1

    # Good: Counter
    word_count = Counter(words)
    most_common = word_count.most_common(10)

3. Memory Management

- Memory allocation and garbage collection
- Object pooling
- Slots for memory-efficient classes
- Reference counting
- Weak references
- Memory leak detection

Code Example:

    import gc
    import sys
    from weakref import WeakValueDictionary

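    # 1. __slots__ for memory-efficient classes
    # Bad: regular class; every instance also carries a __dict__
    class RegularPoint:
        def __init__(self, x, y):
            self.x = x
            self.y = y

    # Good: slotted class; attributes live in fixed slots, with no per-instance __dict__
    class SlottedPoint:
        __slots__ = ['x', 'y']

        def __init__(self, x, y):
            self.x = x
            self.y = y

    # Reported figures: 56 bytes (regular) vs 32 bytes (slotted), about 43% smaller.
    # Exact sizes vary by Python version and build. sys.getsizeof also excludes
    # the regular instance's __dict__, so the real gap is usually larger.
    print(sys.getsizeof(RegularPoint(1, 2)))  # e.g. 56 bytes
    print(sys.getsizeof(SlottedPoint(1, 2)))  # e.g. 32 bytes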
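Because sys.getsizeof does not follow references, a fairer comparison adds the size of the instance __dict__ when one exists. The following is a rough sketch under that assumption; footprint is a hypothetical helper, and it still ignores the attribute values themselves.

```python
import sys


class RegularPoint:
    def __init__(self, x, y):
        self.x = x
        self.y = y


class SlottedPoint:
    __slots__ = ("x", "y")

    def __init__(self, x, y):
        self.x = x
        self.y = y


def footprint(obj):
    """Size of the instance plus its attribute dict, if it has one."""
    size = sys.getsizeof(obj)
    if hasattr(obj, "__dict__"):
        size += sys.getsizeof(obj.__dict__)
    return size


print("regular:", footprint(RegularPoint(1, 2)), "bytes")
print("slotted:", footprint(SlottedPoint(1, 2)), "bytes")

# The cost of __slots__: attributes that are not declared cannot be added.
point = SlottedPoint(1, 2)
try:
    point.z = 3
except AttributeError as exc:
    print("cannot add attribute:", exc)
```

The saving matters mainly when a program holds very large numbers of small objects; for a handful of instances the loss of dynamic attributes is rarely worth it.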

    # 2. Object pooling for expensive objects
    class ObjectPool:
        def __init__(self, factory, max_size=10):
            self.factory = factory
            self.max_size = max_size
            self.pool = []

        def acquire(self):
            # Reuse a pooled object if one is available, otherwise create one
            if self.pool:
                return self.pool.pop()
            return self.factory()

        def release(self, obj):
            # Objects returned while the pool is full are simply dropped
            if len(self.pool) < self.max_size:
                self.pool.append(obj)

    # Usage (DatabaseConnection stands in for any class that is expensive to create)
    db_pool = ObjectPool(lambda: DatabaseConnection(), max_size=5)
    conn = db_pool.acquire()
    # Use connection
    db_pool.release(conn)

    # 3. Weak references to prevent memory leaks
    class Cache:
        def __init__(self):
            # An entry disappears once nothing else holds a strong reference to its value
            self._cache = WeakValueDictionary()

        def get(self, key):
            return self._cache.get(key)

        def set(self, key, value):
            self._cache[key] = value
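The behaviour of a weak-value cache is easiest to see by dropping the last strong reference and looking the key up again. In this sketch, Report is a hypothetical payload class introduced for illustration; plain int and str values cannot be weakly referenced, so a user-defined class is needed.

```python
import gc
from weakref import WeakValueDictionary


class Cache:
    def __init__(self):
        self._cache = WeakValueDictionary()

    def get(self, key):
        return self._cache.get(key)

    def set(self, key, value):
        self._cache[key] = value


class Report:
    """Hypothetical payload class used only for this demonstration."""

    def __init__(self, name):
        self.name = name


cache = Cache()
report = Report("q1")
cache.set("q1", report)
alive_hit = cache.get("q1") is report
print(alive_hit)            # True while `report` is alive

del report                  # drop the last strong reference
gc.collect()                # not needed on CPython, harmless elsewhere
print(cache.get("q1"))      # None
```

The cache therefore never keeps an object alive on its own, which prevents it from growing without bound but also means it cannot be the only owner of a value.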

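    # 4. Manual garbage collection for large operations
    # large_data and process_batch are placeholders for your own data source and per-batch work
    def process_large_dataset():
        for batch in large_data:
            process_batch(batch)
            # Force garbage collection after each batch. This only helps when
            # batches leave reference cycles behind; objects without cycles
            # are already freed by reference counting.
            gc.collect()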

    # 5. Context managers for resource cleanup
    # allocate_resource is a placeholder for whatever acquires the resource
    class ManagedResource:
        def __enter__(self):
            self.resource = allocate_resource()
            return self.resource

        def __exit__(self, exc_type, exc_val, exc_tb):
            self.resource.cleanup()
            return False  # Do not suppress exceptions

4. High-Performance Computing

- NumPy vectorization
- Numba JIT compilation
- Cython for C extensions
- Multiprocessing for parallelism
- concurrent.futures
- Performance comparison

Code Example:

    import numpy as np
    from numba import jit
    import multiprocessing as mp
    from concurrent.futures import ProcessPoolExecutor

    # 1. NumPy vectorization
    # Bad: Python loop (slow)
    def python_sum(n):
        total = 0
        for i in range(n):
            total += i ** 2
        return total

    # Good: NumPy vectorization (about 75x faster in the benchmark below)
    def numpy_sum(n):
        # int64 avoids overflow on platforms where the default integer is 32-bit
        arr = np.arange(n, dtype=np.int64)
        return np.sum(arr ** 2)

    # Illustrative benchmark (machine-dependent):
    #   python_sum(1000000) = 0.15s
    #   numpy_sum(1000000)  = 0.002s

    # 2. Numba JIT compilation
    @jit(nopython=True)  # Compile to machine code
    def fast_function(n):
        total = 0
        for i in range(n):
            total += i ** 2
        return total

    # First call: compilation + execution
    # Subsequent calls: around 50x faster than pure Python (workload-dependent)

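    # 3. Multiprocessing for CPU-bound tasks
    def cpu_intensive_task(bounds):
        start, stop = bounds
        return sum(i * i for i in range(start, stop))

    if __name__ == "__main__":  # Required where worker processes are spawned
        # Single process
        result = cpu_intensive_task((0, 10000000))

        # Multiple processes: split the same range into four chunks
        with ProcessPoolExecutor(max_workers=4) as executor:
            ranges = [(0, 2500000), (2500000, 5000000),
                      (5000000, 7500000), (7500000, 10000000)]
            results = executor.map(cpu_intensive_task, ranges)
            total = sum(results)  # Same value as result

    # Up to about 4x speedup on 4 cores, minus process start-up and pickling overhead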

    # 4. Caching for expensive computations
    from functools import lru_cache

    @lru_cache(maxsize=128)
    def fibonacci(n):
        if n < 2:
            return n
        return fibonacci(n - 1) + fibonacci(n - 2)

    # fibonacci(100) without cache: exponential time, effectively never finishes
    # fibonacci(100) with cache: instant, because each value is computed only once
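To confirm that a cache is actually being hit, lru_cache exposes cache_info() and cache_clear() on the decorated function. A minimal sketch, using a smaller argument than the example above so the counts are easy to follow:

```python
from functools import lru_cache


@lru_cache(maxsize=128)
def fibonacci(n):
    if n < 2:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)


print(fibonacci(30))           # 832040

# Each of the 31 distinct arguments (0 through 30) is computed once (a miss);
# every other recursive call is answered from the cache (a hit).
info = fibonacci.cache_info()
print(info.hits, info.misses)  # 28 31

fibonacci.cache_clear()        # empty the cache, e.g. between benchmark runs
```

A low hit count relative to misses suggests that the arguments rarely repeat or that maxsize is too small, in which case the cache adds overhead without saving work.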

    # 5. Memory views for zero-copy operations
    def process_array(data):
        # Bad: slicing bytes or a bytearray creates a copy
        subset = data[1000:2000]

        # Good: zero-copy view (data must support the buffer protocol,
        # for example bytes, bytearray, or array.array)
        view = memoryview(data)[1000:2000]
        return view

Hands-On Practice

Project 1: Performance Profiler

Build a comprehensive profiling tool.

Requirements:

- CPU profiling with cProfile
- Memory profiling
- Line-by-line analysis
- Visualization (flame graphs)
- HTML report generation
- Bottleneck identification

Key Skills: Profiling tools, visualization, analysis

Project 2: Data Processing Pipeline

Optimize a data processing pipeline.

Requirements:

- Load large CSV files (1 GB+)
- Transform and clean data
- Aggregate statistics
- Compare Python/NumPy/Pandas approaches
- Measure memory usage
- Optimize to under 2 GB of RAM

Key Skills: NumPy, memory optimization, benchmarking

Project 3: Parallel Computing

Implement parallel algorithms.

Requirements:

- Matrix multiplication
- Image processing
- Monte Carlo simulation
- Compare threading/multiprocessing/asyncio
- Measure speedup
- Handle shared state

Key Skills: Parallelism, performance measurement

Assessment Criteria

- Profile code to identify bottlenecks
- Choose appropriate data structures
- Optimize algorithms for time complexity
- Manage memory efficiently
- Use vectorization where applicable
- Implement effective caching
- Parallelize CPU-bound operations

Resources

Official Documentation

- Python Performance Tips - Official tips
- NumPy Docs - NumPy documentation
- Numba Docs - JIT compilation

Learning Platforms

- High Performance Python - O'Reilly book
- Python Performance - Real Python guide
- Optimizing Python - PyCon talks

Tools

- cProfile - CPU profiling
- memory_profiler - Memory profiling
- py-spy - Sampling profiler
- Scalene - CPU/GPU/memory profiler

Next Steps

After mastering Python performance, explore:

- Cython - C extensions for Python
- PyPy - Alternative Python interpreter
- Dask - Parallel computing library
- CUDA - GPU programming with Python
