Computer Vision Pipeline

Expert in building production-ready computer vision systems for object detection, tracking, and video analysis.

When to Use

✅ Use for:

Drone footage analysis (archaeological surveys, conservation) Wildlife monitoring and tracking Real-time object detection systems Video preprocessing and analysis Custom model training and inference Multi-object tracking (MOT)

❌ NOT for:

Simple image filters (use Pillow/PIL) Photo editing (use Photoshop/GIMP) Face recognition APIs (use AWS Rekognition) Basic OCR (use Tesseract) Technology Selection Object Detection Models Model Speed (FPS) Accuracy (mAP) Use Case YOLOv8 140 53.9% Real-time detection Detectron2 25 58.7% High accuracy, research EfficientDet 35 55.1% Mobile deployment Faster R-CNN 10 42.0% Legacy systems

Timeline:

2015: Faster R-CNN (two-stage detection) 2016: YOLO v1 (one-stage, real-time) 2020: YOLOv5 (PyTorch, production-ready) 2023: YOLOv8 (state-of-the-art) 2024: YOLOv8 is industry standard for real-time

Decision tree:

Need real-time (>30 FPS)? → YOLOv8 Need highest accuracy? → Detectron2 Mask R-CNN Need mobile deployment? → YOLOv8-nano or EfficientDet Need instance segmentation? → Detectron2 or YOLOv8-seg Need custom objects? → Fine-tune YOLOv8

Common Anti-Patterns Anti-Pattern 1: Not Preprocessing Frames Before Detection

Novice thinking: "Just run detection on raw video frames"

Problem: Poor detection accuracy, wasted GPU cycles.

Wrong approach:

❌ No preprocessing - poor results

import cv2 from ultralytics import YOLO

model = YOLO('yolov8n.pt') video = cv2.VideoCapture('drone_footage.mp4')

while True: ret, frame = video.read() if not ret: break

# Raw frame detection - no normalization, no resizing
results = model(frame)
# Poor accuracy, slow inference

Why wrong:

Video resolution too high (4K = 8.3 megapixels per frame) No normalization (pixel values 0-255 instead of 0-1) Aspect ratio not maintained GPU memory overflow on high-res frames

Correct approach:

✅ Proper preprocessing pipeline

import cv2 import numpy as np from ultralytics import YOLO

model = YOLO('yolov8n.pt') video = cv2.VideoCapture('drone_footage.mp4')

Model expects 640x640 input

TARGET_SIZE = 640

def preprocess_frame(frame): # Resize while maintaining aspect ratio h, w = frame.shape[:2] scale = TARGET_SIZE / max(h, w) new_w, new_h = int(w * scale), int(h * scale)

resized = cv2.resize(frame, (new_w, new_h), interpolation=cv2.INTER_LINEAR)

# Pad to square
pad_w = (TARGET_SIZE - new_w) // 2
pad_h = (TARGET_SIZE - new_h) // 2

padded = cv2.copyMakeBorder(
    resized,
    pad_h, TARGET_SIZE - new_h - pad_h,
    pad_w, TARGET_SIZE - new_w - pad_w,
    cv2.BORDER_CONSTANT,
    value=(114, 114, 114)  # Gray padding
)

# Normalize to 0-1 (if model expects it)
# normalized = padded.astype(np.float32) / 255.0

return padded, scale

while True: ret, frame = video.read() if not ret: break

preprocessed, scale = preprocess_frame(frame)
results = model(preprocessed)

# Scale bounding boxes back to original coordinates
for box in results[0].boxes:
    x1, y1, x2, y2 = box.xyxy[0]
    x1, y1, x2, y2 = x1/scale, y1/scale, x2/scale, y2/scale

Performance comparison:

Raw 4K frames: 5 FPS, 72% mAP Preprocessed 640x640: 45 FPS, 89% mAP

Timeline context:

2015: Manual preprocessing required 2020: YOLOv5 added auto-resize 2023: YOLOv8 has smart preprocessing but explicit control is better Anti-Pattern 2: Processing Every Frame in Video

Novice thinking: "Run detection on every single frame"

Problem: 99% of frames are redundant, wasting compute.

Wrong approach:

❌ Process every frame (30 FPS video = 1800 frames/min)

import cv2 from ultralytics import YOLO

model = YOLO('yolov8n.pt') video = cv2.VideoCapture('drone_footage.mp4')

detections = []

while True: ret, frame = video.read() if not ret: break

# Run detection on EVERY frame
results = model(frame)
detections.append(results)

10-minute video = 18,000 inferences (15 minutes on GPU)

Why wrong:

Adjacent frames are nearly identical Wasting 95% of compute on duplicate work Slow processing time Massive storage for results

Correct approach 1: Frame sampling

✅ Sample every Nth frame

import cv2 from ultralytics import YOLO

model = YOLO('yolov8n.pt') video = cv2.VideoCapture('drone_footage.mp4')

SAMPLE_RATE = 30 # Process 1 frame per second (if 30 FPS video)

frame_count = 0 detections = []

while True: ret, frame = video.read() if not ret: break

frame_count += 1

# Only process every 30th frame
if frame_count % SAMPLE_RATE == 0:
    results = model(frame)
    detections.append({
        'frame': frame_count,
        'timestamp': frame_count / 30.0,
        'results': results
    })

10-minute video = 600 inferences (30 seconds on GPU)

Correct approach 2: Adaptive sampling with scene change detection

✅ Only process when scene changes significantly

import cv2 import numpy as np from ultralytics import YOLO

model = YOLO('yolov8n.pt') video = cv2.VideoCapture('drone_footage.mp4')

def scene_changed(prev_frame, curr_frame, threshold=0.3): """Detect scene change using histogram comparison""" if prev_frame is None: return True

# Convert to grayscale
prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
curr_gray = cv2.cvtColor(curr_frame, cv2.COLOR_BGR2GRAY)

# Calculate histograms
prev_hist = cv2.calcHist([prev_gray], [0], None, [256], [0, 256])
curr_hist = cv2.calcHist([curr_gray], [0], None, [256], [0, 256])

# Compare histograms
correlation = cv2.compareHist(prev_hist, curr_hist, cv2.HISTCMP_CORREL)

return correlation < (1 - threshold)

prev_frame = None detections = []

while True: ret, frame = video.read() if not ret: break

# Only run detection if scene changed
if scene_changed(prev_frame, frame):
    results = model(frame)
    detections.append(results)

prev_frame = frame.copy()

Adapts to video content - static shots skip frames, action scenes process more

Savings:

Every frame: 18,000 inferences Sample 1 FPS: 600 inferences (97% reduction) Adaptive: ~1,200 inferences (93% reduction) Anti-Pattern 3: Not Using Batch Inference

Novice thinking: "Process one image at a time"

Problem: GPU sits idle 80% of the time waiting for data.

Wrong approach:

❌ Sequential processing - GPU underutilized

import cv2 from ultralytics import YOLO import time

model = YOLO('yolov8n.pt')

100 images to process

image_paths = [f'frame_{i:04d}.jpg' for i in range(100)]

start = time.time()

for path in image_paths: frame = cv2.imread(path) results = model(frame) # Process one at a time # GPU utilization: ~20%

elapsed = time.time() - start print(f"Processed {len(image_paths)} images in {elapsed:.2f}s")

Output: 45 seconds

Why wrong:

GPU has to wait for CPU to load each image No parallelization GPU utilization ~20% Slow throughput

Correct approach:

✅ Batch inference - GPU fully utilized

import cv2 from ultralytics import YOLO import time

model = YOLO('yolov8n.pt')

image_paths = [f'frame_{i:04d}.jpg' for i in range(100)]

BATCH_SIZE = 16 # Process 16 images at once

start = time.time()

for i in range(0, len(image_paths), BATCH_SIZE): batch_paths = image_paths[i:i+BATCH_SIZE]

# Load batch
frames = [cv2.imread(path) for path in batch_paths]

# Batch inference (single GPU call)
results = model(frames)  # Pass list of images
# GPU utilization: ~85%

elapsed = time.time() - start print(f"Processed {len(image_paths)} images in {elapsed:.2f}s")

Output: 8 seconds (5.6x faster!)

Performance comparison:

Method Time (100 images) GPU Util Throughput Sequential 45s 20% 2.2 img/s Batch (16) 8s 85% 12.5 img/s Batch (32) 6s 92% 16.7 img/s

Batch size tuning:

Find optimal batch size for your GPU

import torch

def find_optimal_batch_size(model, image_size=(640, 640)): for batch_size in [1, 2, 4, 8, 16, 32, 64]: try: dummy_input = torch.randn(batch_size, 3, *image_size).cuda()

        start = time.time()
        with torch.no_grad():
            _ = model(dummy_input)
        elapsed = time.time() - start

        throughput = batch_size / elapsed
        print(f"Batch {batch_size}: {throughput:.1f} img/s")
    except RuntimeError as e:
        print(f"Batch {batch_size}: OOM (out of memory)")
        break

Find optimal batch size before production

find_optimal_batch_size(model)

Anti-Pattern 4: Ignoring Non-Maximum Suppression (NMS) Tuning

Problem: Duplicate detections, missed objects, slow post-processing.

Wrong approach:

❌ Use default NMS settings for everything

from ultralytics import YOLO

model = YOLO('yolov8n.pt')

Default settings (iou_threshold=0.45, conf_threshold=0.25)

results = model('crowded_scene.jpg')

Result: 50 bounding boxes, 30 are duplicates!

Why wrong:

Default IoU=0.45 is too permissive for dense objects Default conf=0.25 includes low-quality detections No adaptation to use case

Correct approach:

✅ Tune NMS for your use case

from ultralytics import YOLO

model = YOLO('yolov8n.pt')

Sparse objects (dolphins in ocean)

sparse_results = model( 'ocean_footage.jpg', iou=0.5, # Higher IoU = allow closer boxes conf=0.4 # Higher confidence = fewer false positives )

Dense objects (crowd, flock of birds)

dense_results = model( 'crowded_scene.jpg', iou=0.3, # Lower IoU = suppress more duplicates conf=0.5 # Higher confidence = filter noise )

High precision needed (legal evidence)

precise_results = model( 'evidence.jpg', iou=0.5, conf=0.7, # Very high confidence max_det=50 # Limit max detections )

NMS parameter guide:

Use Case IoU Conf Max Det Sparse objects (wildlife) 0.5 0.4 100 Dense objects (crowd) 0.3 0.5 300 High precision (evidence) 0.5 0.7 50 Real-time (speed priority) 0.45 0.3 100 Anti-Pattern 5: No Tracking Between Frames

Novice thinking: "Run detection on each frame independently"

Problem: Can't count unique objects, track movement, or build trajectories.

Wrong approach:

❌ Independent frame detection - no object identity

from ultralytics import YOLO import cv2

model = YOLO('yolov8n.pt') video = cv2.VideoCapture('dolphins.mp4')

detections = []

while True: ret, frame = video.read() if not ret: break

results = model(frame)
detections.append(results)

Result: Can't tell if frame 10 dolphin is same as frame 20 dolphin

Can't count unique dolphins

Can't track trajectories

Why wrong:

No object identity across frames Can't count unique objects Can't analyze movement patterns Can't build trajectories

Correct approach: Use tracking (ByteTrack)

✅ Multi-object tracking with ByteTrack

from ultralytics import YOLO import cv2

YOLO with tracking

model = YOLO('yolov8n.pt') video = cv2.VideoCapture('dolphins.mp4')

Track objects across frames

tracks = {}

while True: ret, frame = video.read() if not ret: break

# Run detection + tracking
results = model.track(
    frame,
    persist=True,     # Maintain IDs across frames
    tracker='bytetrack.yaml'  # ByteTrack algorithm
)

# Each detection now has persistent ID
for box in results[0].boxes:
    track_id = int(box.id[0])  # Unique ID across frames
    x1, y1, x2, y2 = box.xyxy[0]

    # Store trajectory
    if track_id not in tracks:
        tracks[track_id] = []

    tracks[track_id].append({
        'frame': len(tracks[track_id]),
        'bbox': (x1, y1, x2, y2),
        'conf': box.conf[0]
    })

Now we can analyze:

print(f"Unique dolphins detected: {len(tracks)}")

Trajectory analysis

for track_id, trajectory in tracks.items(): if len(trajectory) > 30: # Only long tracks print(f"Dolphin {track_id} appeared in {len(trajectory)} frames") # Calculate movement, speed, etc.

Tracking benefits:

Count unique objects (not just detections per frame) Build trajectories and movement patterns Analyze behavior over time Filter out brief false positives

Tracking algorithms:

Algorithm Speed Robustness Occlusion Handling ByteTrack Fast Good Excellent SORT Very Fast Fair Fair DeepSORT Medium Excellent Good BotSORT Medium Excellent Excellent Production Checklist □ Preprocess frames (resize, pad, normalize) □ Sample frames intelligently (1 FPS or scene change detection) □ Use batch inference (16-32 images per batch) □ Tune NMS thresholds for your use case □ Implement tracking if analyzing video □ Log inference time and GPU utilization □ Handle edge cases (empty frames, corrupted video) □ Save results in structured format (JSON, CSV) □ Visualize detections for debugging □ Benchmark on representative data

When to Use vs Avoid Scenario Appropriate? Analyze drone footage for archaeology ✅ Yes - custom object detection Track wildlife in video ✅ Yes - detection + tracking Count people in crowd ✅ Yes - dense object detection Real-time security camera ✅ Yes - YOLOv8 real-time Filter vacation photos ❌ No - use photo management apps Face recognition login ❌ No - use AWS Rekognition API Read license plates ❌ No - use specialized OCR References /references/yolo-guide.md - YOLOv8 setup, training, inference patterns /references/video-processing.md - Frame extraction, scene detection, optimization /references/tracking-algorithms.md - ByteTrack, SORT, DeepSORT comparison Scripts scripts/video_analyzer.py - Extract frames, run detection, generate timeline scripts/model_trainer.py - Fine-tune YOLO on custom dataset, export weights

computer-vision-pipeline

安装

❌ No preprocessing - poor results

✅ Proper preprocessing pipeline

Model expects 640x640 input

❌ Process every frame (30 FPS video = 1800 frames/min)

10-minute video = 18,000 inferences (15 minutes on GPU)

✅ Sample every Nth frame

10-minute video = 600 inferences (30 seconds on GPU)

✅ Only process when scene changes significantly

Adapts to video content - static shots skip frames, action scenes process more

❌ Sequential processing - GPU underutilized

100 images to process

Output: 45 seconds

✅ Batch inference - GPU fully utilized

Output: 8 seconds (5.6x faster!)

Find optimal batch size for your GPU

Find optimal batch size before production

❌ Use default NMS settings for everything

Default settings (iou_threshold=0.45, conf_threshold=0.25)

Result: 50 bounding boxes, 30 are duplicates!

✅ Tune NMS for your use case

Sparse objects (dolphins in ocean)

Dense objects (crowd, flock of birds)

High precision needed (legal evidence)

❌ Independent frame detection - no object identity

Result: Can't tell if frame 10 dolphin is same as frame 20 dolphin

Can't count unique dolphins

Can't track trajectories

✅ Multi-object tracking with ByteTrack

YOLO with tracking

Track objects across frames

Now we can analyze:

Trajectory analysis