System Design Generator

Create comprehensive system architecture plans from requirements.

System Design Document Template

System Design: [Feature/Product Name]

Overview

Brief description of what we're building and why.

Requirements

Functional

User can upload videos (max 1GB)
System processes video within 5 minutes
User receives notification when complete

Non-Functional

Handle 1000 uploads/day
99.9% uptime
Process videos in <5 minutes (p95)
Cost: <$0.50 per video

High-Level Architecture

┌─────────┐ ┌──────────┐ ┌─────────────┐ │ Client │─────▶│ API │─────▶│ Upload │ │ │ │ Gateway │ │ Service │ └─────────┘ └──────────┘ └─────────────┘ │ ▼ ┌─────────────┐ │ Storage │ │ (S3) │ └─────────────┘ │ ▼ ┌─────────────┐ │ Processing │◀─┐ │ Queue │ │ └─────────────┘ │ │ │ ▼ │ ┌─────────────┐ │ │ Processor │─┘ │ Workers │ └─────────────┘ │ ▼ ┌─────────────┐ │Notification │ │ Service │ └─────────────┘

Components

1. API Gateway

Responsibilities: - Authentication - Rate limiting - Request routing

Technology: Kong/AWS API Gateway Scaling: Auto-scale based on requests/sec

2. Upload Service

Responsibilities: - Generate pre-signed S3 URLs - Validate file metadata - Enqueue processing jobs

API:

POST /uploads Request: { filename, size, content_type } Response: { upload_url, upload_id }

Technology: Node.js + Express Scaling: Horizontal (stateless)

3. Storage (S3)

Responsibilities: - Store raw videos - Store processed outputs - Serve content via CDN

Structure:

/uploads/{user_id}/{upload_id}/original.mp4 /processed/{user_id}/{upload_id}/output.mp4

4. Processing Queue

Responsibilities: - Buffer processing jobs - Ensure at-least-once delivery - DLQ for failed jobs

Technology: AWS SQS Configuration: - Visibility timeout: 15 minutes - DLQ after 3 retries

5. Processor Workers

Responsibilities: - Transcode videos - Generate thumbnails - Update database

Technology: Python + FFmpeg Scaling: Auto-scale on queue depth

Data Flow

Upload Flow

Client requests upload URL from Upload Service
Upload Service generates pre-signed S3 URL
Client uploads directly to S3
Client notifies Upload Service of completion
Upload Service enqueues processing job
Returns upload_id to client

Processing Flow

Worker polls queue for jobs
Downloads video from S3
Processes video (transcode, thumbnail)
Uploads results to S3
Updates database status
Sends notification
Deletes message from queue

Data Model

```typescript interface Upload { id: string; user_id: string; filename: string; size: number; status: 'pending' | 'processing' | 'complete' | 'failed'; original_url: string; processed_url?: string; created_at: Date; processed_at?: Date; }

interface ProcessingJob { upload_id: string; attempts: number; error?: string; }

API Contract Upload Endpoints POST /uploads - Request upload URL GET /uploads/:id - Get upload status DELETE /uploads/:id - Cancel upload GET /uploads - List user uploads

Webhooks POST {webhook_url} { "event": "upload.completed", "upload_id": "...", "status": "complete", "processed_url": "..." }

Scaling Considerations Current Capacity 1000 uploads/day = ~1 per minute Single worker can process 1 video every 5 minutes Need 5 workers for current load 10x Scale (10,000/day) ~10 uploads per minute Need 50 workers Use spot instances for cost savings Add Redis cache for status checks 100x Scale (100,000/day) ~100 uploads per minute Partition by region Use Kafka instead of SQS Database sharding by user_id Failure Modes S3 Unavailable Impact: Uploads fail Mitigation: Multi-region S3 replication Queue Backed Up Impact: Processing delays Mitigation: Auto-scale workers faster Worker Crash During Processing Impact: Job retried Mitigation: Idempotent processing Cost Estimate

Monthly (1000 uploads/day):

S3 Storage: $50 S3 Transfer: $100 SQS: $10 Workers (EC2): $300 Database: $100 Total: ~$560/month Security Pre-signed URLs expire in 1 hour Videos in private S3 buckets CloudFront signed URLs for delivery Rate limiting per user Monitoring

Metrics:

Upload success rate Processing time (p50, p95, p99) Queue depth Worker CPU/memory Error rate by type

Alerts:

Queue depth >1000 Processing time p95 >10 minutes Error rate >5% Open Questions Video retention policy? (30 days? 1 year?) Maximum video duration? (affects processing time) Regional data residency requirements?

Component Template

```markdown

Component Name

Responsibilities: - Primary responsibility - Secondary responsibility

Technology Stack: - Language: [Python/Node/Go] - Framework: [Express/FastAPI/Gin] - Database: [PostgreSQL/MongoDB]

API/Interface: ```typescript interface ComponentAPI { method(params): ReturnType; }

Scaling Strategy:

Horizontal: Stateless, load balanced Vertical: Cache layer, connection pooling

Dependencies:

Service A (for X) Database B (for persistence)

Failure Handling:

Retry with exponential backoff Circuit breaker for downstream services Fallback to cached data

Best Practices

Start with requirements: Functional + non-functional
Draw diagrams first: Visual clarity
Define boundaries: What's in scope vs out
Document tradeoffs: Every choice has costs
Plan for failure: What breaks and how to handle
Consider scale: Current, 10x, 100x
Estimate costs: Build vs buy decisions
Leave open questions: Don't pretend to know everything

Output Checklist

[ ] Requirements documented (functional + non-functional)
[ ] High-level architecture diagram
[ ] Component breakdown (3-7 components)
[ ] Data flow documented
[ ] Data model defined
[ ] API contracts specified
[ ] Scaling considerations (1x, 10x, 100x)
[ ] Failure modes identified
[ ] Cost estimate provided
[ ] Security considerations
[ ] Monitoring plan

system-design-generator

安装