Infrastructure Documenter Skill
Overview
This skill helps you create clear, maintainable infrastructure documentation. It covers architecture diagrams, runbooks, system documentation, operational procedures, and documentation-as-code practices.
Documentation Philosophy
Principles
- Living documentation: Keep it in sync with reality
- Audience-aware: Different docs for different readers
- Actionable: Every doc should help someone do something
- Version-controlled: Documentation changes tracked with code
Document Types
| Type | Audience | Purpose |
|---|---|---|
| Architecture | Engineers | Understand system design |
| Runbooks | Ops/SRE | Handle incidents |
| API Docs | Developers | Integrate with system |
| Onboarding | New hires | Get up to speed |
| Decision Records | Future you | Understand why |
Architecture Documentation
System Architecture Overview
System Architecture
Overview
[Project Name] is a [type] application that [purpose].
High-Level Architecture
```
┌─────────────────────────────────────────────────────────────┐
│                            Users                            │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                         Vercel Edge                         │
│    ┌─────────────────┐               ┌─────────────────┐    │
│    │   Next.js App   │               │ Edge Functions  │    │
│    └─────────────────┘               └─────────────────┘    │
└─────────────────────────────────────────────────────────────┘
                               │
           ┌───────────────────┼───────────────────┐
           ▼                   ▼                   ▼
  ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
  │    Supabase     │ │      Redis      │ │     Stripe      │
  │  - PostgreSQL   │ │  - Session      │ │  - Payments     │
  │  - Auth         │ │  - Cache        │ │  - Webhooks     │
  │  - Realtime     │ │                 │ │                 │
  │  - Storage      │ │                 │ │                 │
  └─────────────────┘ └─────────────────┘ └─────────────────┘
```
Components
Frontend (Next.js App)
- Location: Vercel Edge Network
- Framework: Next.js 14 (App Router)
- Styling: Tailwind CSS + shadcn/ui
- State: Zustand + React Query
Backend Services
| Service | Provider | Purpose |
|---|---|---|
| Database | Supabase | PostgreSQL with RLS |
| Auth | Supabase Auth | User authentication |
| Storage | Supabase Storage | File uploads |
| Cache | Upstash Redis | Session & API cache |
| Payments | Stripe | Subscriptions |
| Email | Resend | Transactional emails |
Data Flow
- User request → Vercel Edge
- SSR/API Route processes request
- Database queries via Supabase client
- Response cached at edge (when applicable)
- Response returned to user
Security
Authentication Flow
- User signs in via Supabase Auth
- JWT token issued and stored in cookie
- Server validates token on each request
- RLS policies enforce data access
Data Protection
- All data encrypted at rest (AES-256)
- TLS 1.3 for data in transit
- Secrets stored in Vercel environment
- PII fields encrypted in database
Mermaid Diagrams
Request Flow
```mermaid
sequenceDiagram
    participant U as User
    participant V as Vercel
    participant N as Next.js
    participant S as Supabase
    participant R as Redis

    U->>V: HTTPS Request
    V->>N: Route to App

    alt Cached Response
        N->>R: Check Cache
        R-->>N: Cache Hit
        N-->>U: Return Cached
    else Cache Miss
        N->>S: Query Database
        S-->>N: Data
        N->>R: Store in Cache
        N-->>U: Return Response
    end
```
Database Schema
```mermaid
erDiagram
    users ||--o{ projects : owns
    users {
        uuid id PK
        text email
        text name
        timestamp created_at
    }
    projects ||--o{ tasks : contains
    projects {
        uuid id PK
        uuid user_id FK
        text name
        text status
    }
    tasks {
        uuid id PK
        uuid project_id FK
        text title
        boolean completed
    }
```
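The Request Flow diagram above describes a cache-aside read path: check the cache, fall back to the database on a miss, then store the result. A minimal sketch of that logic follows, assuming an in-memory `TTLCache` as a stand-in for Redis and a hypothetical `query_database` callable in place of the Supabase client; neither name comes from the stack described here.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    """In-memory stand-in for Redis (assumption); entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Optional[Any]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired entries count as a miss
            return None
        return value

    def set(self, key: str, value: Any) -> None:
        self._store[key] = (self._clock() + self._ttl, value)


def handle_request(key: str, cache: TTLCache,
                   query_database: Callable[[str], Any]) -> Tuple[Any, str]:
    """Cache-aside read: return (data, "hit") or (data, "miss")."""
    cached = cache.get(key)
    if cached is not None:
        return cached, "hit"
    data = query_database(key)  # hypothetical database call
    cache.set(key, data)        # store so the next request is a hit
    return data, "miss"
```

Injecting the clock and the database callable keeps the sketch testable without a running Redis or database.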
Runbooks
Runbook Template
Runbook: [Service Name] - [Issue Type]
Overview
Brief description of the issue and when this runbook applies.
Severity
- P1 (Critical): Complete outage
- P2 (High): Degraded service
- P3 (Medium): Minor impact
- P4 (Low): No user impact
Detection
How this issue is typically detected:
- [ ] Alert from [monitoring system]
- [ ] User report
- [ ] Automated check failure
Impact Assessment
- Users affected: All / Segment / None
- Data at risk: Yes / No
- Revenue impact: High / Medium / Low / None
Prerequisites
- [ ] Access to [system/dashboard]
- [ ] Credentials for [service]
- [ ] Contact info for [team/person]
Resolution Steps
Step 1: Verify the Issue
```bash
# Check service status
curl -I https://api.example.com/health

# Check logs
vercel logs --follow
```
Step 2: Identify Root Cause
Common causes:
- Database connection pool exhausted
- Memory limit reached
- External service down
- Bad deployment
Step 3: Apply Fix
If Database Issue:
```sql
-- Check connection count
SELECT count(*) FROM pg_stat_activity;

-- Kill connections that have been idle for more than an hour
-- (state_change records when the session entered its current state)
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
  AND state_change < now() - interval '1 hour';
```
If Bad Deployment:
```bash
# Rollback to previous deployment
vercel rollback
```
Step 4: Verify Fix
```bash
# Check service health
curl https://api.example.com/health

# Monitor error rates for 15 minutes
```
Escalation
If unable to resolve within 30 minutes:
- Page on-call engineer: [contact]
- Notify stakeholders in #incidents
- Update status page
Post-Incident
- Create incident report
- Schedule post-mortem (P1/P2 only)
- Update this runbook if needed
Related Links
- Dashboard
- Logs
- Metrics
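Step 4 of the template asks the responder to confirm health and then watch the service for 15 minutes. One way to make that repeatable is a small polling helper. The sketch below is an illustration only: `verify_fix` and `http_health_check` are hypothetical names, and the URL in the usage note is the placeholder from the template.

```python
import time
import urllib.request
from typing import Callable


def http_health_check(url: str, timeout: float = 5.0) -> bool:
    """Return True when the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError, HTTPError, timeouts and connection errors are all OSError subclasses
        return False


def verify_fix(check_health: Callable[[], bool], attempts: int,
               interval_seconds: float,
               sleep: Callable[[float], None] = time.sleep) -> bool:
    """Return True only if every health check in the observation window passes."""
    for attempt in range(attempts):
        if not check_health():
            return False  # fail fast so the responder can escalate
        if attempt < attempts - 1:
            sleep(interval_seconds)
    return True
```

For example, `verify_fix(lambda: http_health_check("https://api.example.com/health"), attempts=4, interval_seconds=300)` checks at 0, 5, 10, and 15 minutes.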
Database Runbooks
Runbook: Database Performance Issues
Symptoms
- Slow API responses (>1s)
- Timeout errors in logs
- High database CPU in dashboard
Quick Checks
1. Check Active Connections
```sql
SELECT state, count(*), max(now() - query_start) AS max_duration
FROM pg_stat_activity
GROUP BY state;
```
2. Find Long-Running Queries
```sql
SELECT pid, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - query_start > interval '30 seconds'
ORDER BY duration DESC;
```
3. Check Table Sizes
```sql
SELECT schemaname, tablename,
       pg_size_pretty(pg_total_relation_size(schemaname || '.' || tablename)) AS size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname || '.' || tablename) DESC
LIMIT 10;
```
4. Check Missing Indexes
```sql
SELECT relname, seq_scan, idx_scan, seq_scan - idx_scan AS difference
FROM pg_stat_user_tables
WHERE seq_scan > idx_scan
ORDER BY difference DESC;
```
Resolution
Kill Problematic Queries
```sql
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE pid = [PID_FROM_ABOVE];
```
Add Missing Index
```sql
CREATE INDEX CONCURRENTLY idx_table_column
ON table_name (column_name);
```
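The missing-index check ranks tables by how much more often they are scanned sequentially than through an index. The same heuristic can be sketched in Python for rows already fetched from `pg_stat_user_tables`. The `TableStats` type, the function name, and the sample values in the usage note are hypothetical. Unlike the query as written, this sketch treats a missing `idx_scan` value as zero, which is an assumption about how tables without index statistics should be ranked.

```python
from typing import List, NamedTuple, Optional, Tuple


class TableStats(NamedTuple):
    relname: str
    seq_scan: int
    idx_scan: Optional[int]  # None when the database reports no index scans at all


def index_candidates(stats: List[TableStats]) -> List[Tuple[str, int]]:
    """Return (table, seq_scan - idx_scan) pairs, largest difference first."""
    candidates = []
    for row in stats:
        idx_scans = row.idx_scan or 0  # assumption: treat missing as zero
        if row.seq_scan > idx_scans:
            candidates.append((row.relname, row.seq_scan - idx_scans))
    return sorted(candidates, key=lambda item: item[1], reverse=True)
```

With made-up rows such as `TableStats("tasks", 900, 100)` and `TableStats("users", 10, 500)`, only `tasks` is returned as a candidate.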
Decision Records (ADRs)
ADR Template
ADR-001: Choose Supabase for Database
Status
Accepted
Context
We need a database solution for [Project Name] that supports:
- PostgreSQL compatibility
- Real-time subscriptions
- Built-in authentication
- Easy local development
- Generous free tier
Decision
We will use Supabase as our primary database and auth provider.
Alternatives Considered
PlanetScale
Pros:
- Excellent scaling
- Branching for schema changes
- MySQL compatible
Cons:
- No built-in auth
- No real-time subscriptions
- Additional services needed
Firebase
Pros:
- Real-time built-in
- Mature platform
- Good mobile SDKs
Cons:
- NoSQL (not ideal for our use case)
- Vendor lock-in concerns
- Complex security rules
Consequences
Positive
- Single provider for DB + Auth + Storage
- Great developer experience
- Row Level Security for data protection
- Local development with supabase CLI
Negative
- PostgreSQL-specific features tie us to provider
- Supabase still maturing (some rough edges)
- Limited to their managed offering
Risks
- Supabase scaling limitations at high traffic
- Migration cost if we need to move
References
API Documentation
Endpoint Documentation
API Reference
Base URL
- Production: https://api.example.com/v1
- Staging: https://staging-api.example.com/v1
Authentication
All API requests require authentication via Bearer token.
```bash
curl -H "Authorization: Bearer YOUR_TOKEN" \
  https://api.example.com/v1/users
```
Endpoints
Users
Get Current User
GET /users/me
Response:
```json
{
  "id": "usr_123",
  "email": "user@example.com",
  "name": "John Doe",
  "created_at": "2024-01-01T00:00:00Z"
}
```
Update User
PATCH /users/me
Request Body:
| Field | Type | Required | Description |
|---|---|---|---|
| name | string | No | Display name |
| avatar_url | string | No | Profile image URL |
Example:
```bash
curl -X PATCH \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"name": "Jane Doe"}' \
  https://api.example.com/v1/users/me
```
Error Responses
| Status | Code | Description |
|---|---|---|
| 400 | BAD_REQUEST | Invalid request body |
| 401 | UNAUTHORIZED | Missing or invalid token |
| 403 | FORBIDDEN | Insufficient permissions |
| 404 | NOT_FOUND | Resource not found |
| 429 | RATE_LIMITED | Too many requests |
| 500 | INTERNAL_ERROR | Server error |
Error Response Format:
```json
{
  "error": {
    "code": "NOT_FOUND",
    "message": "User not found"
  }
}
```
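The error table and the error envelope above can be tied together in one helper so that documented codes and actual responses do not drift apart. This is a minimal sketch, not the API's real implementation: `ERROR_CODES` mirrors the table, while `error_response` and its fallback to `INTERNAL_ERROR` for unlisted statuses are assumptions.

```python
import json
from typing import Dict, Tuple

# Status-to-code mapping copied from the Error Responses table
ERROR_CODES: Dict[int, str] = {
    400: "BAD_REQUEST",
    401: "UNAUTHORIZED",
    403: "FORBIDDEN",
    404: "NOT_FOUND",
    429: "RATE_LIMITED",
    500: "INTERNAL_ERROR",
}


def error_response(status: int, message: str) -> Tuple[int, str]:
    """Build (status, JSON body) in the documented error envelope."""
    # Assumption: statuses missing from the table fall back to INTERNAL_ERROR
    code = ERROR_CODES.get(status, "INTERNAL_ERROR")
    body = {"error": {"code": code, "message": message}}
    return status, json.dumps(body)
```

Calling `error_response(404, "User not found")` produces a body with the same shape as the documented example.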
Environment Documentation
Environment Matrix
Environments
Overview
| Environment | URL | Purpose | Deploy |
|---|---|---|---|
| Production | https://myapp.com | Live users | Manual (main) |
| Staging | https://staging.myapp.com | Pre-release testing | Auto (main) |
| Preview | https://pr-*.vercel.app | PR review | Auto (PR) |
| Development | http://localhost:3000 | Local dev | Manual |
Configuration
Production
```env
NODE_ENV=production
DATABASE_URL=[Supabase Production]
NEXT_PUBLIC_APP_URL=https://myapp.com
```
Staging
```env
NODE_ENV=production
DATABASE_URL=[Supabase Staging Branch]
NEXT_PUBLIC_APP_URL=https://staging.myapp.com
```
Development
```env
NODE_ENV=development
DATABASE_URL=[Local Supabase]
NEXT_PUBLIC_APP_URL=http://localhost:3000
```
Access
Production
- Vercel: Admin only
- Database: Read-only for devs, write for admin
- Logs: All engineers
Staging
- Vercel: All engineers
- Database: All engineers
- Logs: All engineers
Secrets Rotation
| Secret | Rotation | Last Rotated |
|---|---|---|
| Database password | 90 days | 2024-01-15 |
| API keys | 90 days | 2024-01-15 |
| JWT secret | Never | Initial setup |
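A rotation table is only useful if someone notices when a date has passed. The following sketch shows how the check could be automated, assuming the table is available as structured data; `rotation_due` is a hypothetical helper, and `None` stands in for the "Never" rotation policy.

```python
from datetime import date, timedelta
from typing import Optional


def rotation_due(last_rotated: date, interval_days: Optional[int], today: date) -> bool:
    """Return True when a secret has reached or passed its rotation interval."""
    if interval_days is None:
        return False  # "Never" policy: the secret is not rotated on a schedule
    return today >= last_rotated + timedelta(days=interval_days)
```

With the values from the table, a secret last rotated on 2024-01-15 with a 90-day interval becomes due on 2024-04-14.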
Documentation-as-Code
Documentation Structure
```
docs/
├── README.md              # Documentation index
├── architecture/
│   ├── overview.md        # System architecture
│   ├── data-flow.md       # Data flow diagrams
│   └── decisions/         # ADRs
│       ├── 001-database.md
│       └── 002-hosting.md
├── runbooks/
│   ├── README.md          # Runbook index
│   ├── database.md        # Database issues
│   ├── deployment.md      # Deployment issues
│   └── outage.md          # Service outage
├── api/
│   └── reference.md       # API documentation
└── onboarding/
    ├── setup.md           # Local setup
    └── contributing.md    # How to contribute
```
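The `decisions/` directory above numbers ADR files sequentially (`001-database.md`, `002-hosting.md`). A small helper can pick the next number and derive the file name from a title. This is a sketch under that naming assumption; `slugify` and `next_adr_filename` are hypothetical names, not part of any tool mentioned here.

```python
import re
from typing import Iterable


def slugify(title: str) -> str:
    """Lowercase the title and join its alphanumeric runs with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def next_adr_filename(existing: Iterable[str], title: str) -> str:
    """Return the next NNN-slug.md name given the files already in decisions/."""
    numbers = []
    for name in existing:
        match = re.match(r"(\d{3})-", name)
        if match:  # ignore files such as README.md that carry no number
            numbers.append(int(match.group(1)))
    next_number = max(numbers, default=0) + 1
    return f"{next_number:03d}-{slugify(title)}.md"
```

Given the two files shown in the tree, a new ADR titled "Email Provider" would be named `003-email-provider.md`.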
Auto-Generated Documentation
```yaml
# .github/workflows/docs.yml
name: Generate Docs

on:
  push:
    branches: [main]
    paths:
      - 'src/**'
      - 'docs/**'

jobs:
  generate-docs:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Generate API docs from OpenAPI
        run: |
          npx @redocly/cli build-docs openapi.yaml \
            --output docs/api/index.html

      - name: Generate TypeDoc
        run: npx typedoc --out docs/api/typescript

      - name: Deploy to GitHub Pages
        uses: peaceiris/actions-gh-pages@v3
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          publish_dir: ./docs
```
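A pipeline like this can also guard against documentation going missing. The sketch below shows one possible check that could run as an extra step before publishing. The `REQUIRED_DOCS` list is an assumption drawn from the documentation structure shown earlier, and `missing_docs` is a hypothetical helper.

```python
from pathlib import Path
from typing import Iterable, List

# Assumed minimum set of files, taken from the docs/ layout in this guide
REQUIRED_DOCS = [
    "README.md",
    "architecture/overview.md",
    "runbooks/README.md",
    "api/reference.md",
    "onboarding/setup.md",
]


def missing_docs(docs_root: Path, required: Iterable[str] = REQUIRED_DOCS) -> List[str]:
    """Return the required files that do not exist under docs_root."""
    return [rel for rel in required if not (docs_root / rel).is_file()]
```

A wrapper script could exit with a non-zero status whenever `missing_docs(Path("docs"))` is non-empty, which would fail the job before deployment.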
Documentation Checklist
Architecture Docs
- [ ] System overview diagram
- [ ] Component descriptions
- [ ] Data flow documentation
- [ ] Security architecture
- [ ] Technology decisions (ADRs)
Operational Docs
- [ ] Runbooks for common issues
- [ ] Deployment procedures
- [ ] Monitoring and alerting
- [ ] Incident response plan
- [ ] On-call procedures
Developer Docs
- [ ] Local setup guide
- [ ] API reference
- [ ] Contributing guidelines
- [ ] Code conventions
- [ ] Testing guide
Maintenance
- [ ] Documentation review schedule
- [ ] Ownership assigned
- [ ] Change process defined
- [ ] Versioning strategy
When to Use This Skill
Invoke this skill when:
- Creating architecture documentation
- Writing runbooks for operations
- Documenting decision rationale (ADRs)
- Setting up documentation structure
- Creating onboarding materials
- Building automated documentation
- Planning incident response procedures