kafka-engineer


Install

```bash
npx skills add https://github.com/404kidwiz/claude-supercode-skills --skill kafka-engineer
```

Purpose

Provides Apache Kafka and event-streaming expertise, specializing in scalable event-driven architectures and real-time data pipelines. Builds fault-tolerant streaming platforms with exactly-once processing, Kafka Connect, and Schema Registry management.

When to Use

- Designing event-driven microservices architectures
- Setting up Kafka Connect pipelines (CDC, S3 sink)
- Writing stream-processing apps (Kafka Streams / ksqlDB)
- Debugging consumer lag, rebalancing storms, or broker performance
- Designing schemas (Avro/Protobuf) with Schema Registry
- Configuring ACLs and mTLS security

2. Decision Framework

Architecture Selection

```
What is the use case?
│
├─ Data Integration (ETL)
│  ├─ DB to DB / data lake?    → Kafka Connect (zero code)
│  └─ Complex transformations? → Kafka Streams
│
├─ Real-Time Analytics
│  ├─ SQL-like queries?        → ksqlDB (quick aggregation)
│  └─ Complex stateful logic?  → Kafka Streams / Flink
│
└─ Microservices Communication
   ├─ Event notification?      → standard producer/consumer
   └─ Event sourcing?          → state stores (RocksDB)
```

Config Tuning (The "Big 3")

- Throughput: batch.size, linger.ms, compression.type=lz4
- Latency: linger.ms=0, acks=1
- Durability: acks=all, min.insync.replicas=2, replication.factor=3
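To make these knobs concrete, here is a minimal sketch of a throughput-tuned producer; the broker address, topic name, and exact sizes are illustrative assumptions, not values prescribed by this skill.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ThroughputTunedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // assumed address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);

        // Throughput profile: batch more, wait briefly to fill batches, compress.
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 64 * 1024); // 64 KB batches (illustrative)
        props.put(ProducerConfig.LINGER_MS_CONFIG, 20);         // up to 20 ms of batching delay
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");

        // The durability profile would instead set acks=all (plus min.insync.replicas=2
        // and replication.factor=3 on the topic); the latency profile: linger.ms=0, acks=1.

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("events", "key", "value")); // topic assumed
        }
    }
}
```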

Red Flags → Escalate to sre-engineer:

"Unclean leader election" enabled (Data loss risk) Zookeeper dependency in new clusters (Use KRaft mode) Disk usage > 80% on brokers Consumer lag constantly increasing (Capacity mismatch) 3. Core Workflows Workflow 1: Kafka Connect (CDC)

Goal: Stream changes from PostgreSQL to S3.

Steps:

Source Config (postgres-source.json)

{ "name": "postgres-source", "config": { "connector.class": "io.debezium.connector.postgresql.PostgresConnector", "database.hostname": "db-host", "database.dbname": "mydb", "database.user": "kafka", "plugin.name": "pgoutput" } }

Sink Config (s3-sink.json)

{ "name": "s3-sink", "config": { "connector.class": "io.confluent.connect.s3.S3SinkConnector", "s3.bucket.name": "my-datalake", "format.class": "io.confluent.connect.s3.format.parquet.ParquetFormat", "flush.size": "1000" } }

Deploy

```bash
# The Connect REST API requires the Content-Type header; deploy both connectors.
curl -X POST -H "Content-Type: application/json" \
  -d @postgres-source.json http://connect:8083/connectors
curl -X POST -H "Content-Type: application/json" \
  -d @s3-sink.json http://connect:8083/connectors
```

Workflow 3: Schema Registry Integration

Goal: Enforce schema compatibility.

Steps:

Define Schema (user.avsc)

{ "type": "record", "name": "User", "fields": [ {"name": "id", "type": "int"}, {"name": "name", "type": "string"} ] }

Producer (Java)

Use KafkaAvroSerializer and set schema.registry.url to http://schema-registry:8081.
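A minimal sketch, assuming the User schema above, an assumed users topic, and an assumed broker address; with Confluent's serializer, the schema is registered automatically on first produce.

```java
import java.util.Properties;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AvroUserProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092"); // assumed address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://schema-registry:8081");

        // Parse user.avsc (inlined here) and build a record matching it.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"int\"},{\"name\":\"name\",\"type\":\"string\"}]}");
        GenericRecord user = new GenericData.Record(schema);
        user.put("id", 42);
        user.put("name", "Alice");

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("users", "42", user)); // topic assumed
        }
    }
}
```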

5. Anti-Patterns & Gotchas

❌ Anti-Pattern 1: Large Messages

What it looks like:

Sending a 10 MB image payload in a Kafka message.

Why it fails:

Kafka is optimized for small messages (< 1 MB); large messages block broker threads.

Correct approach:

Store the image in S3 and send a reference URL in the Kafka message (the claim-check pattern); a minimal sketch follows.
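A sketch of the claim-check pattern, assuming the AWS SDK for Java v2 and reusing the my-datalake bucket from the sink config above; the topic and key layout are illustrative.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

public class ClaimCheckProducer {
    public static void send(KafkaProducer<String, String> producer,
                            String imageId, byte[] imageBytes) {
        // 1. Park the heavy payload in object storage.
        try (S3Client s3 = S3Client.create()) {
            s3.putObject(
                PutObjectRequest.builder().bucket("my-datalake").key("images/" + imageId).build(),
                RequestBody.fromBytes(imageBytes));
        }
        // 2. Publish only a small reference; consumers fetch the object on demand.
        producer.send(new ProducerRecord<>("image-events", imageId,   // topic assumed
            "s3://my-datalake/images/" + imageId));
    }
}
```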

❌ Anti-Pattern 2: Too Many Partitions

What it looks like:

Creating 10,000 partitions on a small cluster.

Why it fails:

Slow leader election (ZooKeeper overhead) and high file-handle usage.

Correct approach:

Limit partitions per broker (~4,000), and use fewer topics or a larger cluster.

❌ Anti-Pattern 3: Blocking Consumer

What it looks like:

A consumer making a heavy HTTP call (30 s) for each message.

Why it fails:

Rebalance storm: the consumer misses its poll deadline (max.poll.interval.ms), is evicted from the group, and triggers repeated rebalances.

Correct approach:

- Async processing: move the blocking work to a thread pool.
- Pause/resume: call consumer.pause() when the in-flight buffer fills and consumer.resume() once it drains, as in the sketch below.
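A minimal sketch combining both ideas; broker, group, topic, and thresholds are assumptions. Polling continues while paused, so the consumer keeps its group membership without fetching more records.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class NonBlockingConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092"); // assumed
        props.put("group.id", "order-workers");        // assumed
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        ThreadPoolExecutor pool = (ThreadPoolExecutor) Executors.newFixedThreadPool(8);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders")); // topic assumed
            boolean paused = false;
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
                for (ConsumerRecord<String, String> record : records) {
                    pool.submit(() -> slowHttpCall(record.value())); // offload blocking work
                }
                int backlog = pool.getQueue().size();
                if (!paused && backlog > 1_000) {          // buffer full: stop fetching...
                    consumer.pause(consumer.assignment()); // ...but keep calling poll()
                    paused = true;
                } else if (paused && backlog < 100) {      // buffer drained: fetch again
                    consumer.resume(consumer.paused());
                    paused = false;
                }
            }
        }
    }

    private static void slowHttpCall(String payload) { /* blocking I/O lives here */ }
}
```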

7. Quality Checklist

Configuration:

- Replication: factor 3 for production
- min.insync.replicas: 2 (prevents data loss)
- Retention: configured deliberately (time vs. size)

Observability:

- Lag: consumer lag monitored (Burrow/Prometheus)
- Under-replicated: alert on under-replicated partitions (> 0)
- JMX: metrics exported

Examples

Example 1: Real-Time Fraud Detection Pipeline

Scenario: A financial services company needs real-time fraud detection using Kafka streaming.

Architecture Implementation:

- Event ingestion: Kafka Connect CDC from the PostgreSQL transaction database
- Stream processing: Kafka Streams application for real-time pattern detection
- Alert system: producer to an alerts topic that triggers notifications
- Storage: S3 sink for historical analysis and compliance

Pipeline Configuration:

| Component   | Configuration                      | Purpose                 |
|-------------|------------------------------------|-------------------------|
| Topics      | 3 (transactions, alerts, enriched) | Data organization       |
| Partitions  | 12 (3 brokers × 4)                 | Parallelism             |
| Replication | 3                                  | High availability       |
| Compression | LZ4                                | Throughput optimization |
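One way to express this layout in code is an AdminClient sketch; the broker address is an assumption, and compression.type is applied here as a topic-level config.

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateFraudTopics {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092"); // assumed

        Map<String, String> cfg = Map.of("compression.type", "lz4");
        List<NewTopic> topics = List.of(
            new NewTopic("transactions", 12, (short) 3).configs(cfg), // 12 partitions, RF 3
            new NewTopic("alerts", 12, (short) 3).configs(cfg),
            new NewTopic("enriched", 12, (short) 3).configs(cfg));

        try (AdminClient admin = AdminClient.create(props)) {
            admin.createTopics(topics).all().get(); // block until the brokers confirm
        }
    }
}
```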

Key Logic:

- Detects velocity patterns (5+ transactions in 1 minute); see the sketch below
- Identifies geographic anomalies (impossible travel)
- Flags high-risk merchant categories
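The velocity rule maps naturally onto a windowed count. A minimal Kafka Streams sketch, assuming transactions are keyed by account ID and a Kafka Streams version with TimeWindows.ofSizeWithNoGrace (3.0+):

```java
import java.time.Duration;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;

public class VelocityDetector {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "fraud-velocity"); // assumed
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // assumed

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("transactions", Consumed.with(Serdes.String(), Serdes.String()))
            .groupByKey()                                                 // key = account ID
            .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(1)))
            .count()                                                      // transactions per window
            .toStream()
            .filter((windowedKey, count) -> count >= 5)                   // 5+ in one minute
            .map((windowedKey, count) ->
                KeyValue.pair(windowedKey.key(), "VELOCITY_ALERT:" + count))
            .to("alerts", Produced.with(Serdes.String(), Serdes.String()));

        new KafkaStreams(builder.build(), props).start();
    }
}
```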

Results:

- 99.7% of fraud detected in under 100 ms
- False-positive rate reduced from 5% to 0.3%
- Compliance audit passed with zero findings

Example 2: E-Commerce Order Processing System

Scenario: Build a resilient order processing system with Kafka for high reliability.

System Design:

- Order events: topic for order lifecycle events
- Inventory service: consumes orders, updates stock
- Payment service: processes payments, publishes results
- Notification service: sends confirmations via email/SMS

Resilience Patterns:

- Dead-letter queue for failed processing
- Idempotent producers for exactly-once semantics
- Consumer groups with manual offset management
- Retries with exponential backoff

Configuration:

Producer Configuration

```properties
acks: all
retries: 3
enable.idempotence: true
```

Consumer Configuration

```properties
auto.offset.reset: earliest
enable.auto.commit: false
max.poll.records: 500
```
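Putting the consumer settings together with the resilience patterns above, here is a minimal sketch of a manually committed consumer loop with a dead-letter path; broker, group, topic, and helper names are assumptions.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class OrderConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092"); // assumed
        props.put("group.id", "order-processor");      // assumed
        props.put("auto.offset.reset", "earliest");
        props.put("enable.auto.commit", "false");      // manual offset management
        props.put("max.poll.records", "500");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders")); // topic assumed
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    try {
                        processOrder(record.value());  // business logic (stub)
                    } catch (Exception e) {
                        sendToDeadLetterQueue(record); // park poison messages, keep going
                    }
                }
                consumer.commitSync(); // commit only after the whole batch is handled
            }
        }
    }

    private static void processOrder(String order) { /* ... */ }
    private static void sendToDeadLetterQueue(ConsumerRecord<String, String> r) { /* produce to a DLQ topic */ }
}
```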

Results:

- 99.99% message delivery reliability
- Zero duplicate orders in 6 months
- Peak processing: 10,000 orders/second

Example 3: IoT Telemetry Platform

Scenario: Process millions of IoT device telemetry messages with Kafka.

Platform Architecture:

- Device gateway: MQTT-to-Kafka proxy
- Data enrichment: stream processing adds device metadata
- Time-series storage: S3 sink partitioned by device_id/date
- Real-time alerts: threshold-based alerting for anomalies

Scalability Configuration:

- 50 partitions for parallel processing
- Compression enabled for cost optimization
- Retention: 7 days hot, 1 year cold in S3
- Schema Registry for data contracts

Performance Metrics:

| Metric             | Value                          |
|--------------------|--------------------------------|
| Throughput         | 500,000 messages/sec           |
| Latency (P99)      | 50 ms                          |
| Consumer lag       | < 1 second                     |
| Storage efficiency | 60% reduction with compression |

Best Practices

Topic Design

- Naming conventions: use clear, hierarchical topic names (domain.entity.event)
- Partition strategy: plan for future growth (3× expected throughput)
- Retention policies: match retention to business requirements
- Cleanup policies: use delete for time-based data, compact for state
- Schema management: enforce schemas via Schema Registry

Producer Optimization

- Batching: increase batch.size and linger.ms for throughput
- Compression: use LZ4 for a balance of speed and size
- Acks: use acks=all for reliability, acks=1 for latency
- Retry strategy: implement retries with backoff
- Idempotence: enable for exactly-once semantics on critical paths

Consumer Best Practices

- Offset management: use manual commits for critical processing
- Batch processing: increase max.poll.records for efficiency
- Rebalance handling: implement graceful shutdown
- Error handling: dead-letter queues for poison messages
- Monitoring: track consumer lag and processing time

Security Configuration

- Encryption: TLS for all client-broker communication
- Authentication: SASL/SCRAM or mTLS for production
- Authorization: ACLs with the least-privilege principle
- Quotas: implement client quotas to prevent abuse
- Audit logging: log all access and configuration changes

Performance Tuning

- Broker configuration: optimize for workload type (throughput vs. latency)
- JVM tuning: heap size and garbage-collector selection
- OS tuning: file-descriptor limits, network settings
- Monitoring: metrics for throughput, latency, and errors
- Capacity planning: regular review and scaling assessment

Security:

- Encryption: TLS enabled for client-broker and inter-broker traffic
- Auth: SASL/SCRAM or mTLS enabled
- ACLs: principle of least privilege (topic read/write); a client-side sketch follows
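For reference, a sketch of the client-side properties for a SASL_SSL (SCRAM) connection; the listener port, principal, and truststore path are assumptions.

```java
import java.util.Properties;

public class SecureClientConfig {
    static Properties secureProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9093");  // TLS listener (assumed port)
        props.put("security.protocol", "SASL_SSL");     // encrypt + authenticate
        props.put("sasl.mechanism", "SCRAM-SHA-512");
        props.put("sasl.jaas.config",
            "org.apache.kafka.common.security.scram.ScramLoginModule required "
            + "username=\"svc-orders\" password=\"<secret>\";"); // assumed principal
        props.put("ssl.truststore.location", "/etc/kafka/client.truststore.jks"); // assumed path
        props.put("ssl.truststore.password", "<secret>");
        return props;
    }
}
```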
