# ai-llm-inference


## Install

`npx skills add https://github.com/vasilyu1983/ai-agents-public --skill ai-llm-inference`
## LLMOps - Inference & Optimization - Production Skill Hub
### Modern Best Practices (January 2026)
- **Treat inference as a systems problem**: SLOs, tail latency, retries, overload, and cache strategy.
- Use **continuous batching / smart scheduling** when serving many concurrent requests (Orca scheduling: https://www.usenix.org/conference/osdi22/presentation/yu).
- Use **KV-cache aware serving** (PagedAttention/vLLM: https://arxiv.org/abs/2309.06180) and **efficient attention kernels** (FlashAttention: https://arxiv.org/abs/2205.14135); see the sketch after this list.
- Use **speculative decoding** when latency is critical and draft-model quality is acceptable (speculative decoding: https://arxiv.org/abs/2302.01318).
- Quantize only with **measured** quality impact and a rollback plan (quantization must be validated on your eval set).
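As one concrete illustration of KV-cache aware serving with prefix caching, here is a minimal sketch using vLLM's offline engine. The model name, limits, and engine arguments are assumptions for illustration; verify argument names against your installed vLLM version.

```python
# Sketch: KV-cache aware serving with prefix caching (vLLM shown as one option).
# Model name and engine arguments are illustrative; verify against your vLLM version.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # assumption: any chat model you serve
    enable_prefix_caching=True,     # reuse KV cache for shared prompt prefixes
    gpu_memory_utilization=0.90,    # leave headroom for KV cache growth
    max_model_len=8192,             # enforce a context budget at the engine level
)

SYSTEM = "You are a support assistant. Answer concisely."  # shared prefix -> cache hit
params = SamplingParams(max_tokens=256, temperature=0.2)

prompts = [f"{SYSTEM}\n\nUser: {q}\nAssistant:" for q in (
    "How do I reset my password?",
    "How do I export my data?",
)]
# The engine schedules these requests as a batch internally; in the online server
# path this corresponds to continuous batching under concurrent traffic.
for output in llm.generate(prompts, params):
    print(output.outputs[0].text.strip())
```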
This skill provides **production-ready operational patterns** for optimizing LLM inference performance, cost, and reliability. It centralizes **decision rules**, **optimization strategies**, **configuration templates**, and **operational checklists** for inference workloads.

No theory. No narrative. Only what Codex can execute.
### When to Use This Skill

Codex should activate this skill whenever the user asks for:

- Optimizing LLM inference latency or throughput
- Choosing quantization strategies (FP8/FP4/INT8/INT4)
- Configuring vLLM, TensorRT-LLM, or DeepSpeed inference
- Scaling LLM inference across GPUs (tensor/pipeline parallelism)
- Building high-throughput LLM APIs
- Improving context window performance (KV cache optimization)
- Using speculative decoding for faster generation
- Reducing cost per token
- Profiling and benchmarking inference workloads
- Planning infrastructure capacity
- CPU/edge deployment patterns
- High availability and resilience patterns
### Scope Boundaries (Use These Skills for Depth)

- Prompting, tuning, datasets -> `ai-llm`
- RAG pipeline construction -> `ai-rag`
- Deployment, APIs, monitoring -> `ai-mlops`
- Safety, governance -> `ai-mlops`
- Performance monitoring -> `qa-observability`
- Infrastructure operations -> `ops-devops-platform`
### Quick Reference

| Task | Tool/Framework | Command/Pattern | When to Use |
| --- | --- | --- | --- |
| Latency budget | SLO + load model | TTFT/ITL + P95/P99 under load | Any production endpoint |
| Tail-latency control | Scheduling + timeouts | Admission control + queue caps + backpressure (see sketch below) | Prevent p99 explosions |
| Throughput | Batching + KV-cache aware serving | Continuous batching + KV paging | High concurrency serving |
| Cost control | Model tiering + caching | Cache (prefix/response) + quotas | Reduce spend and overload risk |
| Long context | Prefill optimization | Chunked prefill + prompt compression | Long inputs and RAG-heavy apps |
| Parallelism | TP/PP/DP | Choose by model size and interconnect | Models that do not fit one device |
| Reliability | Resilience patterns | Timeouts + circuit breakers + idempotency | Avoid cascading failures |
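To make the "Tail-latency control" row concrete, here is a minimal asyncio sketch of admission control with a bounded queue and fast overload rejection. The limits, timeout, and exception type are illustrative assumptions; size them from measured capacity.

```python
# Sketch: admission control + queue cap + backpressure for an inference endpoint.
# MAX_CONCURRENCY / MAX_QUEUE / QUEUE_TIMEOUT_S are illustrative, not recommendations.
import asyncio

MAX_CONCURRENCY = 8        # requests decoding at once
MAX_QUEUE = 32             # requests allowed to wait; beyond this, shed load
QUEUE_TIMEOUT_S = 2.0      # fail fast instead of letting p99 explode

class Overloaded(Exception):
    """Map this to HTTP 429/503 at the API layer."""

_sem = asyncio.Semaphore(MAX_CONCURRENCY)
_waiting = 0

async def admit_and_run(generate):
    """Run `generate()` (the actual model call) under admission control."""
    global _waiting
    if _waiting >= MAX_QUEUE:
        raise Overloaded("queue full")            # immediate backpressure signal
    _waiting += 1
    try:
        # Bound time spent queueing; long queues are where tail latency hides.
        await asyncio.wait_for(_sem.acquire(), timeout=QUEUE_TIMEOUT_S)
    except asyncio.TimeoutError:
        raise Overloaded("queue wait timed out")
    finally:
        _waiting -= 1                             # no longer waiting, whether admitted or not
    try:
        return await generate()
    finally:
        _sem.release()
```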
### Decision Tree: Inference Optimization Strategy
Need to optimize LLM inference: [Optimization Path]
├─ High throughput (>10k tok/s) OR P99 variance > 3x P50?
│ └─ YES -> Disaggregated inference (prefill/decode separation)
│ See references/disaggregated-inference.md
├─ Primary constraint: Throughput?
│ ├─ Many concurrent users? -> batching + KV-cache aware serving + admission control
│ ├─ Chat/agents with KV reuse? -> SGLang (RadixAttention)
│ └─ Mostly batch/offline? -> batch inference jobs + large batches + spot capacity
├─ Primary constraint: Cost?
│ ├─ Can accept lower quality tier? -> model tiering (small/medium/large router)
│ └─ Must keep quality? -> caching + prompt/context reduction before quantization
├─ Primary constraint: Latency?
│ ├─ Draft model acceptable? -> speculative decoding
│ └─ Long context? -> prefill optimizations + FlashAttention-3 + context budgets
├─ Large model (>70B)?
│ ├─ Multiple GPUs? -> Tensor parallelism (NVLink required)
│ └─ Deep model? -> Pipeline parallelism (minimize bubbles)
├─ Hardware selection?
│ ├─ Memory-bound? -> more HBM, higher bandwidth
│ ├─ Latency-bound? -> faster clocks + kernel support
│ └─ Multi-node? -> prioritize interconnect (NVLink/RDMA) and topology
│ Notes: treat GPU/SKU advice as time-sensitive; verify with vendor docs and your own benchmarks.
│ See references/gpu-optimization-checklists.md and references/infrastructure-tuning.md
└─ Edge deployment?
└─ CPU + quantization -> llama.cpp/GGUF for constrained resources
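To make the cost branch's "model tiering (small/medium/large router)" option concrete, here is a minimal routing sketch. The tier names, model identifiers, heuristics, and thresholds are illustrative assumptions and should be calibrated against your own quality/cost evals.

```python
# Sketch: small/medium/large model tiering by cheap request features.
# Tier names, model identifiers, and thresholds are hypothetical; tune on your eval set.
from dataclasses import dataclass

@dataclass
class Tier:
    name: str
    model: str                # hypothetical deployment names
    max_cost_per_1k: float    # rough budget ceiling for this tier

TIERS = [
    Tier("small",  "llm-small-8b",   0.10),
    Tier("medium", "llm-medium-70b", 0.60),
    Tier("large",  "llm-large",      3.00),
]

def route(prompt: str, needs_tools: bool, prior_failure: bool) -> Tier:
    """Pick the cheapest tier that is plausibly good enough; escalate on failure."""
    if prior_failure:
        return TIERS[-1]          # cascade: retry hard cases on the largest model
    if needs_tools or len(prompt) > 8_000:
        return TIERS[1]           # tool use or long context -> medium tier
    return TIERS[0]               # default to the cheap tier
```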
### Intake Checklist (REQUIRED)

Before recommending changes, collect (or infer) these inputs:

- Model + variant (size, context length, precision/quantization, tokenizer)
- Traffic shape (prompt/output length distributions, concurrency, QPS, streaming vs non-streaming)
- SLOs and budgets (TTFT/ITL/total latency targets, error budget, cost per request)
- Serving stack (engine/version, batching/scheduling settings, caching, parallelism, autoscaling)
- Hardware and topology (GPU type/count, VRAM, NVLink/RDMA, CPU/RAM, storage, cluster/runtime)
- Constraints (quality floor, safety requirements, rollout/rollback constraints)
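If it helps to keep intake consistent across engagements, the checklist can be captured as a structured record. The field names and shapes below simply mirror the list above and are illustrative only.

```python
# Sketch: structured intake record mirroring the checklist above (shapes illustrative).
from dataclasses import dataclass, field

@dataclass
class InferenceIntake:
    model: str                                          # e.g. "llama-3.1-70b-instruct, fp8, 128k ctx"
    traffic: dict = field(default_factory=dict)         # prompt/output length dists, QPS, concurrency, streaming
    slos: dict = field(default_factory=dict)            # TTFT/ITL/total latency targets, error budget, $/request
    serving_stack: dict = field(default_factory=dict)   # engine + version, batching, caching, parallelism, autoscaling
    hardware: dict = field(default_factory=dict)        # GPU type/count, VRAM, NVLink/RDMA, CPU/RAM, storage
    constraints: dict = field(default_factory=dict)     # quality floor, safety, rollout/rollback limits
```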
### Core Concepts & Practices

#### Core Concepts (Vendor-Agnostic)

- **Latency components**: queueing + prefill + decode; optimize the largest contributor first.
- **Tail latency**: p99 is dominated by queuing and long prompts; fix with admission control and context budgets.
- **Retries**: retries can multiply load; bound retries and use hedged requests only with strict budgets (see the retry sketch after this list).
- **Caching**: prefix caching helps repeated system/tool scaffolds; response caching helps repeated questions (requires invalidation).
- **Security & privacy**: prompts/outputs can contain sensitive data; scrub logs, enforce auth/tenancy, and rate-limit abuse (OWASP LLM Top 10: https://owasp.org/www-project-top-10-for-large-language-model-applications/).
#### Implementation Practices (Tooling Examples)

- **Measure under load**: benchmark TTFT/ITL and p95/p99 with realistic concurrency and prompt lengths.
- **Separate environments**: dev/stage/prod model configs; promote only after passing the inference review checklist.
- **Export telemetry**: request-level tokens, TTFT/ITL, queue depth, GPU memory headroom, and error classes (OpenTelemetry GenAI semantic conventions: https://opentelemetry.io/docs/specs/semconv/gen-ai/); see the telemetry sketch below.
### Do / Avoid

**Do**

- Do enforce `max_input_tokens` and `max_output_tokens` at the API boundary (see the validation sketch after this section).
- Do cap concurrency and queue depth; return overload errors quickly.
- Do validate quality after any quantization or kernel change.

**Avoid**

- Avoid unbounded retries (amplifies outages).
- Avoid unbounded context windows (OOM + latency spikes).
- Avoid benchmarking on single requests; always test with realistic concurrency.
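A minimal sketch of enforcing token limits at the API boundary with FastAPI and Pydantic. The limits, route, field names, and the token-counting heuristic are illustrative assumptions; align them with your public API schema and real tokenizer.

```python
# Sketch: enforce max_input_tokens / max_output_tokens at the API boundary.
# Limits and field names are illustrative; align with your public API schema.
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field

MAX_INPUT_TOKENS = 8_000
MAX_OUTPUT_TOKENS = 1_024

app = FastAPI()

class GenerateRequest(BaseModel):
    prompt: str
    max_output_tokens: int = Field(default=256, ge=1, le=MAX_OUTPUT_TOKENS)

def count_tokens(text: str) -> int:
    # Assumption: replace with your tokenizer; a character heuristic is only a fallback.
    return max(1, len(text) // 4)

@app.post("/v1/generate")
async def generate(req: GenerateRequest):
    if count_tokens(req.prompt) > MAX_INPUT_TOKENS:
        raise HTTPException(status_code=413, detail="prompt exceeds max_input_tokens")
    # ... hand off to the serving engine with req.max_output_tokens as a hard cap ...
    return {"status": "accepted"}
```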
### Accuracy Protocol (REQUIRED)

- Treat performance ratios (for example, "2x faster") as hypotheses unless a source is cited and the workload is comparable.
- Do not recommend hardware/SKU changes without stating assumptions (model size, context length, concurrency, interconnect).
- Prefer a measured baseline + checklist-driven rollout over "best practice" claims.
### Resources (Detailed Operational Guides)

For comprehensive guides on specific topics, see:

**Infrastructure & Serving**

- **Disaggregated Inference** - Prefill/decode separation (2025+ standard)
- **Infrastructure Tuning** - OS, container, Kubernetes optimization for GPU workloads
- **Serving Architectures** - Production serving stack patterns (vLLM, SGLang, TensorRT-LLM, NVIDIA Dynamo)
- **Resilience & HA Patterns** - Multi-region, failover, traffic management

**Performance Optimization**

- **Quantization Patterns** - FP8/FP4/INT8/INT4 decision trees (FP8 first, INT8 not on Blackwell)
- **KV Cache Optimization** - PagedAttention, FlashAttention-3, FlashInfer, RadixAttention
- **Parallelism Patterns** - Tensor/pipeline/expert parallelism strategies
- **Optimization Strategies** - Throughput, cost, memory optimization
- **Batching & Scheduling** - Continuous batching and throughput patterns

**Deployment & Operations**

- **Edge & CPU Optimization** - llama.cpp, GGUF, mobile/browser deployment
- **GPU Optimization Checklists** - Hardware-specific tuning
- **Speculative Decoding Guide** - Advanced generation acceleration
- **Profiling & Capacity Planning** - Benchmarking, SLOs, replica sizing

**Cost & Routing**

- **Cost Optimization Patterns** - Token budgets, caching economics, model tiering, cost-per-outcome tracking
- **Multi-Model Routing** - Router architectures, quality-cost tradeoffs, cascading strategies, A/B routing
- **Streaming Patterns** - SSE/WebSocket serving, token-by-token delivery, backpressure, client integration
### Templates

**Inference Configs**

Production-ready configuration templates for leading inference engines:

- **vLLM Configuration** - Continuous batching, PagedAttention setup
- **TensorRT-LLM Configuration** - NVIDIA kernel optimizations
- **DeepSpeed Inference** - PyTorch-friendly inference

**Quantization & Compression**

Model compression templates for reducing memory and cost:

- **GPTQ Quantization** - GPU post-training quantization
- **AWQ Quantization** - Activation-aware weight quantization
- **GGUF Format** - CPU/edge optimized formats

**Serving Pipelines**

High-throughput serving architectures:

- **LLM API Server** - FastAPI + vLLM production setup
- **High-Throughput Setup** - Multi-replica scaling patterns

**Caching & Batching**

Performance optimization templates:

- **Prefix Caching** - KV cache reuse strategies
- **Batching Configuration** - Continuous batching tuning

**Benchmarking**

Performance measurement and validation:

- **Latency & Throughput Testing** - Load testing framework

**Checklists**

- **Inference Performance Review Checklist** - Baseline, bottlenecks, rollout readiness
### Navigation

**Resources**

- `references/disaggregated-inference.md`
- `references/serving-architectures.md`
- `references/profiling-and-capacity-planning.md`
- `references/gpu-optimization-checklists.md`
- `references/speculative-decoding-guide.md`
- `references/resilience-ha-patterns.md`
- `references/optimization-strategies.md`
- `references/kv-cache-optimization.md`
- `references/batching-and-scheduling.md`
- `references/quantization-patterns.md`
- `references/parallelism-patterns.md`
- `references/edge-cpu-optimization.md`
- `references/infrastructure-tuning.md`
- `references/cost-optimization-patterns.md`
- `references/multi-model-routing.md`
- `references/streaming-patterns.md`

**Templates**

- `assets/serving/template-llm-api.md`
- `assets/serving/template-high-throughput-setup.md`
- `assets/inference/template-vllm-config.md`
- `assets/inference/template-tensorrtllm-config.md`
- `assets/inference/template-deepspeed-inference.md`
- `assets/quantization/template-awq.md`
- `assets/quantization/template-gptq.md`
- `assets/quantization/template-gguf.md`
- `assets/batching/template-batching-config.md`
- `assets/caching/template-prefix-caching.md`
- `assets/benchmarking/template-latency-throughput-test.md`
- `assets/checklists/inference-review-checklist.md`

**Data**

- `data/sources.json` - Curated external references
### Trend Awareness Protocol

**IMPORTANT**: When users ask recommendation questions about LLM inference, you MUST use WebSearch to check current trends before answering.

**Trigger Conditions**

- "What's the best inference engine for [use case]?"
- "What should I use for [serving/quantization/batching]?"
- "What's the latest in LLM inference optimization?"
- "Current best practices for [vLLM/TensorRT/quantization]?"
- "Is [inference tool] still relevant in 2026?"
- "[vLLM] vs [TensorRT-LLM] vs [SGLang]?"
- "Best quantization method for [model size]?"
- "What GPU should I use for inference?"

**Required Searches**

- Search: "LLM inference optimization best practices 2026"
- Search: "[vLLM/TensorRT-LLM/SGLang] comparison 2026"
- Search: "LLM quantization trends January 2026"
- Search: "LLM serving new releases 2026"

**What to Report**

After searching, provide:

- **Current landscape**: What serving engines are popular NOW (not 6 months ago)
- **Emerging trends**: New inference optimizations gaining traction
- **Deprecated/declining**: Techniques or tools losing relevance
- **Recommendation**: Based on fresh data, not just static knowledge

**Example Topics (verify with fresh search)**

- Inference engines (vLLM 0.7+, TensorRT-LLM, SGLang, llama.cpp)
- Quantization methods (FP8, AWQ, GPTQ, GGUF, bitsandbytes)
- Attention kernels (FlashAttention-3, FlashInfer, xFormers)
- Speculative decoding advances
- KV cache optimization techniques
- New GPU architectures (H200, Blackwell) and their optimizations