# LLMOps - Inference & Optimization - Production Skill Hub

Modern Best Practices (January 2026)

Install: `npx skills add https://github.com/vasilyu1983/ai-agents-public --skill ai-llm-inference`
- Treat inference as a **systems problem**: SLOs, tail latency, retries, overload, and cache strategy.
- Use **continuous batching / smart scheduling** when serving many concurrent requests (Orca scheduling: https://www.usenix.org/conference/osdi22/presentation/yu).
- Use **KV-cache aware serving** (PagedAttention/vLLM: https://arxiv.org/abs/2309.06180) and **efficient attention kernels** (FlashAttention: https://arxiv.org/abs/2205.14135).
- Use **speculative decoding** when latency is critical and draft-model quality is acceptable (speculative decoding: https://arxiv.org/abs/2302.01318).
- Quantize only with **measured** quality impact and a rollback plan (quantization must be validated on your eval set).
This skill provides **production-ready operational patterns** for optimizing LLM inference performance, cost, and reliability. It centralizes **decision rules**, **optimization strategies**, **configuration templates**, and **operational checklists** for inference workloads.

No theory. No narrative. Only what Codex can execute.
## When to Use This Skill

Codex should activate this skill whenever the user asks for:
- Optimizing LLM inference latency or throughput
- Choosing quantization strategies (FP8/FP4/INT8/INT4)
- Configuring vLLM, TensorRT-LLM, or DeepSpeed inference
- Scaling LLM inference across GPUs (tensor/pipeline parallelism)
- Building high-throughput LLM APIs
- Improving context window performance (KV cache optimization)
- Using speculative decoding for faster generation
- Reducing cost per token
- Profiling and benchmarking inference workloads
- Planning infrastructure capacity
- CPU/edge deployment patterns
- High availability and resilience patterns
## Scope Boundaries (Use These Skills for Depth)

- Prompting, tuning, datasets -> `ai-llm`
- RAG pipeline construction -> `ai-rag`
- Deployment, APIs, monitoring -> `ai-mlops`
- Safety, governance -> `ai-mlops`
- Performance monitoring -> `qa-observability`
- Infrastructure operations -> `ops-devops-platform`
## Quick Reference

| Task | Tool/Framework | Command/Pattern | When to Use |
| --- | --- | --- | --- |
| Latency budget | SLO + load model | TTFT/ITL + P95/P99 under load | Any production endpoint |
| Tail-latency control | Scheduling + timeouts | Admission control + queue caps + backpressure | Prevent p99 explosions |
| Throughput | Batching + KV-cache aware serving | Continuous batching + KV paging | High-concurrency serving |
| Cost control | Model tiering + caching | Cache (prefix/response) + quotas | Reduce spend and overload risk |
| Long context | Prefill optimization | Chunked prefill + prompt compression | Long inputs and RAG-heavy apps |
| Parallelism | TP/PP/DP | Choose by model size and interconnect | Models that do not fit one device |
| Reliability | Resilience patterns | Timeouts + circuit breakers + idempotency | Avoid cascading failures |
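The continuous-batching row above can be illustrated with a minimal iteration-level scheduler simulation (all numbers are synthetic; a real engine schedules token-by-token on the GPU):

```python
# Minimal continuous-batching scheduler sketch: at every decode step,
# finished sequences leave and queued requests join, so batch slots are
# never held hostage by the longest sequence in the batch.
from collections import deque

def run_continuous_batching(request_lengths, max_batch=2):
    """request_lengths: tokens each request must decode. Returns total steps."""
    queue = deque(enumerate(request_lengths))
    running = {}          # request id -> tokens remaining
    steps = 0
    while queue or running:
        # Admit new requests into free slots (the "continuous" part).
        while queue and len(running) < max_batch:
            rid, length = queue.popleft()
            running[rid] = length
        # One decode iteration: every running request emits one token.
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]  # slot freed mid-batch, not at batch end
        steps += 1
    return steps

# Static batching would take max(3,1) + max(4) = 7 steps for these requests;
# continuous batching backfills the freed slot immediately.
print(run_continuous_batching([3, 1, 4], max_batch=2))  # 5
```

The same intuition explains why continuous batching dominates static batching under mixed output lengths: short requests stop paying for long ones.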
## Decision Tree: Inference Optimization Strategy

    Need to optimize LLM inference: [Optimization Path]
    │
    ├─ High throughput (>10k tok/s) OR P99 variance > 3x P50?
    │   └─ YES -> Disaggregated inference (prefill/decode separation)
    │       See references/disaggregated-inference.md
    │
    ├─ Primary constraint: Throughput?
    │   ├─ Many concurrent users? -> batching + KV-cache aware serving + admission control
    │   ├─ Chat/agents with KV reuse? -> SGLang (RadixAttention)
    │   └─ Mostly batch/offline? -> batch inference jobs + large batches + spot capacity
    │
    ├─ Primary constraint: Cost?
    │   ├─ Can accept lower quality tier? -> model tiering (small/medium/large router)
    │   └─ Must keep quality? -> caching + prompt/context reduction before quantization
    │
    ├─ Primary constraint: Latency?
    │   ├─ Draft model acceptable? -> speculative decoding
    │   └─ Long context? -> prefill optimizations + FlashAttention-3 + context budgets
    │
    ├─ Large model (>70B)?
    │   ├─ Multiple GPUs? -> Tensor parallelism (NVLink required)
    │   └─ Deep model? -> Pipeline parallelism (minimize bubbles)
    │
    ├─ Hardware selection?
    │   ├─ Memory-bound? -> more HBM, higher bandwidth
    │   ├─ Latency-bound? -> faster clocks + kernel support
    │   └─ Multi-node? -> prioritize interconnect (NVLink/RDMA) and topology
    │
    │   Notes: treat GPU/SKU advice as time-sensitive; verify with vendor docs and your own benchmarks.
    │   See references/gpu-optimization-checklists.md and references/infrastructure-tuning.md
    │
    └─ Edge deployment?
        └─ CPU + quantization -> llama.cpp/GGUF for constrained resources
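The "model tiering" branch of the tree can be sketched as a heuristic router. The tier names and thresholds below are illustrative placeholders; production routers usually score difficulty with a small classifier rather than keyword rules:

```python
# Heuristic model-tiering router sketch: send cheap traffic to a small
# model, escalate long or tool-using requests to larger tiers.
# Thresholds and tier names are hypothetical, not a recommendation.

def route(prompt: str, needs_tools: bool = False) -> str:
    tokens = len(prompt.split())          # crude proxy for prompt size
    if needs_tools or tokens > 2000:
        return "large"                    # agentic / long-context traffic
    if tokens > 300 or "explain" in prompt.lower():
        return "medium"                   # reasoning-flavored requests
    return "small"                        # cheap tier for short lookups

print(route("What is the capital of France?"))    # small
print(route("Explain KV cache paging in depth"))  # medium
```

A router like this pairs naturally with the quality-floor constraint from the intake checklist: validate each tier against your eval set before shifting traffic to it.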
## Intake Checklist (REQUIRED)

Before recommending changes, collect (or infer) these inputs:
- Model + variant (size, context length, precision/quantization, tokenizer)
- Traffic shape (prompt/output length distributions, concurrency, QPS, streaming vs non-streaming)
- SLOs and budgets (TTFT/ITL/total latency targets, error budget, cost per request)
- Serving stack (engine/version, batching/scheduling settings, caching, parallelism, autoscaling)
- Hardware and topology (GPU type/count, VRAM, NVLink/RDMA, CPU/RAM, storage, cluster/runtime)
- Constraints (quality floor, safety requirements, rollout/rollback constraints)
## Core Concepts & Practices

### Core Concepts (Vendor-Agnostic)

- **Latency components**: queueing + prefill + decode; optimize the largest contributor first.
- **Tail latency**: p99 is dominated by queuing and long prompts; fix with admission control and context budgets.
- **Retries**: retries can multiply load; bound retries and use hedged requests only with strict budgets.
- **Caching**: prefix caching helps repeated system/tool scaffolds; response caching helps repeated questions (requires invalidation).
- **Security & privacy**: prompts/outputs can contain sensitive data; scrub logs, enforce auth/tenancy, and rate-limit abuse (OWASP LLM Top 10: https://owasp.org/www-project-top-10-for-large-language-model-applications/).
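The retries concept above can be sketched as a wrapper with a hard attempt cap plus a total time budget, so retries cannot multiply load during an outage (a minimal sketch; the exception type and delays are illustrative):

```python
# Bounded retry sketch: hard attempt cap + total time budget + backoff.
# Unbounded retries turn a partial outage into a retry storm.
import time

class Overloaded(Exception):
    pass

def call_with_retries(fn, max_attempts=3, budget_s=2.0, base_delay=0.05):
    deadline = time.monotonic() + budget_s
    last_err = None
    for attempt in range(max_attempts):
        try:
            return fn()
        except Overloaded as err:
            last_err = err
            delay = base_delay * (2 ** attempt)        # exponential backoff
            if time.monotonic() + delay >= deadline:
                break                                  # budget exhausted
            time.sleep(delay)
    raise last_err

attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise Overloaded("server shedding load")
    return "ok"

print(call_with_retries(flaky))  # "ok" on the third attempt
```

Hedged requests follow the same shape but issue a duplicate request after a p95-ish delay; they must share the same hard budget or they amplify load exactly like unbounded retries.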
### Implementation Practices (Tooling Examples)

- **Measure under load**: benchmark TTFT/ITL and p95/p99 with realistic concurrency and prompt lengths.
- **Separate environments**: dev/stage/prod model configs; promote only after passing the inference review checklist.
- **Export telemetry**: request-level tokens, TTFT/ITL, queue depth, GPU memory headroom, and error classes (OpenTelemetry GenAI semantic conventions: https://opentelemetry.io/docs/specs/semconv/gen-ai/).
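The "measure under load" practice reduces to computing TTFT and inter-token latency (ITL) from per-token timestamps. A minimal sketch with stubbed timestamps (in practice, record wall-clock times from your streaming client under realistic concurrency, never from single requests):

```python
# TTFT/ITL percentile computation from per-token timestamps.
import statistics

def percentile(values, p):
    """Nearest-rank-style percentile over a small sample."""
    values = sorted(values)
    idx = min(len(values) - 1, round(p / 100 * (len(values) - 1)))
    return values[idx]

def analyze(request_start, token_times):
    ttft = token_times[0] - request_start                        # time to first token
    itl = [b - a for a, b in zip(token_times, token_times[1:])]  # inter-token latencies
    return ttft, itl

# Stubbed timestamps (seconds) for one streamed response.
ttft, itl = analyze(0.0, [0.45, 0.48, 0.52, 0.55, 0.60])
print(f"TTFT={ttft:.2f}s, mean ITL={statistics.mean(itl):.3f}s, "
      f"p95 ITL={percentile(itl, 95):.3f}s")
```

Aggregate these per-request numbers across the whole load test before taking p95/p99; queueing effects only show up when many requests contend for the same replicas.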
## Do / Avoid

- Do enforce `max_input_tokens` and `max_output_tokens` at the API boundary.
- Do cap concurrency and queue depth; return overload errors quickly.
- Do validate quality after any quantization or kernel change.
- Avoid unbounded retries (retry storms amplify outages).
- Avoid unbounded context windows (OOM + latency spikes).
- Avoid benchmarking on single requests; always test with realistic concurrency.
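The "cap concurrency and queue depth" rule can be sketched as an admission controller: bounded running set, bounded queue, immediate rejection when both are full (a minimal synchronous sketch; the class and limits are illustrative):

```python
# Admission-control sketch: bounded concurrency + bounded queue, fail fast.
# Shedding excess load with an explicit 429/503 keeps p99 stable under overload.
from collections import deque

class AdmissionController:
    def __init__(self, max_concurrent: int, max_queue: int):
        self.max_concurrent = max_concurrent
        self.max_queue = max_queue
        self.active = 0
        self.queue = deque()

    def submit(self, request_id: str) -> str:
        if self.active < self.max_concurrent:
            self.active += 1
            return "running"
        if len(self.queue) < self.max_queue:
            self.queue.append(request_id)
            return "queued"
        return "rejected"            # return 429/503 immediately, do not block

    def finish(self):
        self.active -= 1
        if self.queue:               # backfill a slot from the queue
            self.queue.popleft()
            self.active += 1

ac = AdmissionController(max_concurrent=2, max_queue=1)
print([ac.submit(f"r{i}") for i in range(4)])
# ['running', 'running', 'queued', 'rejected']
```

Rejecting fast is the point: a request that would have waited past its latency SLO anyway is cheaper for everyone as an immediate overload error.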
## Accuracy Protocol (REQUIRED)
- Treat performance ratios (for example, "2x faster") as hypotheses unless a source is cited and the workload is comparable.
- Do not recommend hardware/SKU changes without stating assumptions (model size, context length, concurrency, interconnect).
- Prefer a measured baseline + checklist-driven rollout over "best practice" claims.
## Resources (Detailed Operational Guides)

For comprehensive guides on specific topics, see:

### Infrastructure & Serving

- **Disaggregated Inference** - Prefill/decode separation (2025+ standard)
- **Infrastructure Tuning** - OS, container, Kubernetes optimization for GPU workloads
- **Serving Architectures** - Production serving stack patterns (vLLM, SGLang, TensorRT-LLM, NVIDIA Dynamo)
- **Resilience & HA Patterns** - Multi-region, failover, traffic management

### Performance Optimization

- **Quantization Patterns** - FP8/FP4/INT8/INT4 decision trees (FP8 first, INT8 not on Blackwell)
- **KV Cache Optimization** - PagedAttention, FlashAttention-3, FlashInfer, RadixAttention
- **Parallelism Patterns** - Tensor/pipeline/expert parallelism strategies
- **Optimization Strategies** - Throughput, cost, memory optimization
- **Batching & Scheduling** - Continuous batching and throughput patterns

### Deployment & Operations

- **Edge & CPU Optimization** - llama.cpp, GGUF, mobile/browser deployment
- **GPU Optimization Checklists** - Hardware-specific tuning
- **Speculative Decoding Guide** - Advanced generation acceleration
- **Profiling & Capacity Planning** - Benchmarking, SLOs, replica sizing

### Cost & Routing

- **Cost Optimization Patterns** - Token budgets, caching economics, model tiering, cost-per-outcome tracking
- **Multi-Model Routing** - Router architectures, quality-cost tradeoffs, cascading strategies, A/B routing
- **Streaming Patterns** - SSE/WebSocket serving, token-by-token delivery, backpressure, client integration
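The token-by-token delivery mentioned under Streaming Patterns can be sketched as a generator that emits Server-Sent Events frames (`data: <payload>\n\n` per the SSE format). The generator shape is what gives you backpressure: the framework pulls frames only as fast as the client reads them. The payload shape here is illustrative:

```python
# SSE framing sketch for token-by-token delivery.
import json

def sse_frames(tokens):
    for tok in tokens:
        # One SSE event per token; blank line terminates each event.
        yield f"data: {json.dumps({'token': tok})}\n\n"
    yield "data: [DONE]\n\n"          # end-of-stream sentinel (OpenAI-style)

frames = list(sse_frames(["Hel", "lo", "!"]))
print(frames[0], end="")   # data: {"token": "Hel"}
print(frames[-1], end="")  # data: [DONE]
```

In a real server this generator would be handed to the framework's streaming response (e.g. a streaming response type in your web framework) rather than materialized with `list`.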
## Templates

### Inference Configs

Production-ready configuration templates for leading inference engines:

- **vLLM Configuration** - Continuous batching, PagedAttention setup
- **TensorRT-LLM Configuration** - NVIDIA kernel optimizations
- **DeepSpeed Inference** - PyTorch-friendly inference

### Quantization & Compression

Model compression templates for reducing memory and cost:

- **GPTQ Quantization** - GPU post-training quantization
- **AWQ Quantization** - Activation-aware weight quantization
- **GGUF Format** - CPU/edge optimized formats

### Serving Pipelines

High-throughput serving architectures:

- **LLM API Server** - FastAPI + vLLM production setup
- **High-Throughput Setup** - Multi-replica scaling patterns

### Caching & Batching

Performance optimization templates:

- **Prefix Caching** - KV cache reuse strategies
- **Batching Configuration** - Continuous batching tuning
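As a companion to the caching templates: response caching at the API layer is the simplest win and can be sketched as an exact-match cache with TTL invalidation. This is distinct from engine-level prefix caching, which reuses KV blocks inside the serving engine. The class and TTL value below are illustrative:

```python
# Exact-match response cache with TTL invalidation (API-layer sketch).
import hashlib
import time

class ResponseCache:
    def __init__(self, ttl_s: float = 300.0):
        self.ttl_s = ttl_s
        self.store = {}   # key -> (expires_at, response)

    def _key(self, model: str, prompt: str, params: tuple) -> str:
        # Key on model + sampling params + prompt: a different model or
        # temperature must never serve another entry's cached response.
        raw = f"{model}|{params}|{prompt}".encode()
        return hashlib.sha256(raw).hexdigest()

    def get(self, model, prompt, params=()):
        entry = self.store.get(self._key(model, prompt, params))
        if entry is None:
            return None
        expires_at, response = entry
        if time.monotonic() > expires_at:   # TTL invalidation
            return None
        return response

    def put(self, model, prompt, response, params=()):
        key = self._key(model, prompt, params)
        self.store[key] = (time.monotonic() + self.ttl_s, response)

cache = ResponseCache(ttl_s=60)
cache.put("small", "What is 2+2?", "4")
print(cache.get("small", "What is 2+2?"))   # 4  (hit)
print(cache.get("small", "What is 3+3?"))   # None  (miss)
```

Only cache deterministic or temperature-0 traffic this way, and size the TTL against how stale an answer your product can tolerate.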
### Benchmarking

Performance measurement and validation:

- **Latency & Throughput Testing** - Load testing framework

### Checklists

- **Inference Performance Review Checklist** - Baseline, bottlenecks, rollout readiness
## Navigation

### Resources
- references/disaggregated-inference.md
- references/serving-architectures.md
- references/profiling-and-capacity-planning.md
- references/gpu-optimization-checklists.md
- references/speculative-decoding-guide.md
- references/resilience-ha-patterns.md
- references/optimization-strategies.md
- references/kv-cache-optimization.md
- references/batching-and-scheduling.md
- references/quantization-patterns.md
- references/parallelism-patterns.md
- references/edge-cpu-optimization.md
- references/infrastructure-tuning.md
- references/cost-optimization-patterns.md
- references/multi-model-routing.md
- references/streaming-patterns.md
### Templates
- assets/serving/template-llm-api.md
- assets/serving/template-high-throughput-setup.md
- assets/inference/template-vllm-config.md
- assets/inference/template-tensorrtllm-config.md
- assets/inference/template-deepspeed-inference.md
- assets/quantization/template-awq.md
- assets/quantization/template-gptq.md
- assets/quantization/template-gguf.md
- assets/batching/template-batching-config.md
- assets/caching/template-prefix-caching.md
- assets/benchmarking/template-latency-throughput-test.md
- assets/checklists/inference-review-checklist.md
### Data

- data/sources.json - Curated external references
## Trend Awareness Protocol

**IMPORTANT**: When users ask recommendation questions about LLM inference, you MUST use WebSearch to check current trends before answering.

### Trigger Conditions

- "What's the best inference engine for [use case]?"
- "What should I use for [serving/quantization/batching]?"
- "What's the latest in LLM inference optimization?"
- "Current best practices for [vLLM/TensorRT/quantization]?"
- "Is [inference tool] still relevant in 2026?"
- "[vLLM] vs [TensorRT-LLM] vs [SGLang]?"
- "Best quantization method for [model size]?"
- "What GPU should I use for inference?"

### Required Searches

- Search: "LLM inference optimization best practices 2026"
- Search: "[vLLM/TensorRT-LLM/SGLang] comparison 2026"
- Search: "LLM quantization trends January 2026"
- Search: "LLM serving new releases 2026"

### What to Report

After searching, provide:

- **Current landscape**: which serving engines are popular NOW (not 6 months ago)
- **Emerging trends**: new inference optimizations gaining traction
- **Deprecated/declining**: techniques or tools losing relevance
- **Recommendation**: based on fresh data, not just static knowledge
### Example Topics (verify with fresh search)

- Inference engines (vLLM 0.7+, TensorRT-LLM, SGLang, llama.cpp)
- Quantization methods (FP8, AWQ, GPTQ, GGUF, bitsandbytes)
- Attention kernels (FlashAttention-3, FlashInfer, xFormers)
- Speculative decoding advances
- KV cache optimization techniques
- New GPU architectures (H200, Blackwell) and their optimizations