# tensorrt-llm


## Install

```bash
npx skills add https://github.com/davila7/claude-code-templates --skill tensorrt-llm
```

# TensorRT-LLM

NVIDIA's open-source library for optimizing LLM inference with state-of-the-art performance on NVIDIA GPUs.

## When to use TensorRT-LLM

Use TensorRT-LLM when:

- Deploying on NVIDIA GPUs (A100, H100, GB200)
- Need maximum throughput (24,000+ tokens/sec on Llama 3)
- Require low latency for real-time applications
- Working with quantized models (FP8, INT4, FP4)
- Scaling across multiple GPUs or nodes

Use vLLM instead when:

- Need simpler setup and a Python-first API
- Want PagedAttention without TensorRT compilation
- Working with AMD GPUs or non-NVIDIA hardware

Use llama.cpp instead when:

- Deploying on CPU or Apple Silicon
- Need edge deployment without NVIDIA GPUs
- Want the simpler GGUF quantization format

## Quick start

### Installation

Docker (recommended)

```bash
docker pull nvidia/tensorrt_llm:latest
```

pip install

```bash
pip install tensorrt_llm==1.2.0rc3
```

Requires CUDA 13.0.0, TensorRT 10.13.2, and Python 3.10-3.12.
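
As a quick sanity check after installing, a short script like this (a minimal sketch; it assumes PyTorch is present, which is pulled in as a TensorRT-LLM dependency) confirms that the package imports and that a CUDA device is visible:

```python
# Sanity-check the installation: import the package and look for a GPU.
# Assumes PyTorch is installed alongside TensorRT-LLM.
import torch
import tensorrt_llm

print("TensorRT-LLM version:", tensorrt_llm.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```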

### Basic inference

```python
from tensorrt_llm import LLM, SamplingParams

# Initialize the model
llm = LLM(model="meta-llama/Meta-Llama-3-8B")

# Configure sampling
sampling_params = SamplingParams(
    max_tokens=100,
    temperature=0.7,
    top_p=0.9,
)

# Generate
prompts = ["Explain quantum computing"]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
```

### Serving with trtllm-serve

```bash
# Start the server (automatic model download and compilation)
# --tp_size 4: tensor parallelism across 4 GPUs
trtllm-serve meta-llama/Meta-Llama-3-8B \
  --tp_size 4 \
  --max_batch_size 256 \
  --max_num_tokens 4096
```

```bash
# Client request
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Meta-Llama-3-8B",
        "messages": [{"role": "user", "content": "Hello!"}],
        "temperature": 0.7,
        "max_tokens": 100
      }'
```
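
Because the server speaks the OpenAI-compatible /v1/chat/completions protocol shown above, the same request can be issued from Python with the `openai` client. This is a minimal sketch assuming the server is running locally on port 8000 with no API key enforced:

```python
# Equivalent of the curl request above, using the `openai` Python client.
# Assumes trtllm-serve is listening on localhost:8000 with no auth.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B",
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7,
    max_tokens=100,
)
print(response.choices[0].message.content)
```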

## Key features

### Performance optimizations

- In-flight batching: dynamic batching during generation
- Paged KV cache: efficient memory management
- Flash Attention: optimized attention kernels
- Quantization: FP8, INT4, FP4 for 2-4× faster inference
- CUDA graphs: reduced kernel launch overhead

### Parallelism

- Tensor parallelism (TP): split the model across GPUs
- Pipeline parallelism (PP): layer-wise distribution
- Expert parallelism: for Mixture-of-Experts models
- Multi-node: scale beyond a single machine

### Advanced features

- Speculative decoding: faster generation with draft models
- LoRA serving: efficient multi-adapter deployment
- Disaggregated serving: separate prefill and generation

## Common patterns

### Quantized model (FP8)

```python
from tensorrt_llm import LLM

# Load an FP8-quantized model (2× faster, 50% of the memory)
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B",
    dtype="fp8",
    max_num_tokens=8192,
)

# Inference is the same as before
outputs = llm.generate(["Summarize this article..."])
```

### Multi-GPU deployment

```python
# Tensor parallelism across 8 GPUs
llm = LLM(
    model="meta-llama/Meta-Llama-3-405B",
    tensor_parallel_size=8,
    dtype="fp8",
)
```
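
A rough back-of-envelope estimate (an illustrative assumption, ignoring KV cache, activations, and runtime overhead) shows why the 405B model is sharded this way: FP8 stores one byte per parameter, so the weights alone are on the order of 405 GB, or roughly 50 GB per GPU at tensor_parallel_size=8.

```python
# Back-of-envelope weight-memory estimate for the example above.
# Illustrative only: ignores KV cache, activations, and runtime overhead.
params = 405e9           # parameters in Llama 3 405B
bytes_per_param = 1      # FP8 uses 1 byte per weight
num_gpus = 8             # tensor_parallel_size above

total_gb = params * bytes_per_param / 1e9
print(f"~{total_gb:.0f} GB of weights, ~{total_gb / num_gpus:.0f} GB per GPU")
```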

### Batch inference

```python
# Process 100 prompts efficiently
prompts = [f"Question {i}: ..." for i in range(100)]

outputs = llm.generate(
    prompts,
    sampling_params=SamplingParams(max_tokens=200),
)

# Automatic in-flight batching for maximum throughput
```
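
Outputs come back aligned with the prompts, so pairing them up is straightforward; a small sketch (assuming outputs preserve input order and the RequestOutput layout used in the basic-inference example above):

```python
# Match each prompt with its generated text.
# Assumes outputs preserve input order and the RequestOutput shape shown earlier.
for prompt, output in zip(prompts, outputs):
    print(f"{prompt!r} -> {output.outputs[0].text[:80]!r}")
```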

## Performance benchmarks

Meta Llama 3-8B (H100 GPU):

- Throughput: 24,000 tokens/sec
- Latency: ~10 ms per token
- vs PyTorch: 100× faster
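
Read together, these numbers imply heavy concurrency rather than a single 24,000 tokens/sec stream. The small calculation below is an illustrative assumption (the benchmark conditions are not specified here), not part of the published figures:

```python
# Illustrative reading of the Llama 3-8B numbers above.
throughput_tok_s = 24_000       # aggregate tokens/sec
per_token_latency_s = 0.010     # ~10 ms per token for a single stream
per_stream_tok_s = 1 / per_token_latency_s
print("Implied concurrent streams:", throughput_tok_s / per_stream_tok_s)  # ~240
```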

Llama 3-70B (8× A100 80GB):

- FP8 quantization: 2× faster than FP16
- Memory: 50% reduction with FP8

## Supported models

- LLaMA family: Llama 2, Llama 3, CodeLlama
- GPT family: GPT-2, GPT-J, GPT-NeoX
- Qwen: Qwen, Qwen2, QwQ
- DeepSeek: DeepSeek-V2, DeepSeek-V3
- Mixtral: Mixtral-8x7B, Mixtral-8x22B
- Vision: LLaVA, Phi-3-vision
- 100+ models on HuggingFace

## References

- Optimization Guide - quantization, batching, KV cache tuning
- Multi-GPU Setup - tensor/pipeline parallelism, multi-node
- Serving Guide - production deployment, monitoring, autoscaling

## Resources

- Docs: https://nvidia.github.io/TensorRT-LLM/
- GitHub: https://github.com/NVIDIA/TensorRT-LLM
- Models: https://huggingface.co/models?library=tensorrt_llm
