mamba-architecture

Installs: 158
Rank: #5480

Install

npx skills add https://github.com/davila7/claude-code-templates --skill mamba-architecture

Mamba - Selective State Space Models

Quick start

Mamba is a selective state-space model (SSM) architecture that achieves O(n) linear complexity in sequence length, versus the quadratic cost of Transformer attention.

Installation:

Install causal-conv1d (optional, for efficiency)

pip install "causal-conv1d>=1.4.0"

Install Mamba

pip install mamba-ssm

Or both together

pip install "mamba-ssm[causal-conv1d]"

Prerequisites: Linux, NVIDIA GPU, PyTorch 1.12+, CUDA 11.6+
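
A quick environment sanity check before installing (a minimal sketch against the prerequisites above):

import torch

# Check the prerequisites listed above: NVIDIA GPU, PyTorch 1.12+, CUDA 11.6+.
print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("CUDA (compiled against):", torch.version.cuda)
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))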

Basic usage (Mamba block):

import torch
from mamba_ssm import Mamba

batch, length, dim = 2, 64, 16
x = torch.randn(batch, length, dim).to("cuda")

model = Mamba(
    d_model=dim,  # Model dimension
    d_state=16,   # SSM state dimension
    d_conv=4,     # Conv1d kernel size
    expand=2,     # Expansion factor
).to("cuda")

y = model(x)  # O(n) complexity
assert y.shape == x.shape

Common workflows

Workflow 1: Language model with Mamba-2

Complete LM with generation:

import torch
from mamba_ssm.models.config_mamba import MambaConfig
from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel

Configure Mamba-2 LM

config = MambaConfig(
    d_model=1024,        # Hidden dimension
    n_layer=24,          # Number of layers
    vocab_size=50277,    # Vocabulary size
    ssm_cfg=dict(
        layer="Mamba2",  # Use Mamba-2 blocks
        d_state=128,     # Larger state for Mamba-2
        headdim=64,      # Head dimension
        ngroups=1,       # Number of groups
    ),
)

model = MambaLMHeadModel(config, device="cuda", dtype=torch.float16)

Generate text

input_ids = torch.randint(0, 1000, (1, 20), device="cuda", dtype=torch.long)
output = model.generate(
    input_ids=input_ids,
    max_length=100,
    temperature=0.7,
    top_p=0.9,
)

Workflow 2: Use pretrained Mamba models

Load from HuggingFace:

from transformers import AutoTokenizer
from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel

Load pretrained model

model_name = "state-spaces/mamba-2.8b"
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")  # Use compatible tokenizer
model = MambaLMHeadModel.from_pretrained(model_name, device="cuda", dtype=torch.float16)

Generate

prompt = "The future of AI is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")
output_ids = model.generate(
    input_ids=input_ids,
    max_length=200,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.2,
)
generated_text = tokenizer.decode(output_ids[0])
print(generated_text)

Available models:

state-spaces/mamba-130m
state-spaces/mamba-370m
state-spaces/mamba-790m
state-spaces/mamba-1.4b
state-spaces/mamba-2.8b

Workflow 3: Mamba-1 vs Mamba-2

Mamba-1 (smaller state):

from mamba_ssm import Mamba

model = Mamba(
    d_model=256,
    d_state=16,   # Smaller state dimension
    d_conv=4,
    expand=2,
).to("cuda")

Mamba-2 (multi-head, larger state):

from mamba_ssm import Mamba2

model = Mamba2(
    d_model=256,
    d_state=128,  # Larger state dimension
    d_conv=4,
    expand=2,
    headdim=64,   # Head dimension for multi-head structure
    ngroups=1,    # Parallel groups
).to("cuda")

Key differences:

State size: Mamba-1 (d_state=16) vs Mamba-2 (d_state=128)
Architecture: Mamba-2 has a multi-head structure
Normalization: Mamba-2 uses RMSNorm
Distributed: Mamba-2 supports tensor parallelism
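
A side-by-side sketch of the two blocks (assumes a CUDA machine; the hyperparameters mirror the two examples above, and the sequence length is kept at a multiple of Mamba-2's default chunk size of 256):

import torch
from mamba_ssm import Mamba, Mamba2

# Same d_model, different state sizes (values from the examples above).
m1 = Mamba(d_model=256, d_state=16, d_conv=4, expand=2).to("cuda")
m2 = Mamba2(d_model=256, d_state=128, d_conv=4, expand=2, headdim=64, ngroups=1).to("cuda")

x = torch.randn(2, 256, 256, device="cuda")  # (batch, length, d_model)

# Both blocks preserve the input shape; exact parameter counts vary by release.
print("Mamba-1 params:", sum(p.numel() for p in m1.parameters()))
print("Mamba-2 params:", sum(p.numel() for p in m2.parameters()))
print(m1(x).shape, m2(x).shape)  # torch.Size([2, 256, 256]) for both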

Workflow 4: Benchmark vs Transformers

Generation speed comparison (run from a clone of https://github.com/state-spaces/mamba, which contains the benchmarks/ scripts):

Benchmark Mamba

python benchmarks/benchmark_generation_mamba_simple.py \
    --model-name "state-spaces/mamba-2.8b" \
    --prompt "The future of machine learning is" \
    --topp 0.9 --temperature 0.7 --repetition-penalty 1.2

Benchmark Transformer

python benchmarks/benchmark_generation_mamba_simple.py \
    --model-name "EleutherAI/pythia-2.8b" \
    --prompt "The future of machine learning is" \
    --topp 0.9 --temperature 0.7 --repetition-penalty 1.2

Expected results:

Mamba: 5× faster inference
Memory: no KV cache needed
Scaling: linear with sequence length (see the timing sketch below)
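
To see the scaling behaviour yourself, a rough micro-benchmark sketch (absolute numbers depend on the GPU and kernel versions; forward time should grow roughly linearly with sequence length):

import time
import torch
from mamba_ssm import Mamba

model = Mamba(d_model=256, d_state=16, d_conv=4, expand=2).to("cuda")

def timed_forward(seqlen: int) -> float:
    # One forward pass over random data; synchronize so the timing is honest.
    x = torch.randn(1, seqlen, 256, device="cuda")
    torch.cuda.synchronize()
    start = time.time()
    model(x)
    torch.cuda.synchronize()
    return time.time() - start

timed_forward(1024)  # warm-up
for seqlen in (1024, 2048, 4096, 8192):
    print(f"{seqlen:5d} tokens: {timed_forward(seqlen) * 1000:.1f} ms")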

When to use vs alternatives

Use Mamba when:

Need long sequences (100K+ tokens)
Want faster inference than Transformers
Memory-constrained (no KV cache)
Building streaming applications
Linear scaling is important

Advantages:

O(n) complexity: linear vs quadratic in sequence length
5× faster inference: no attention overhead
No KV cache: lower memory usage
Million-token sequences: hardware-efficient design
Streaming: constant memory per generated token (see the sketch below)
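
To illustrate the constant-memory streaming point, here is a hedged sketch of token-by-token inference with a single Mamba block; it assumes the InferenceParams helper in mamba_ssm.utils.generation and the block's layer_idx argument:

import torch
from mamba_ssm import Mamba
from mamba_ssm.utils.generation import InferenceParams

dim = 16
# layer_idx tells the block where to keep its state in the inference cache.
model = Mamba(d_model=dim, d_state=16, d_conv=4, expand=2, layer_idx=0).to("cuda")

# Holds the recurrent conv + SSM state; its size does not grow with the sequence.
params = InferenceParams(max_seqlen=1024, max_batch_size=1)

# Prefill: process the prompt in one parallel pass (seqlen_offset == 0).
prompt = torch.randn(1, 8, dim, device="cuda")
y = model(prompt, inference_params=params)
params.seqlen_offset += prompt.shape[1]

# Streaming: feed one token at a time; the cached state is updated in place.
for _ in range(5):
    tok = torch.randn(1, 1, dim, device="cuda")
    y = model(tok, inference_params=params)
    params.seqlen_offset += 1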

Use alternatives instead:

Transformers: need best-in-class performance and have the compute
RWKV: want an RNN/Transformer hybrid
RetNet: need a retention-based architecture
Hyena: want a convolution-based approach

Common issues

Issue: CUDA out of memory

Reduce batch size or use gradient checkpointing:

model = MambaLMHeadModel(config, device="cuda", dtype=torch.float16)
model.gradient_checkpointing_enable()  # Enable checkpointing

Issue: Slow installation

Pass --no-build-isolation so pip builds against the already-installed PyTorch instead of an isolated environment:

pip install mamba-ssm --no-build-isolation

Issue: Missing causal-conv1d

Install separately:

pip install "causal-conv1d>=1.4.0"

Issue: Model not loading from HuggingFace

Use MambaLMHeadModel.from_pretrained (not AutoModel):

from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel

model = MambaLMHeadModel.from_pretrained("state-spaces/mamba-2.8b")

Advanced topics

Selective SSM: See references/selective-ssm.md for mathematical formulation, state-space equations, and how selectivity enables O(n) complexity.
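
For orientation, the discretized recurrence that file expands on (the standard formulation from the Mamba paper, written in LaTeX notation):

h_t = \bar{A}\, h_{t-1} + \bar{B}\, x_t, \qquad y_t = C\, h_t
\bar{A} = \exp(\Delta A), \qquad \bar{B} = (\Delta A)^{-1} \left( \exp(\Delta A) - I \right) \Delta B

Selectivity means \Delta, B and C become functions of the input token x_t (i.e. \Delta_t, B_t, C_t), so the model can decide per token what to write into and read out of its state while keeping the O(n) recurrence.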

Mamba-2 architecture: See references/mamba2-details.md for multi-head structure, tensor parallelism, and distributed training setup.

Performance optimization: See references/performance.md for hardware-aware design, CUDA kernels, and memory efficiency techniques.

Hardware requirements

GPU: NVIDIA with CUDA 11.6+
VRAM (approximate):
    130M model: 2GB
    370M model: 4GB
    790M model: 8GB
    1.4B model: 14GB
    2.8B model: 28GB (FP16)
Inference: 5× faster than Transformers
Memory: no KV cache (lower than Transformers)

Performance (vs Transformers):

Speed: 5× faster inference
Memory: 50% less (no KV cache)
Scaling: linear vs quadratic

Resources

Paper (Mamba-1): https://arxiv.org/abs/2312.00752 (Dec 2023)
Paper (Mamba-2): https://arxiv.org/abs/2405.21060 (May 2024)
GitHub: https://github.com/state-spaces/mamba (13,000+ stars)
Models: https://huggingface.co/state-spaces
Docs: repository README and wiki
