Segment Anything Model (SAM)

Comprehensive guide to using Meta AI's Segment Anything Model for zero-shot image segmentation.

When to use SAM

Use SAM when:

- You need to segment arbitrary objects in images without task-specific training
- You are building interactive annotation tools with point or box prompts
- You are generating training data for other vision models
- You need zero-shot transfer to new image domains
- You are building object detection or segmentation pipelines
- You are processing medical, satellite, or other domain-specific images

Key features:

- Zero-shot segmentation: works on any image domain without fine-tuning
- Flexible prompts: points, bounding boxes, or previous masks
- Automatic segmentation: generates masks for all objects in an image
- High quality: trained on 1.1 billion masks from 11 million images
- Multiple model sizes: ViT-B (fastest), ViT-L, ViT-H (most accurate)
- ONNX export: deploy in browsers and on edge devices

Use alternatives instead:

- YOLO/Detectron2: for real-time object detection with classes
- Mask2Former: for semantic/panoptic segmentation with categories
- GroundingDINO + SAM: for text-prompted segmentation
- SAM 2: for video segmentation tasks

Quick start

Installation

From GitHub

pip install git+https://github.com/facebookresearch/segment-anything.git

Optional dependencies

pip install opencv-python pycocotools matplotlib

Or use HuggingFace transformers

pip install transformers

Download checkpoints

ViT-H (largest, most accurate) - 2.4GB

wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth

ViT-L (medium) - 1.2GB

wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_l_0b3195.pth

ViT-B (smallest, fastest) - 375MB

wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_b_01ec64.pth

Basic usage with SamPredictor

import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

Load model

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
sam.to(device="cuda")

Create predictor

predictor = SamPredictor(sam)

Set image (computes embeddings once)

image = cv2.imread("image.jpg")
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
predictor.set_image(image)

Predict with point prompts

input_point = np.array([[500, 375]])  # (x, y) coordinates
input_label = np.array([1])  # 1 = foreground, 0 = background

masks, scores, logits = predictor.predict(
    point_coords=input_point,
    point_labels=input_label,
    multimask_output=True,  # Returns 3 mask options
)

Select best mask

best_mask = masks[np.argmax(scores)]
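To inspect the selected mask, it helps to blend it over the image. The following is a minimal sketch, assuming the mask is an H×W boolean array and the image is H×W×3 uint8 RGB; `overlay_mask` is a hypothetical helper, not part of segment_anything, and the arrays below are synthetic stand-ins for `image` and `best_mask`.

```python
import numpy as np

def overlay_mask(image, mask, color=(30, 144, 255), alpha=0.5):
    """Blend `color` into `image` wherever `mask` is True.

    image: H x W x 3 uint8 RGB array; mask: H x W boolean array.
    Returns a new uint8 array; the input image is left untouched.
    """
    out = image.astype(np.float32)
    mask = mask.astype(bool)
    tint = np.asarray(color, dtype=np.float32)
    out[mask] = (1.0 - alpha) * out[mask] + alpha * tint
    return out.round().astype(np.uint8)

# Synthetic stand-ins for `image` and `best_mask` (hypothetical data)
demo_image = np.zeros((4, 4, 3), dtype=np.uint8)
demo_mask = np.zeros((4, 4), dtype=bool)
demo_mask[1:3, 1:3] = True
blended = overlay_mask(demo_image, demo_mask, color=(255, 0, 0), alpha=0.4)
```

With real data the call would look like `overlay_mask(image, best_mask)`, and the result can be shown with matplotlib or written with OpenCV after converting back to BGR.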

HuggingFace Transformers

import torch
from PIL import Image
from transformers import SamModel, SamProcessor

Load model and processor

model = SamModel.from_pretrained("facebook/sam-vit-huge")
processor = SamProcessor.from_pretrained("facebook/sam-vit-huge")
model.to("cuda")

Process image with point prompt

image = Image.open("image.jpg").convert("RGB")
input_points = [[[450, 600]]]  # Batch of points

inputs = processor(image, input_points=input_points, return_tensors="pt")
inputs = {k: v.to("cuda") for k, v in inputs.items()}

Generate masks

with torch.no_grad():
    outputs = model(**inputs)

Post-process masks to original size

masks = processor.image_processor.post_process_masks(
    outputs.pred_masks.cpu(),
    inputs["original_sizes"].cpu(),
    inputs["reshaped_input_sizes"].cpu(),
)
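The original and reshaped sizes matter because SAM resizes each image so that its longest side is 1024 pixels before encoding, and prompt coordinates are scaled by the same factors. The sketch below illustrates that arithmetic, assuming it mirrors the library's longest-side resize; `preprocess_shape` and `scale_point` are hypothetical names used only for illustration.

```python
def preprocess_shape(old_h, old_w, long_side=1024):
    """Return (height, width) after scaling the longest side to `long_side`."""
    scale = long_side / max(old_h, old_w)
    return int(old_h * scale + 0.5), int(old_w * scale + 0.5)

def scale_point(x, y, old_h, old_w, long_side=1024):
    """Map an (x, y) prompt from original pixels into the resized frame."""
    new_h, new_w = preprocess_shape(old_h, old_w, long_side)
    return x * new_w / old_w, y * new_h / old_h

# A 1920x1080 frame and a click at its centre
resized = preprocess_shape(1080, 1920)
scaled = scale_point(960, 540, 1080, 1920)
```

Both SamPredictor and SamProcessor apply this mapping internally, so prompts are normally given in original image coordinates; the arithmetic only needs to be done by hand when calling an exported model directly.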

Core concepts

Model architecture

SAM Architecture:
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  Image Encoder  │────▶│ Prompt Encoder  │────▶│  Mask Decoder   │
│      (ViT)      │     │ (Points/Boxes)  │     │  (Transformer)  │
└─────────────────┘     └─────────────────┘     └─────────────────┘
         │                       │                       │
  Image Embeddings       Prompt Embeddings          Masks + IoU
   (computed once)          (per prompt)            predictions

Model variants

| Model | Checkpoint | Size | Speed | Accuracy |
| --- | --- | --- | --- | --- |
| ViT-H | vit_h | 2.4 GB | Slowest | Best |
| ViT-L | vit_l | 1.2 GB | Medium | Good |
| ViT-B | vit_b | 375 MB | Fastest | Good |

Prompt types

| Prompt | Description | Use Case |
| --- | --- | --- |
| Point (foreground) | Click on object | Single object selection |
| Point (background) | Click outside object | Exclude regions |
| Bounding box | Rectangle around object | Larger objects |
| Previous mask | Low-res mask input | Iterative refinement |

Interactive segmentation

Point prompts

Single foreground point

input_point = np.array([[500, 375]])
input_label = np.array([1])

masks, scores, logits = predictor.predict(
    point_coords=input_point,
    point_labels=input_label,
    multimask_output=True,
)

Multiple points (foreground + background)

input_points = np.array([[500, 375], [600, 400], [450, 300]])
input_labels = np.array([1, 1, 0])  # 2 foreground, 1 background

masks, scores, logits = predictor.predict(
    point_coords=input_points,
    point_labels=input_labels,
    multimask_output=False,  # Single mask when prompts are clear
)

Box prompts

Bounding box [x1, y1, x2, y2]

input_box = np.array([425, 600, 700, 875])

masks, scores, logits = predictor.predict(
    box=input_box,
    multimask_output=False,
)
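Box prompts use corner coordinates [x1, y1, x2, y2], whereas the automatic mask generator reports bbox as [x, y, w, h]. When feeding one into the other, a conversion is needed. A minimal sketch follows; `xywh_to_xyxy` is a hypothetical helper name.

```python
def xywh_to_xyxy(bbox):
    """Convert [x, y, width, height] to [x1, y1, x2, y2]."""
    x, y, w, h = bbox
    return [x, y, x + w, y + h]

# A 275x275 box whose top-left corner is at (425, 600)
box_xyxy = xywh_to_xyxy([425, 600, 275, 275])
```

The result can be wrapped in `np.array(...)` and passed as the `box` argument.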

Combined prompts

Box + points for precise control

masks, scores, logits = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    box=np.array([400, 300, 700, 600]),
    multimask_output=False,
)

Iterative refinement

Initial prediction

masks, scores, logits = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True,
)

Refine with additional point using previous mask

masks, scores, logits = predictor.predict(
    point_coords=np.array([[500, 375], [550, 400]]),
    point_labels=np.array([1, 0]),  # Add background point
    mask_input=logits[np.argmax(scores)][None, :, :],  # Use best mask
    multimask_output=False,
)
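The indexing in `mask_input` is easy to get wrong, so here is a small sketch of the shapes involved. It assumes the returned logits are low-resolution 256×256 maps, one per candidate mask, and that thresholding them at 0 yields a coarse binary mask; `best_mask_input` is a hypothetical helper and the arrays are random stand-ins for real predictor output.

```python
import numpy as np

def best_mask_input(logits, scores):
    """Select the highest-scoring low-res logits and add a leading axis.

    logits: C x 256 x 256 float array; scores: length-C array.
    Returns an array of shape 1 x 256 x 256 suitable for `mask_input`.
    """
    best = int(np.argmax(scores))
    return logits[best][None, :, :]

# Random stand-ins for the `logits` and `scores` returned by predict()
rng = np.random.default_rng(0)
demo_logits = rng.normal(size=(3, 256, 256)).astype(np.float32)
demo_scores = np.array([0.81, 0.97, 0.90])

mask_input = best_mask_input(demo_logits, demo_scores)
coarse_mask = mask_input[0] > 0.0  # coarse binary mask at 256 x 256
```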

Automatic mask generation

Basic automatic segmentation

from segment_anything import SamAutomaticMaskGenerator

Create generator

mask_generator = SamAutomaticMaskGenerator(sam)

Generate all masks

masks = mask_generator.generate(image)

Each mask contains:

- segmentation: binary mask

- bbox: [x, y, w, h]

- area: pixel count

- predicted_iou: quality score

- stability_score: robustness score

- point_coords: generating point

Customized generation

mask_generator = SamAutomaticMaskGenerator(
    model=sam,
    points_per_side=32,  # Grid density (more = more masks)
    pred_iou_thresh=0.88,  # Quality threshold
    stability_score_thresh=0.95,  # Stability threshold
    crop_n_layers=1,  # Multi-scale crops
    crop_n_points_downscale_factor=2,
    min_mask_region_area=100,  # Remove tiny masks
)

masks = mask_generator.generate(image)
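The cost of automatic generation is driven by points_per_side, because the generator prompts the model with a regular grid of points. The sketch below builds such a grid in normalized [0, 1] coordinates, assuming evenly spaced points centred in their grid cells; `build_point_grid` here is an illustrative reimplementation, not an import from the library.

```python
import numpy as np

def build_point_grid(n_per_side):
    """Return an (n_per_side**2, 2) array of (x, y) points in [0, 1]."""
    offset = 1.0 / (2 * n_per_side)  # half a cell, so points sit at cell centres
    side = np.linspace(offset, 1.0 - offset, n_per_side)
    xs, ys = np.meshgrid(side, side)
    return np.stack([xs.ravel(), ys.ravel()], axis=-1)

grid_32 = build_point_grid(32)  # 32 x 32 prompts
grid_16 = build_point_grid(16)  # 16 x 16 prompts
```

Halving points_per_side cuts the number of point prompts by a factor of four, which is why it is the first setting to lower when generation is too slow.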

Filtering masks

Sort by area (largest first)

masks = sorted(masks, key=lambda x: x['area'], reverse=True)

Filter by predicted IoU

high_quality = [m for m in masks if m['predicted_iou'] > 0.9]

Filter by stability score

stable_masks = [m for m in masks if m['stability_score'] > 0.95]
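Filtered masks often overlap, and downstream tools usually want one label per pixel. One possible approach, sketched below, paints masks from largest to smallest so that small objects stay visible on top of large ones. `masks_to_label_map` is a hypothetical helper, and the two demo records only imitate the `segmentation` and `area` fields of real output.

```python
import numpy as np

def masks_to_label_map(masks):
    """Merge mask records into one H x W int32 label map (0 = background)."""
    if not masks:
        raise ValueError("masks must contain at least one record")
    h, w = masks[0]["segmentation"].shape
    label_map = np.zeros((h, w), dtype=np.int32)
    # Largest first, so later (smaller) masks overwrite earlier ones
    ordered = sorted(masks, key=lambda m: m["area"], reverse=True)
    for label, record in enumerate(ordered, start=1):
        label_map[record["segmentation"].astype(bool)] = label
    return label_map

# Two synthetic records: a large region and a small one inside it
big = np.zeros((4, 4), dtype=bool)
big[0:3, :] = True
small = np.zeros((4, 4), dtype=bool)
small[1:3, 1:3] = True
demo_masks = [
    {"segmentation": small, "area": int(small.sum())},
    {"segmentation": big, "area": int(big.sum())},
]
label_map = masks_to_label_map(demo_masks)
```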

Batched inference

Multiple images

Process multiple images (each image needs its own set_image call)

images = [
    cv2.cvtColor(cv2.imread(f"image_{i}.jpg"), cv2.COLOR_BGR2RGB)
    for i in range(10)
]

all_masks = []
for image in images:
    predictor.set_image(image)
    masks, _, _ = predictor.predict(
        point_coords=np.array([[500, 375]]),
        point_labels=np.array([1]),
        multimask_output=True,
    )
    all_masks.append(masks)

Multiple prompts per image

Process multiple prompts efficiently (one image encoding)

predictor.set_image(image)

Batch of point prompts

points = [
    np.array([[100, 100]]),
    np.array([[200, 200]]),
    np.array([[300, 300]]),
]

all_masks = []
for point in points:
    masks, scores, _ = predictor.predict(
        point_coords=point,
        point_labels=np.array([1]),
        multimask_output=True,
    )
    all_masks.append(masks[np.argmax(scores)])

ONNX deployment

Export model

python scripts/export_onnx_model.py \
    --checkpoint sam_vit_h_4b8939.pth \
    --model-type vit_h \
    --output sam_onnx.onnx \
    --return-single-mask

Use ONNX model

import onnxruntime

Load ONNX model

ort_session = onnxruntime.InferenceSession("sam_onnx.onnx")

Run inference (image embeddings computed separately)

masks, iou_predictions, low_res_logits = ort_session.run(
    None,
    {
        "image_embeddings": image_embeddings,
        "point_coords": point_coords,
        "point_labels": point_labels,
        "mask_input": np.zeros((1, 1, 256, 256), dtype=np.float32),
        "has_mask_input": np.array([0], dtype=np.float32),
        "orig_im_size": np.array([h, w], dtype=np.float32),  # Original image height, width
    },
)
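The point_coords and point_labels inputs above need some preparation. As far as the upstream ONNX example goes, coordinates are expected in the resized frame (longest side 1024), as float32 with a leading batch axis, and when no box is supplied a padding point with label -1 is appended. The sketch below follows that convention as an assumption; `prepare_onnx_points` is a hypothetical helper.

```python
import numpy as np

def prepare_onnx_points(points, labels, orig_hw, long_side=1024):
    """Build batched float32 point inputs for an exported SAM decoder.

    points: list of (x, y) in original pixels; labels: list of 1/0.
    orig_hw: (height, width) of the original image.
    """
    old_h, old_w = orig_hw
    scale = long_side / max(old_h, old_w)
    new_h, new_w = int(old_h * scale + 0.5), int(old_w * scale + 0.5)

    # Append the padding point (0, 0) with label -1 (assumed convention)
    coords = np.concatenate(
        [np.asarray(points, dtype=np.float32), np.zeros((1, 2), dtype=np.float32)],
        axis=0,
    )
    lbls = np.concatenate(
        [np.asarray(labels, dtype=np.float32), np.array([-1], dtype=np.float32)]
    )

    # Scale from original pixels into the resized frame
    coords[:, 0] *= new_w / old_w
    coords[:, 1] *= new_h / old_h
    return coords[None, :, :], lbls[None, :]

onnx_coords, onnx_labels = prepare_onnx_points([[960, 540]], [1], (1080, 1920))
```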

Common workflows

Workflow 1: Annotation tool

import cv2

Create predictor and set image

predictor = SamPredictor(sam)
predictor.set_image(image)

def on_click(event, x, y, flags, param):
    if event == cv2.EVENT_LBUTTONDOWN:
        # Foreground point
        masks, scores, _ = predictor.predict(
            point_coords=np.array([[x, y]]),
            point_labels=np.array([1]),
            multimask_output=True,
        )
        # Display best mask (display_mask is a user-supplied drawing function)
        display_mask(masks[np.argmax(scores)])
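The callback above handles a single click. A real annotation tool usually accumulates foreground and background clicks and re-runs prediction with all of them. The following is a minimal sketch of that state, assuming left clicks mean foreground and right clicks mean background; `ClickPrompts` is a hypothetical class, not part of any library.

```python
import numpy as np

class ClickPrompts:
    """Accumulates click prompts and exposes them as predictor inputs."""

    def __init__(self):
        self._points = []
        self._labels = []

    def add(self, x, y, foreground=True):
        """Record a click; label 1 = foreground, 0 = background."""
        self._points.append((x, y))
        self._labels.append(1 if foreground else 0)

    def undo(self):
        """Drop the most recent click, if any."""
        if self._points:
            self._points.pop()
            self._labels.pop()

    def as_arrays(self):
        """Return (N x 2 float32 coords, N int32 labels)."""
        coords = np.array(self._points, dtype=np.float32).reshape(-1, 2)
        labels = np.array(self._labels, dtype=np.int32)
        return coords, labels

prompts = ClickPrompts()
prompts.add(500, 375)                    # left click on the object
prompts.add(550, 400, foreground=False)  # right click on a region to exclude
click_coords, click_labels = prompts.as_arrays()
```

Inside the callback, `click_coords` and `click_labels` would be passed as `point_coords` and `point_labels`, typically with `multimask_output=False` once there are several clicks.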

Workflow 2: Object extraction

def extract_object(image, point):
    """Extract object at point with transparent background."""
    predictor.set_image(image)

    masks, scores, _ = predictor.predict(
        point_coords=np.array([point]),
        point_labels=np.array([1]),
        multimask_output=True,
    )

    best_mask = masks[np.argmax(scores)]

    # Create RGBA output
    rgba = np.zeros((image.shape[0], image.shape[1], 4), dtype=np.uint8)
    rgba[:, :, :3] = image
    rgba[:, :, 3] = best_mask * 255

    return rgba
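The RGBA output keeps the full image size, so most of it is transparent. A common follow-up is cropping to the object's tight bounding box. A sketch under that assumption, with `crop_to_mask` as a hypothetical helper and a synthetic RGBA array standing in for real output:

```python
import numpy as np

def crop_to_mask(rgba):
    """Crop an H x W x 4 array to the bounding box of its non-zero alpha."""
    alpha = rgba[:, :, 3] > 0
    if not alpha.any():
        return rgba[:0, :0]  # nothing visible: return an empty crop
    rows = np.flatnonzero(alpha.any(axis=1))
    cols = np.flatnonzero(alpha.any(axis=0))
    return rgba[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]

# Synthetic 6 x 8 RGBA image with an opaque 2 x 4 region
demo_rgba = np.zeros((6, 8, 4), dtype=np.uint8)
demo_rgba[2:4, 3:7, 3] = 255
cropped = crop_to_mask(demo_rgba)
```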

Workflow 3: Medical image segmentation

Process medical images (grayscale to RGB)

medical_image = cv2.imread("scan.png", cv2.IMREAD_GRAYSCALE)
rgb_image = cv2.cvtColor(medical_image, cv2.COLOR_GRAY2RGB)

predictor.set_image(rgb_image)

Segment region of interest

masks, scores, _ = predictor.predict(
    box=np.array([x1, y1, x2, y2]),  # ROI bounding box
    multimask_output=True,
)
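The predictor expects an 8-bit H×W×3 image, but many scans are stored at higher bit depth. One possible way to bring them into range is percentile windowing, sketched below. This is an illustrative choice rather than something the workflow above prescribes; `to_uint8_rgb` and its percentile defaults are hypothetical, and the input is a synthetic 12-bit ramp.

```python
import numpy as np

def to_uint8_rgb(scan, low_pct=1.0, high_pct=99.0):
    """Window a single-channel scan to uint8 and replicate it to 3 channels."""
    scan = scan.astype(np.float32)
    lo, hi = np.percentile(scan, [low_pct, high_pct])
    if hi <= lo:
        scaled = np.zeros_like(scan)  # constant image: nothing to window
    else:
        scaled = np.clip((scan - lo) / (hi - lo), 0.0, 1.0)
    gray = (scaled * 255.0).round().astype(np.uint8)
    return np.stack([gray, gray, gray], axis=-1)

# Synthetic 12-bit ramp standing in for a real scan
demo_scan = np.linspace(0, 4095, 64 * 64).reshape(64, 64).astype(np.uint16)
demo_rgb = to_uint8_rgb(demo_scan)
```

The result can be passed straight to `predictor.set_image`. Appropriate window bounds depend on the modality, so the defaults here should be treated as placeholders.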

Output format

Mask data structure

SamAutomaticMaskGenerator output

{
    "segmentation": np.ndarray,  # H×W binary mask
    "bbox": [x, y, w, h],  # Bounding box
    "area": int,  # Pixel count
    "predicted_iou": float,  # 0-1 quality score
    "stability_score": float,  # 0-1 robustness score
    "crop_box": [x, y, w, h],  # Generation crop region
    "point_coords": [[x, y]],  # Input point
}

COCO RLE format

from pycocotools import mask as mask_utils

Encode mask to RLE

rle = mask_utils.encode(np.asfortranarray(mask.astype(np.uint8)))
rle["counts"] = rle["counts"].decode("utf-8")

Decode RLE to mask

decoded_mask = mask_utils.decode(rle)
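pycocotools stores counts as a compressed string, which hides what the encoding does. The sketch below implements the uncompressed variant of the same idea, assuming the usual COCO convention: pixels are read in column-major order and counts alternate between runs of 0s and runs of 1s, starting with 0s. Both function names are hypothetical, and this is not a replacement for pycocotools.

```python
import numpy as np

def mask_to_uncompressed_rle(mask):
    """Encode an H x W binary mask as column-major run lengths."""
    h, w = mask.shape
    flat = np.asarray(mask, dtype=np.uint8).flatten(order="F")
    counts = []
    current, run = 0, 0  # the first run always counts zeros (may be 0 long)
    for value in flat:
        if value == current:
            run += 1
        else:
            counts.append(run)
            current, run = int(value), 1
    counts.append(run)
    return {"size": [h, w], "counts": counts}

def uncompressed_rle_to_mask(rle):
    """Decode the structure produced by mask_to_uncompressed_rle."""
    h, w = rle["size"]
    flat = np.zeros(h * w, dtype=np.uint8)
    pos, value = 0, 0
    for run in rle["counts"]:
        flat[pos:pos + run] = value
        pos += run
        value = 1 - value
    return flat.reshape((h, w), order="F")

demo = np.array([[0, 1], [1, 1]], dtype=np.uint8)
demo_rle = mask_to_uncompressed_rle(demo)
round_trip = uncompressed_rle_to_mask(demo_rle)
```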

Performance optimization

GPU memory

Use smaller model for limited VRAM

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")

Process images in batches

Clear CUDA cache between large batches

torch.cuda.empty_cache()

Speed optimization

Use half precision

sam = sam.half()

Reduce points for automatic generation

mask_generator = SamAutomaticMaskGenerator(
    model=sam,
    points_per_side=16,  # Default is 32
)

Use ONNX for deployment

Export with --return-single-mask for faster inference

Common issues

| Issue | Solution |
| --- | --- |
| Out of memory | Use ViT-B model, reduce image size |
| Slow inference | Use ViT-B, reduce points_per_side |
| Poor mask quality | Try different prompts, use box + points |
| Edge artifacts | Use stability_score filtering |
| Small objects missed | Increase points_per_side |

References

- Advanced Usage - Batching, fine-tuning, integration
- Troubleshooting - Common issues and solutions

Resources

- GitHub: https://github.com/facebookresearch/segment-anything
- Paper: https://arxiv.org/abs/2304.02643
- Demo: https://segment-anything.com
- SAM 2 (Video): https://github.com/facebookresearch/segment-anything-2
- HuggingFace: https://huggingface.co/facebook/sam-vit-huge
