# nemo-curator


## Install

```bash
npx skills add https://github.com/davila7/claude-code-templates --skill nemo-curator
```

## NeMo Curator - GPU-Accelerated Data Curation

NVIDIA's toolkit for preparing high-quality training data for LLMs.

## When to use NeMo Curator

Use NeMo Curator when:

- Preparing LLM training data from web scrapes (Common Crawl)
- Needing fast deduplication (16× faster than CPU)
- Curating multi-modal datasets (text, images, video, audio)
- Filtering low-quality or toxic content
- Scaling data processing across a GPU cluster

**Performance:**

- 16× faster fuzzy deduplication (8TB RedPajama v2)
- 40% lower TCO vs CPU alternatives
- Near-linear scaling across GPU nodes

**Use alternatives instead:**

- datatrove: CPU-based, open-source data processing
- dolma: Allen AI's data toolkit
- Ray Data: general ML data processing (no curation focus)

## Quick start

### Installation

```bash
# Text curation (CUDA 12)
uv pip install "nemo-curator[text_cuda12]"

# All modalities
uv pip install "nemo-curator[all_cuda12]"

# CPU-only (slower)
uv pip install "nemo-curator[cpu]"
```

### Basic text curation pipeline

```python
from nemo_curator import ScoreFilter
from nemo_curator.datasets import DocumentDataset
from nemo_curator.modules import ExactDuplicates
import pandas as pd

# Load data
df = pd.DataFrame({"text": ["Good document", "Bad doc", "Excellent text"]})
dataset = DocumentDataset(df)

# Quality filtering: keep documents longer than 5 words
def quality_score(doc):
    return len(doc["text"].split()) > 5

filtered = ScoreFilter(quality_score)(dataset)

# Deduplication
deduped = ExactDuplicates()(filtered)

# Save
deduped.to_parquet("curated_data/")
```

## Data curation pipeline

### Stage 1: Quality filtering

Apply any of the 30+ built-in heuristic filters:

```python
from nemo_curator import ScoreFilter
from nemo_curator.filters import (
    WordCountFilter,
    RepeatedLinesFilter,
    UrlRatioFilter,
    NonAlphaNumericFilter,
)

# Word count filter
dataset = dataset.filter(WordCountFilter(min_words=50, max_words=100000))

# Remove repetitive content
dataset = dataset.filter(RepeatedLinesFilter(max_repeated_line_fraction=0.3))

# URL ratio filter
dataset = dataset.filter(UrlRatioFilter(max_url_ratio=0.2))
```

### Stage 2: Deduplication

**Exact deduplication:**

```python
from nemo_curator.modules import ExactDuplicates

# Remove exact duplicates
deduped = ExactDuplicates(id_field="id", text_field="text")(dataset)
```

**Fuzzy deduplication (16× faster on GPU):**

```python
from nemo_curator.modules import FuzzyDuplicates

# MinHash + LSH deduplication
fuzzy_dedup = FuzzyDuplicates(
    id_field="id",
    text_field="text",
    num_hashes=260,  # MinHash parameters
    num_buckets=20,
    hash_method="md5",
)

deduped = fuzzy_dedup(dataset)
```
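To build intuition for `num_hashes` and `num_buckets`, here is a minimal, self-contained sketch of the MinHash + LSH idea in plain Python. It is illustrative only, not NeMo Curator's implementation: each document gets a 260-entry MinHash signature, the signature is split into 20 bands, and two documents become duplicate candidates when any band matches exactly.

```python
import hashlib

NUM_HASHES = 260  # signature length, matching num_hashes above
NUM_BUCKETS = 20  # LSH bands; 260 / 20 = 13 hash values per band

def shingles(text: str, n: int = 3) -> set[str]:
    """Word n-gram shingles of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash_signature(text: str) -> list[int]:
    """One min-hash per seed; the fraction of equal entries estimates Jaccard similarity."""
    grams = shingles(text)
    return [
        min(int(hashlib.md5(f"{seed}:{g}".encode()).hexdigest(), 16) for g in grams)
        for seed in range(NUM_HASHES)
    ]

def lsh_bands(sig: list[int]) -> list[tuple[int, ...]]:
    """Split a signature into bands; documents sharing any band are candidates."""
    rows = len(sig) // NUM_BUCKETS
    return [tuple(sig[i * rows:(i + 1) * rows]) for i in range(NUM_BUCKETS)]

a = minhash_signature("the quick brown fox jumps over the lazy sleeping dog")
b = minhash_signature("the quick brown fox jumps over the lazy sleeping cat")

# Fraction of equal signature entries approximates Jaccard similarity
print(sum(x == y for x, y in zip(a, b)) / NUM_HASHES)
# Near-duplicates are flagged when at least one band matches exactly
print(any(x == y for x, y in zip(lsh_bands(a), lsh_bands(b))))
```

With 13 rows per band, a pair with Jaccard similarity s shares a given band with probability roughly s^13, so only fairly similar documents surface as candidates.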

**Semantic deduplication:**

```python
from nemo_curator.modules import SemanticDuplicates

# Embedding-based deduplication
semantic_dedup = SemanticDuplicates(
    id_field="id",
    text_field="text",
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    threshold=0.8,  # cosine similarity threshold
)

deduped = semantic_dedup(dataset)
```
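The `threshold` is cosine similarity between document embeddings. A minimal sketch (using the sentence-transformers model named above; requires the `sentence-transformers` package) shows what the 0.8 cutoff compares; pairs scoring above it would be collapsed as semantic duplicates:

```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

docs = [
    "The cat sat on the mat.",
    "A cat was sitting on the mat.",
    "Quarterly GPU revenue grew 40%.",
]
emb = model.encode(docs)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Print similarity for each pair; scores above 0.8 would mark duplicates
for i in range(len(docs)):
    for j in range(i + 1, len(docs)):
        print(i, j, round(cosine(emb[i], emb[j]), 3))
```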

### Stage 3: PII redaction

```python
from nemo_curator.modules import Modify
from nemo_curator.modifiers import PIIRedactor

# Redact personally identifiable information
pii_redactor = PIIRedactor(
    supported_entities=["EMAIL_ADDRESS", "PHONE_NUMBER", "PERSON", "LOCATION"],
    anonymize_action="replace",  # or "redact"
)

redacted = Modify(pii_redactor)(dataset)
```
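As a rough illustration of the two `anonymize_action` modes, here is a toy email-only redactor. This is hypothetical regex code for intuition, not the library's detector, which handles the full set of entity types listed above:

```python
import re

# Toy pattern covering only the EMAIL_ADDRESS entity
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def anonymize(text: str, action: str = "replace") -> str:
    # "replace" swaps the entity for a type placeholder; "redact" removes it
    placeholder = "[EMAIL_ADDRESS]" if action == "replace" else ""
    return EMAIL.sub(placeholder, text)

print(anonymize("Contact jane.doe@example.com for details."))
print(anonymize("Contact jane.doe@example.com for details.", action="redact"))
```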

### Stage 4: Classifier filtering

```python
from nemo_curator.classifiers import QualityClassifier

# Quality classification
quality_clf = QualityClassifier(
    model_path="nvidia/quality-classifier-deberta",
    batch_size=256,
    device="cuda",
)

# Filter low-quality documents
high_quality = dataset.filter(lambda doc: quality_clf(doc["text"]) > 0.5)
```

## GPU acceleration

### GPU vs CPU performance

| Operation | CPU (16 cores) | GPU (A100) | Speedup |
|-----------|----------------|------------|---------|
| Fuzzy dedup (8TB) | 120 hours | 7.5 hours | 16× |
| Exact dedup (1TB) | 8 hours | 0.5 hours | 16× |
| Quality filtering | 2 hours | 0.2 hours | 10× |

### Multi-GPU scaling

```python
from nemo_curator import get_client
import dask_cuda

# Initialize GPU cluster
client = get_client(cluster_type="gpu", n_workers=8)

# Process with 8 GPUs
deduped = FuzzyDuplicates(...)(dataset)
```

## Multi-modal curation

### Image curation

```python
from nemo_curator.image import (
    AestheticFilter,
    NSFWFilter,
    CLIPEmbedder,
)

# Aesthetic scoring
aesthetic_filter = AestheticFilter(threshold=5.0)
filtered_images = aesthetic_filter(image_dataset)

# NSFW detection
nsfw_filter = NSFWFilter(threshold=0.9)
safe_images = nsfw_filter(filtered_images)

# Generate CLIP embeddings
clip_embedder = CLIPEmbedder(model="openai/clip-vit-base-patch32")
image_embeddings = clip_embedder(safe_images)
```

### Video curation

```python
from nemo_curator.video import (
    SceneDetector,
    ClipExtractor,
    InternVideo2Embedder,
)

# Detect scenes
scene_detector = SceneDetector(threshold=27.0)
scenes = scene_detector(video_dataset)

# Extract clips
clip_extractor = ClipExtractor(min_duration=2.0, max_duration=10.0)
clips = clip_extractor(scenes)

# Generate embeddings
video_embedder = InternVideo2Embedder()
video_embeddings = video_embedder(clips)
```

### Audio curation

```python
from nemo_curator.audio import (
    ASRInference,
    WERFilter,
    DurationFilter,
)

# ASR transcription
asr = ASRInference(model="nvidia/stt_en_fastconformer_hybrid_large_pc")
transcribed = asr(audio_dataset)

# Filter by WER (word error rate)
wer_filter = WERFilter(max_wer=0.3)
high_quality_audio = wer_filter(transcribed)

# Duration filtering
duration_filter = DurationFilter(min_duration=1.0, max_duration=30.0)
filtered_audio = duration_filter(high_quality_audio)
```

## Common patterns

### Web scrape curation (Common Crawl)

```python
from nemo_curator import ScoreFilter, Modify
from nemo_curator.filters import *
from nemo_curator.modules import *
from nemo_curator.datasets import DocumentDataset

# Load Common Crawl data
dataset = DocumentDataset.read_parquet("common_crawl/*.parquet")

# Pipeline
pipeline = [
    # 1. Quality filtering
    WordCountFilter(min_words=100, max_words=50000),
    RepeatedLinesFilter(max_repeated_line_fraction=0.2),
    SymbolToWordRatioFilter(max_symbol_to_word_ratio=0.3),
    UrlRatioFilter(max_url_ratio=0.3),

    # 2. Language filtering
    LanguageIdentificationFilter(target_languages=["en"]),

    # 3. Deduplication
    ExactDuplicates(id_field="id", text_field="text"),
    FuzzyDuplicates(id_field="id", text_field="text", num_hashes=260),

    # 4. PII redaction
    PIIRedactor(),

    # 5. NSFW filtering
    NSFWClassifier(threshold=0.8),
]

# Execute
for stage in pipeline:
    dataset = stage(dataset)

# Save
dataset.to_parquet("curated_common_crawl/")
```

### Distributed processing

```python
from nemo_curator import get_client
from dask_cuda import LocalCUDACluster

# Multi-GPU cluster
cluster = LocalCUDACluster(n_workers=8)
client = get_client(cluster=cluster)

# Process large dataset
dataset = DocumentDataset.read_parquet("s3://large_dataset/*.parquet")
deduped = FuzzyDuplicates(...)(dataset)

# Cleanup
client.close()
cluster.close()
```

## Performance benchmarks

| Workload | CPU | GPU | Speedup |
|----------|-----|-----|---------|
| Fuzzy deduplication (8TB RedPajama v2) | 120 hours (256 cores) | 7.5 hours (8× A100) | 16× |
| Exact deduplication (1TB) | 8 hours (64 cores) | 0.5 hours (4× A100) | 16× |
| Quality filtering (100GB) | 2 hours (32 cores) | 0.2 hours (2× A100) | 10× |

### Cost comparison

**CPU-based curation (AWS c5.18xlarge × 10):**

- Cost: $3.60/hour × 10 = $36/hour
- Time for 8TB: 120 hours
- Total: $4,320

**GPU-based curation (AWS p4d.24xlarge × 2):**

- Cost: $32.77/hour × 2 = $65.54/hour
- Time for 8TB: 7.5 hours
- Total: $491.55

**Savings: 89% reduction ($3,828 saved)**
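The arithmetic behind those totals, spelled out:

```python
cpu_rate = 3.60 * 10        # $/hour for 10× c5.18xlarge = $36/hour
cpu_total = cpu_rate * 120  # 120 hours for the 8TB job = $4,320.00

gpu_rate = 32.77 * 2        # $/hour for 2× p4d.24xlarge = $65.54/hour
gpu_total = gpu_rate * 7.5  # 7.5 hours for the same job = $491.55

savings = cpu_total - gpu_total
print(f"${savings:,.2f} saved ({savings / cpu_total:.0%} reduction)")
# -> $3,828.45 saved (89% reduction)
```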

## Supported data formats

- Input: Parquet, JSONL, CSV
- Output: Parquet (recommended), JSONL
- WebDataset: TAR archives for multi-modal data
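A small conversion sketch between the input and output formats; this assumes a `DocumentDataset.read_json` reader exists alongside the `read_parquet` used earlier, so check the docs for your version:

```python
from nemo_curator.datasets import DocumentDataset

# JSONL in, Parquet out (Parquet is the recommended output format)
dataset = DocumentDataset.read_json("raw_data/*.jsonl")
dataset.to_parquet("converted_data/")
```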

## Use cases

**Production deployments:**

- NVIDIA used NeMo Curator to prepare Nemotron-4 training data
- Open-source datasets curated: RedPajama v2, The Pile

## References

- Filtering Guide - 30+ quality filters and heuristics
- Deduplication Guide - exact, fuzzy, and semantic methods

## Resources

- GitHub: https://github.com/NVIDIA/NeMo-Curator (⭐ 500+)
- Docs: https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/
- Version: 0.4.0+
- License: Apache 2.0
