sentence-transformers

Installs: 201
Rank: #4295

Install

npx skills add https://github.com/davila7/claude-code-templates --skill sentence-transformers

Sentence Transformers - State-of-the-Art Embeddings

Python framework for sentence and text embeddings using transformers.

When to use Sentence Transformers

Use when:

- High-quality embeddings for RAG
- Semantic similarity and search
- Text clustering and classification
- Multilingual embeddings (100+ languages)
- Running embeddings locally (no API)
- A cost-effective alternative to OpenAI embeddings

Metrics:

- 15,700+ GitHub stars
- 5,000+ pre-trained models
- 100+ languages supported
- Based on PyTorch/Transformers

Use alternatives instead:

- OpenAI Embeddings: when you need an API-based service and the highest quality
- Instructor: task-specific instructions
- Cohere Embed: managed service

Quick start

Installation

pip install sentence-transformers

Basic usage

from sentence_transformers import SentenceTransformer

# Load model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Generate embeddings
sentences = [
    "This is an example sentence",
    "Each sentence is converted to a vector"
]

embeddings = model.encode(sentences)
print(embeddings.shape)  # (2, 384)

# Cosine similarity
from sentence_transformers.util import cos_sim

similarity = cos_sim(embeddings[0], embeddings[1])
print(f"Similarity: {similarity.item():.4f}")
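For intuition, the score reported by `cos_sim` is the cosine of the angle between two vectors. The pure-Python sketch below reproduces that formula on toy 3-dimensional vectors. The function name `cosine_similarity` and the sample vectors are illustrative assumptions, not part of the library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real 384-dimensional embeddings (hypothetical values).
v1 = [1.0, 2.0, 3.0]
v2 = [2.0, 4.0, 6.0]    # same direction as v1, twice as long
v3 = [-3.0, 0.0, 1.0]   # orthogonal to v1

same_direction = cosine_similarity(v1, v2)  # close to 1: length does not matter
orthogonal = cosine_similarity(v1, v3)      # close to 0: unrelated directions
```

Scores near 1 indicate vectors pointing the same way, scores near 0 indicate unrelated directions, and vector length has no effect on the result.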

Popular models

General purpose

# Fast, good quality (384 dim)
model = SentenceTransformer('all-MiniLM-L6-v2')

# Better quality (768 dim)
model = SentenceTransformer('all-mpnet-base-v2')

# Largest (1024 dim, slower); not guaranteed to beat all-mpnet-base-v2, so benchmark on your task
model = SentenceTransformer('all-roberta-large-v1')

Multilingual

# 50+ languages, small and fast (384 dim)
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

# 50+ languages, higher quality (768 dim)
model = SentenceTransformer('paraphrase-multilingual-mpnet-base-v2')

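Domain-specific

# These are plain Transformer checkpoints rather than sentence-embedding models.
# SentenceTransformer wraps them with mean pooling, so fine-tune them before relying on similarity scores.

# Legal domain
model = SentenceTransformer('nlpaueb/legal-bert-base-uncased')

# Scientific papers
model = SentenceTransformer('allenai/specter')

# Code
model = SentenceTransformer('microsoft/codebert-base')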

Semantic search

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

# Corpus
corpus = [
    "Python is a programming language",
    "Machine learning uses algorithms",
    "Neural networks are powerful"
]

# Encode corpus
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

# Query
query = "What is Python?"
query_embedding = model.encode(query, convert_to_tensor=True)

# Find most similar
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=3)
print(hits)
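`util.semantic_search` returns one list of hits per query, where each hit is a dict with a `corpus_id` and a `score`, ordered best first. The sketch below mimics that shape in pure Python so the ranking step and the mapping from ids back to text are visible. The helper `top_k_search` and the 2-dimensional toy vectors are assumptions for illustration only; real vectors would come from `model.encode`.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k_search(query_vec, corpus_vecs, top_k=3):
    """Rank corpus vectors by cosine similarity to the query, best first."""
    scored = [
        {"corpus_id": idx, "score": cosine_similarity(query_vec, vec)}
        for idx, vec in enumerate(corpus_vecs)
    ]
    scored.sort(key=lambda hit: hit["score"], reverse=True)
    return scored[:top_k]

corpus = [
    "Python is a programming language",
    "Machine learning uses algorithms",
    "Neural networks are powerful",
]

# Hypothetical 2-dimensional vectors standing in for model.encode(corpus).
toy_corpus_vecs = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.9]]
toy_query_vec = [1.0, 0.0]

hits = top_k_search(toy_query_vec, toy_corpus_vecs, top_k=2)

# Map each hit back to its corpus text.
results = [(corpus[hit["corpus_id"]], hit["score"]) for hit in hits]
for text, score in results:
    print(f"{score:.3f}  {text}")
```

With the real library, the same lookup applies to the first element of the returned list, for example `corpus[hit["corpus_id"]]` for each `hit` in `hits[0]`.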

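Similarity computation

from sentence_transformers import util

# embedding1, embedding2, and embeddings are outputs of model.encode(...)

# Cosine similarity
similarity = util.cos_sim(embedding1, embedding2)

# Dot product
similarity = util.dot_score(embedding1, embedding2)

# Pairwise cosine similarity (all-pairs matrix)
similarities = util.cos_sim(embeddings, embeddings)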
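The two scores differ only in whether vector length counts: the dot product grows with the length of the vectors, while cosine similarity ignores it. Once vectors are L2-normalized the two coincide, which the following pure-Python sketch checks on toy values. The helper names and vectors are illustrative assumptions.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

a = [3.0, 4.0]   # length 5
b = [1.0, 0.0]   # length 1

raw_dot = dot(a, b)                                      # depends on vector length
cos = cosine(a, b)                                       # length-independent
normalized_dot = dot(l2_normalize(a), l2_normalize(b))   # matches the cosine score
```

In the library, passing `normalize_embeddings=True` to `model.encode` produces unit-length vectors, after which `util.dot_score` and `util.cos_sim` should give the same ranking.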

Batch encoding

# Efficient batch processing
sentences = ["sentence 1", "sentence 2"] * 1000

embeddings = model.encode(
    sentences,
    batch_size=32,
    show_progress_bar=True,
    convert_to_tensor=False  # or True for PyTorch tensors
)
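To make the effect of `batch_size` concrete, the sketch below splits a list into consecutive chunks, which is the general idea behind batched encoding. It is a conceptual illustration, not the library's internal implementation, and `iter_batches` is a hypothetical helper.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

sentences = ["sentence 1", "sentence 2"] * 1000   # 2000 sentences in total

# One forward pass would be run per batch; here we only record the batch sizes.
batch_sizes = [len(batch) for batch in iter_batches(sentences, batch_size=32)]

print(len(batch_sizes), batch_sizes[-1])
```

Larger batches mean fewer forward passes but more memory per pass, so the best value depends on the model size and the available RAM or GPU memory.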

Fine-tuning

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Base model to fine-tune
model = SentenceTransformer('all-MiniLM-L6-v2')

# Training data: sentence pairs with a target similarity score
train_examples = [
    InputExample(texts=['sentence 1', 'sentence 2'], label=0.8),
    InputExample(texts=['sentence 3', 'sentence 4'], label=0.3),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# Loss function
train_loss = losses.CosineSimilarityLoss(model)

# Train (model.fit is the legacy training API; newer releases also provide SentenceTransformerTrainer)
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=10,
    warmup_steps=100
)

# Save
model.save('my-finetuned-model')
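As a rough picture of what `CosineSimilarityLoss` optimizes: with its default settings it compares the cosine similarity of each sentence pair's embeddings against the label using mean squared error. The pure-Python sketch below computes that quantity on hand-picked 2-dimensional vectors. The function names and values are hypothetical, and the real loss operates on PyTorch tensors with gradients.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def cosine_similarity_loss(pairs, labels):
    """Mean squared error between predicted cosine similarity and target label."""
    errors = [
        (cosine_similarity(u, v) - label) ** 2
        for (u, v), label in zip(pairs, labels)
    ]
    return sum(errors) / len(errors)

# Hypothetical embeddings for two sentence pairs:
# the first pair is identical (cosine 1), the second is orthogonal (cosine 0).
pairs = [
    ([1.0, 0.0], [1.0, 0.0]),
    ([1.0, 0.0], [0.0, 1.0]),
]

perfect = cosine_similarity_loss(pairs, labels=[1.0, 0.0])   # predictions match labels
off = cosine_similarity_loss(pairs, labels=[0.8, 0.3])       # predictions miss labels
```

Training nudges the model so that pair similarities move toward their labels, which is why the labels should be on the same scale as cosine similarity.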

LangChain integration

from langchain_community.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2"
)

# Use with vector stores
from langchain_chroma import Chroma

# docs is a list of LangChain Document objects prepared elsewhere
vectorstore = Chroma.from_documents(
    documents=docs,
    embedding=embeddings
)

LlamaIndex integration

from llama_index.core import Settings, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(
    model_name="sentence-transformers/all-mpnet-base-v2"
)

Settings.embed_model = embed_model

# Use in index (documents is a list of LlamaIndex Document objects loaded elsewhere)
index = VectorStoreIndex.from_documents(documents)

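Model selection guide

| Model | Dimensions | Speed | Quality | Use case |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | Fast | Good | General, prototyping |
| all-mpnet-base-v2 | 768 | Medium | Better | Production RAG |
| all-roberta-large-v1 | 1024 | Slow | High | High accuracy needed |
| paraphrase-multilingual-mpnet-base-v2 | 768 | Medium | Good | Multilingual |

The Sentence Transformers documentation recommends all-mpnet-base-v2 for best quality among the general-purpose models, so compare it with all-roberta-large-v1 on your own data before paying for the larger model.

Best practices

1. Start with all-MiniLM-L6-v2 - Good baseline
2. Normalize embeddings - Better for cosine similarity
3. Use GPU if available - Often around 10× faster encoding
4. Batch encoding - More efficient
5. Cache embeddings - Expensive to recompute
6. Fine-tune for domain - Improves quality
7. Test different models - Quality varies by task
8. Monitor memory - Large models need more RAM

Performance

Figures are approximate and depend on hardware and input length.

| Model | Speed (sentences/sec) | Memory | Dimensions |
|---|---|---|---|
| MiniLM | ~2000 | 120MB | 384 |
| MPNet | ~600 | 420MB | 768 |
| RoBERTa | ~300 | 1.3GB | 1024 |

Resources

- GitHub: https://github.com/UKPLab/sentence-transformers (15,700+ stars)
- Models: https://huggingface.co/sentence-transformers
- Docs: https://www.sbert.net
- License: Apache 2.0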
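The "Cache embeddings" practice can be sketched with a small in-memory wrapper that only encodes texts it has not seen before. `EmbeddingCache` and `fake_encode` are hypothetical names. The stub encoder stands in for `model.encode`, and a production cache would more likely persist vectors to disk or a vector store.

```python
import hashlib

class EmbeddingCache:
    """In-memory embedding cache keyed by a hash of the text (hypothetical helper)."""

    def __init__(self, encode_fn):
        self.encode_fn = encode_fn   # any callable mapping a list of strings to a list of vectors
        self.store = {}

    def encode(self, texts):
        keys = [hashlib.sha256(t.encode("utf-8")).hexdigest() for t in texts]
        # Collect texts that are not cached yet, dropping duplicates within the call.
        missing = {}
        for key, text in zip(keys, texts):
            if key not in self.store:
                missing[key] = text
        if missing:
            vectors = self.encode_fn(list(missing.values()))
            for key, vec in zip(missing, vectors):
                self.store[key] = vec
        return [self.store[key] for key in keys]

# Stub encoder that records its calls; a real setup would pass model.encode instead.
calls = []

def fake_encode(texts):
    calls.append(list(texts))
    return [[float(len(t))] for t in texts]

cache = EmbeddingCache(fake_encode)
first = cache.encode(["a", "bb", "a"])    # encodes "a" and "bb" once each
second = cache.encode(["bb", "ccc"])      # only "ccc" reaches the encoder
```

Keying on a hash of the text keeps the lookup independent of input order. If the model changes, the cache has to be cleared, since vectors from different models are not comparable.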
