chroma

Installs: 215
Rank: #4072

Install

npx skills add https://github.com/davila7/claude-code-templates --skill chroma

Chroma - Open-Source Embedding Database

The AI-native database for building LLM applications with memory.

When to use Chroma

Use Chroma when:

- Building RAG (retrieval-augmented generation) applications
- You need a local/self-hosted vector database
- You want an open-source solution (Apache 2.0)
- Prototyping in notebooks
- Semantic search over documents
- Storing embeddings with metadata

Metrics:

- 24,300+ GitHub stars
- 1,900+ forks
- v1.3.3 (stable, weekly releases)
- Apache 2.0 license

Use alternatives instead:

- Pinecone: Managed cloud, auto-scaling
- FAISS: Pure similarity search, no metadata
- Weaviate: Production ML-native database
- Qdrant: High performance, Rust-based

Quick start

Installation

Python

pip install chromadb

JavaScript/TypeScript

npm install chromadb @chroma-core/default-embed

Basic usage (Python)

import chromadb

Create client

client = chromadb.Client()

Create collection

collection = client.create_collection(name="my_collection")

Add documents

collection.add(
    documents=["This is document 1", "This is document 2"],
    metadatas=[{"source": "doc1"}, {"source": "doc2"}],
    ids=["id1", "id2"]
)

Query

results = collection.query( query_texts=["document about topic"], n_results=2 )

print(results)

Core operations

1. Create collection

Simple collection

collection = client.create_collection("my_docs")

With custom embedding function

from chromadb.utils import embedding_functions

openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="your-key",
    model_name="text-embedding-3-small"
)

collection = client.create_collection( name="my_docs", embedding_function=openai_ef )

Get existing collection

collection = client.get_collection("my_docs")

Delete collection

client.delete_collection("my_docs")

2. Add documents

Add documents with metadata

collection.add( documents=["Doc 1", "Doc 2", "Doc 3"], metadatas=[ {"source": "web", "category": "tutorial"}, {"source": "pdf", "page": 5}, {"source": "api", "timestamp": "2025-01-01"} ], ids=["id1", "id2", "id3"] )

Add with custom embeddings

collection.add(
    embeddings=[[0.1, 0.2, ...], [0.3, 0.4, ...]],
    documents=["Doc 1", "Doc 2"],
    ids=["id1", "id2"]
)

3. Query (similarity search)

Basic query

results = collection.query( query_texts=["machine learning tutorial"], n_results=5 )

Query with filters

results = collection.query( query_texts=["Python programming"], n_results=3, where={"source": "web"} )

Query with combined metadata filters

results = collection.query( query_texts=["advanced topics"], where={ "$and": [ {"category": "tutorial"}, {"difficulty": {"$gte": 3}} ] } )

Access results

print(results["documents"]) # List of matching documents print(results["metadatas"]) # Metadata for each doc print(results["distances"]) # Similarity scores print(results["ids"]) # Document IDs

4. Get documents

Get by IDs

docs = collection.get(
    ids=["id1", "id2"]
)

Get with filters

docs = collection.get(
    where={"category": "tutorial"},
    limit=10
)

Get all documents

docs = collection.get()
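
To check how many records a collection holds (useful for the size monitoring mentioned under best practices), collection.count() returns the total:

num_docs = collection.count()
print(f"Collection holds {num_docs} documents")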

5. Update documents

Update document content

collection.update(
    ids=["id1"],
    documents=["Updated content"],
    metadatas=[{"source": "updated"}]
)
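
Chroma also provides upsert, which updates existing IDs and inserts new ones in a single call; a minimal sketch (the example IDs and content are illustrative):

collection.upsert(
    ids=["id1", "id3"],                               # id1 exists, id3 is new
    documents=["Updated content", "Brand new document"],
    metadatas=[{"source": "updated"}, {"source": "new"}]
)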

6. Delete documents

Delete by IDs

collection.delete(ids=["id1", "id2"])

Delete with filter

collection.delete(
    where={"source": "outdated"}
)

Persistent storage

Persist to disk

client = chromadb.PersistentClient(path="./chroma_db")

collection = client.create_collection("my_docs") collection.add(documents=["Doc 1"], ids=["id1"])

Data persisted automatically

Reload later with same path

client = chromadb.PersistentClient(path="./chroma_db") collection = client.get_collection("my_docs")

Embedding functions

Default (Sentence Transformers)

Uses sentence-transformers by default

collection = client.create_collection("my_docs")

Default model: all-MiniLM-L6-v2
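
To pin the default model explicitly, or swap in another sentence-transformers model, the built-in SentenceTransformerEmbeddingFunction can be passed at creation time. A minimal sketch; the collection name is illustrative and the sentence-transformers package must be installed:

from chromadb.utils import embedding_functions

st_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)
collection = client.create_collection(
    name="st_docs",
    embedding_function=st_ef
)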

OpenAI

from chromadb.utils import embedding_functions

openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="your-key",
    model_name="text-embedding-3-small"
)

collection = client.create_collection( name="openai_docs", embedding_function=openai_ef )

HuggingFace

huggingface_ef = embedding_functions.HuggingFaceEmbeddingFunction(
    api_key="your-key",
    model_name="sentence-transformers/all-mpnet-base-v2"
)

collection = client.create_collection( name="hf_docs", embedding_function=huggingface_ef )

Custom embedding function

from chromadb import Documents, EmbeddingFunction, Embeddings

class MyEmbeddingFunction(EmbeddingFunction):
    def __call__(self, input: Documents) -> Embeddings:
        # Your embedding logic here
        return embeddings

my_ef = MyEmbeddingFunction()
collection = client.create_collection(
    name="custom_docs",
    embedding_function=my_ef
)
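
A concrete sketch of the pattern above, wrapping a sentence-transformers model directly; the model name and collection name are illustrative, and the sentence-transformers package is assumed to be installed:

from chromadb import Documents, EmbeddingFunction, Embeddings
from sentence_transformers import SentenceTransformer

class LocalEmbeddingFunction(EmbeddingFunction):
    def __init__(self):
        # Any local model works; all-MiniLM-L6-v2 is small and fast
        self.model = SentenceTransformer("all-MiniLM-L6-v2")

    def __call__(self, input: Documents) -> Embeddings:
        # encode() returns a numpy array; Chroma expects plain lists of floats
        return self.model.encode(list(input)).tolist()

collection = client.create_collection(
    name="local_docs",
    embedding_function=LocalEmbeddingFunction()
)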

Metadata filtering

Exact match

results = collection.query( query_texts=["query"], where={"category": "tutorial"} )

Comparison operators

results = collection.query( query_texts=["query"], where={"page": {"$gt": 10}} # $gt, $gte, $lt, $lte, $ne )

Logical operators

results = collection.query( query_texts=["query"], where={ "$and": [ {"category": "tutorial"}, {"difficulty": {"$lte": 3}} ] } # Also: $or )

Membership ($in)

results = collection.query( query_texts=["query"], where={"tags": {"$in": ["python", "ml"]}} )

LangChain integration

from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

Split documents

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000)
docs = text_splitter.split_documents(documents)

Create Chroma vector store

vectorstore = Chroma.from_documents(
    documents=docs,
    embedding=OpenAIEmbeddings(),
    persist_directory="./chroma_db"
)

Query

results = vectorstore.similarity_search("machine learning", k=3)

As retriever

retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
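
With recent LangChain versions the retriever is a runnable and can be invoked directly; a minimal usage sketch (the query text is illustrative):

docs = retriever.invoke("machine learning")
for doc in docs:
    # Each result is a LangChain Document with page_content and metadata
    print(doc.page_content[:80])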

LlamaIndex integration

from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import VectorStoreIndex, StorageContext
import chromadb

Initialize Chroma

db = chromadb.PersistentClient(path="./chroma_db")
collection = db.get_or_create_collection("my_collection")

Create vector store

vector_store = ChromaVectorStore(chroma_collection=collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

Create index

index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context
)

Query

query_engine = index.as_query_engine()
response = query_engine.query("What is machine learning?")
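
The response can be printed directly, and its source nodes show which chunks were retrieved; a minimal sketch, with attribute names following the llama_index.core API:

print(response)
for node_with_score in response.source_nodes:
    # Show the similarity score and the start of the retrieved chunk
    print(node_with_score.score, node_with_score.node.get_content()[:80])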

Server mode

Run Chroma server

Terminal: chroma run --path ./chroma_db --port 8000

Connect to server

import chromadb
from chromadb.config import Settings

client = chromadb.HttpClient(
    host="localhost",
    port=8000,
    settings=Settings(anonymized_telemetry=False)
)

Use as normal

collection = client.get_or_create_collection("my_docs")
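
A minimal connectivity check against the server; heartbeat() returns a timestamp and raises if the server at localhost:8000 is unreachable:

print(client.heartbeat())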

Best practices

- Use a persistent client - don't lose data on restart
- Add metadata - enables filtering and tracking
- Batch operations - add multiple docs at once
- Choose the right embedding model - balance speed and quality
- Use filters - narrow the search space
- Unique IDs - avoid collisions
- Regular backups - copy the chroma_db directory
- Monitor collection size - scale up if needed
- Test embedding functions - ensure quality
- Use server mode for production - better for multi-user setups

Performance

| Operation       | Latency   | Notes                      |
|-----------------|-----------|----------------------------|
| Add 100 docs    | ~1-3s     | With embedding             |
| Query (top 10)  | ~50-200ms | Depends on collection size |
| Metadata filter | ~10-50ms  | Fast with proper indexing  |

Resources

- GitHub: https://github.com/chroma-core/chroma (⭐ 24,300+)
- Docs: https://docs.trychroma.com
- Discord: https://discord.gg/MMeYNTmh3x
- Version: 1.3.3+
- License: Apache 2.0
