chroma

Installs: 215
Rank: #4072

Install

npx skills add https://github.com/davila7/claude-code-templates --skill chroma

Chroma - Open-Source Embedding Database

The AI-native database for building LLM applications with memory.

When to use Chroma

Use Chroma when:

- Building RAG (retrieval-augmented generation) applications
- You need a local/self-hosted vector database
- You want an open-source solution (Apache 2.0)
- Prototyping in notebooks
- Semantic search over documents
- Storing embeddings with metadata

Metrics:

- 24,300+ GitHub stars
- 1,900+ forks
- v1.3.3 (stable, weekly releases)
- Apache 2.0 license

Use alternatives instead:

- Pinecone: Managed cloud, auto-scaling
- FAISS: Pure similarity search, no metadata
- Weaviate: Production ML-native database
- Qdrant: High performance, Rust-based

Quick start

Installation

Python

pip install chromadb

JavaScript/TypeScript

npm install chromadb @chroma-core/default-embed

Basic usage (Python)

import chromadb

Create client

client = chromadb.Client()

Create collection

collection = client.create_collection(name="my_collection")

Add documents

collection.add(
    documents=["This is document 1", "This is document 2"],
    metadatas=[{"source": "doc1"}, {"source": "doc2"}],
    ids=["id1", "id2"]
)

Query

results = collection.query( query_texts=["document about topic"], n_results=2 )

print(results)

Core operations

1. Create collection

Simple collection

collection = client.create_collection("my_docs")

With custom embedding function

from chromadb.utils import embedding_functions

openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="your-key",
    model_name="text-embedding-3-small"
)

collection = client.create_collection( name="my_docs", embedding_function=openai_ef )

Get existing collection

collection = client.get_collection("my_docs")

Delete collection

client.delete_collection("my_docs")

2. Add documents

Add documents with metadata

collection.add( documents=["Doc 1", "Doc 2", "Doc 3"], metadatas=[ {"source": "web", "category": "tutorial"}, {"source": "pdf", "page": 5}, {"source": "api", "timestamp": "2025-01-01"} ], ids=["id1", "id2", "id3"] )

Add with custom embeddings

collection.add(
    embeddings=[[0.1, 0.2, ...], [0.3, 0.4, ...]],
    documents=["Doc 1", "Doc 2"],
    ids=["id1", "id2"]
)

3. Query (similarity search)

Basic query

results = collection.query( query_texts=["machine learning tutorial"], n_results=5 )

Query with filters

results = collection.query( query_texts=["Python programming"], n_results=3, where={"source": "web"} )

Query with combined metadata filters

results = collection.query( query_texts=["advanced topics"], where={ "$and": [ {"category": "tutorial"}, {"difficulty": {"$gte": 3}} ] } )

Access results

print(results["documents"]) # List of matching documents print(results["metadatas"]) # Metadata for each doc print(results["distances"]) # Similarity scores print(results["ids"]) # Document IDs

4. Get documents

Get by IDs

docs = collection.get(
    ids=["id1", "id2"]
)

Get with filters

docs = collection.get(
    where={"category": "tutorial"},
    limit=10
)

Get all documents

docs = collection.get()
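
To check how many records a collection holds (useful for the size monitoring mentioned under best practices), collection.count() returns the total:

num_docs = collection.count()
print(f"Collection holds {num_docs} documents")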

5. Update documents

Update document content

collection.update(
    ids=["id1"],
    documents=["Updated content"],
    metadatas=[{"source": "updated"}]
)
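
Chroma also provides upsert, which updates existing IDs and inserts new ones in a single call; a minimal sketch (the example IDs and content are illustrative):

collection.upsert(
    ids=["id1", "id3"],                               # id1 exists, id3 is new
    documents=["Updated content", "Brand new document"],
    metadatas=[{"source": "updated"}, {"source": "new"}]
)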

6. Delete documents

Delete by IDs

collection.delete(ids=["id1", "id2"])

Delete with filter

collection.delete(
    where={"source": "outdated"}
)

Persistent storage

Persist to disk

client = chromadb.PersistentClient(path="./chroma_db")

collection = client.create_collection("my_docs") collection.add(documents=["Doc 1"], ids=["id1"])

Data persisted automatically

Reload later with same path

client = chromadb.PersistentClient(path="./chroma_db") collection = client.get_collection("my_docs")

Embedding functions

Default (Sentence Transformers)

Uses sentence-transformers by default

collection = client.create_collection("my_docs")

Default model: all-MiniLM-L6-v2
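
To pin the default model explicitly, or swap in another sentence-transformers model, the built-in SentenceTransformerEmbeddingFunction can be passed at creation time. A minimal sketch; the collection name is illustrative and the sentence-transformers package must be installed:

from chromadb.utils import embedding_functions

st_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)
collection = client.create_collection(
    name="st_docs",
    embedding_function=st_ef
)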

OpenAI

from chromadb.utils import embedding_functions

openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="your-key",
    model_name="text-embedding-3-small"
)

collection = client.create_collection( name="openai_docs", embedding_function=openai_ef )

HuggingFace

huggingface_ef = embedding_functions.HuggingFaceEmbeddingFunction(
    api_key="your-key",
    model_name="sentence-transformers/all-mpnet-base-v2"
)

collection = client.create_collection( name="hf_docs", embedding_function=huggingface_ef )

Custom embedding function

from chromadb import Documents, EmbeddingFunction, Embeddings

class MyEmbeddingFunction(EmbeddingFunction):
    def __call__(self, input: Documents) -> Embeddings:
        # Your embedding logic here
        return embeddings

my_ef = MyEmbeddingFunction()
collection = client.create_collection(
    name="custom_docs",
    embedding_function=my_ef
)
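
A concrete sketch of the pattern above, wrapping a sentence-transformers model directly; the model name and collection name are illustrative, and the sentence-transformers package is assumed to be installed:

from chromadb import Documents, EmbeddingFunction, Embeddings
from sentence_transformers import SentenceTransformer

class LocalEmbeddingFunction(EmbeddingFunction):
    def __init__(self):
        # Any local model works; all-MiniLM-L6-v2 is small and fast
        self.model = SentenceTransformer("all-MiniLM-L6-v2")

    def __call__(self, input: Documents) -> Embeddings:
        # encode() returns a numpy array; Chroma expects plain lists of floats
        return self.model.encode(list(input)).tolist()

collection = client.create_collection(
    name="local_docs",
    embedding_function=LocalEmbeddingFunction()
)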

Metadata filtering

Exact match

results = collection.query( query_texts=["query"], where={"category": "tutorial"} )

Comparison operators

results = collection.query( query_texts=["query"], where={"page": {"$gt": 10}} # $gt, $gte, $lt, $lte, $ne )

Logical operators

results = collection.query( query_texts=["query"], where={ "$and": [ {"category": "tutorial"}, {"difficulty": {"$lte": 3}} ] } # Also: $or )

Membership ($in)

results = collection.query( query_texts=["query"], where={"tags": {"$in": ["python", "ml"]}} )

LangChain integration

from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

Split documents

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000)
docs = text_splitter.split_documents(documents)

Create Chroma vector store

vectorstore = Chroma.from_documents(
    documents=docs,
    embedding=OpenAIEmbeddings(),
    persist_directory="./chroma_db"
)

Query

results = vectorstore.similarity_search("machine learning", k=3)

As retriever

retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
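
With recent LangChain versions the retriever is a runnable and can be invoked directly; a minimal usage sketch (the query text is illustrative):

docs = retriever.invoke("machine learning")
for doc in docs:
    # Each result is a LangChain Document with page_content and metadata
    print(doc.page_content[:80])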

LlamaIndex integration

from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import VectorStoreIndex, StorageContext
import chromadb

Initialize Chroma

db = chromadb.PersistentClient(path="./chroma_db")
collection = db.get_or_create_collection("my_collection")

Create vector store

vector_store = ChromaVectorStore(chroma_collection=collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

Create index

index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context
)

Query

query_engine = index.as_query_engine()
response = query_engine.query("What is machine learning?")
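
The response can be printed directly, and its source nodes show which chunks were retrieved; a minimal sketch, with attribute names following the llama_index.core API:

print(response)
for node_with_score in response.source_nodes:
    # Show the similarity score and the start of the retrieved chunk
    print(node_with_score.score, node_with_score.node.get_content()[:80])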

Server mode

Run Chroma server

Terminal: chroma run --path ./chroma_db --port 8000

Connect to server

import chromadb
from chromadb.config import Settings

client = chromadb.HttpClient(
    host="localhost",
    port=8000,
    settings=Settings(anonymized_telemetry=False)
)

Use as normal

collection = client.get_or_create_collection("my_docs")
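
A minimal connectivity check against the server; heartbeat() returns a timestamp and raises if the server at localhost:8000 is unreachable:

print(client.heartbeat())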

Best practices

- Use a persistent client - don't lose data on restart
- Add metadata - enables filtering and tracking
- Batch operations - add multiple docs at once
- Choose the right embedding model - balance speed and quality
- Use filters - narrow the search space
- Unique IDs - avoid collisions
- Regular backups - copy the chroma_db directory
- Monitor collection size - scale up if needed
- Test embedding functions - ensure quality
- Use server mode for production - better for multi-user setups

Performance

| Operation       | Latency   | Notes                      |
|-----------------|-----------|----------------------------|
| Add 100 docs    | ~1-3s     | With embedding             |
| Query (top 10)  | ~50-200ms | Depends on collection size |
| Metadata filter | ~10-50ms  | Fast with proper indexing  |

Resources

- GitHub: https://github.com/chroma-core/chroma (⭐ 24,300+)
- Docs: https://docs.trychroma.com
- Discord: https://discord.gg/MMeYNTmh3x
- Version: 1.3.3+
- License: Apache 2.0
