Overview
Chroma is an open-source vector database for building AI applications on top of embeddings. It ships SDKs for Python and JavaScript/TypeScript along with a server component. The focus is developer productivity: Chroma stores embeddings with their metadata, embeds documents and queries, and runs similarity searches.
Installation
Install the Python package:
pip install chromadb
For JavaScript projects:
npm install chromadb
Verify the installation by accessing help documentation:
chroma --help
chroma docs
Client Initialization
In-Memory Client
Create a transient in-memory client:
import chromadb
client = chromadb.Client()
Persistent Storage
For data persistence across sessions:
import chromadb
storage_path = "/local/data/vector_store"
client = chromadb.PersistentClient(path=storage_path)
HTTP Client for Server Mode
Connect to a running Chroma server:
import chromadb
client = chromadb.HttpClient(host='localhost', port=8000)
A lightweight client-only package is available for server deployments:
pip install chromadb-client
Health Checks
Verify connectivity using the heartbeat method:
timestamp = client.heartbeat()
# Returns nanosecond timestamp
Collections Management
Collection Naming Rules
Collection names must adhere to specific constraints:
- Length between 3 and 63 characters
- Must begin and end with lowercase letter or digit
- Can contain dots, dashes, and underscores in between
- Cannot contain consecutive dots
- Cannot be a valid IP address
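The rules above can be checked locally before a collection is created. The sketch below is one possible encoding of them, not Chroma's own validation code: `is_valid_collection_name` is a hypothetical helper name, and it assumes the characters between the first and last may be lowercase letters, digits, dots, dashes, or underscores.

```python
import ipaddress
import re

# First and last characters: lowercase letter or digit.
# Middle characters (assumed set): lowercase letters, digits, '.', '-', '_'.
# 1 + {1,61} + 1 characters gives a total length of 3 to 63.
_NAME_PATTERN = re.compile(r"[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]")


def is_valid_collection_name(name: str) -> bool:
    """Return True if `name` satisfies the naming rules listed above."""
    if not _NAME_PATTERN.fullmatch(name):
        return False
    if ".." in name:  # no consecutive dots
        return False
    try:
        ipaddress.ip_address(name)  # raises ValueError if not an IP address
    except ValueError:
        return True
    return False  # the name parsed as an IP address, so it is rejected
```

Calling `is_valid_collection_name("documents_store")` before `client.create_collection(...)` lets an application report a naming problem without a round trip to the database.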
Creating Collections
# Create a new collection
my_collection = client.create_collection(name="documents_store")
# Get existing collection
my_collection = client.get_collection(name="documents_store")
# Get or create
my_collection = client.get_or_create_collection(name="documents_store")
# Delete collection
client.delete_collection(name="documents_store")
Distance Metrics
Customize the distance function through metadata:
collection = client.create_collection(
    name="cosine_collection",
    metadata={"hnsw:space": "cosine"}
)
Available distance metrics:
- l2 (default): Squared L2 distance
- ip: Inner product distance (1 - inner product)
- cosine: Cosine distance (1 - cosine similarity)
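To make the three metrics concrete, the sketch below computes each one in plain Python. It assumes the values are reported as distances (smaller means closer), with inner product and cosine expressed as one minus the similarity. The function names are illustrative and not part of the chromadb API.

```python
import math
from typing import Sequence


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of squared component differences (the "l2" space)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def inner_product_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """One minus the dot product (the "ip" space)."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """One minus the cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)
```

Cosine distance ignores vector length, so it is a common choice when embeddings are not normalized. For unit-length vectors, inner product and cosine give the same ranking.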
Collection Operations
# Count items
total_items = collection.count()
# Preview first few items
preview_data = collection.peek(limit=10)
# Rename collection
collection.modify(name="renamed_collection")
Data Operations
Adding Data
Insert documents with automatic embedding:
collection.add(
    documents=["First document text", "Second document text"],
    metadatas=[{"category": "tech"}, {"category": "science"}],
    ids=["doc_001", "doc_002"]
)
Or provide pre-computed embeddings:
collection.add(
    embeddings=[[0.12, 0.34, 0.56], [0.78, 0.90, 0.12]],
    documents=["Document one", "Document two"],
    metadatas=[{"source": "web"}, {"source": "file"}],
    ids=["vec_001", "vec_002"]
)
Querying Data
Search using embeddings:
results = collection.query(
    query_embeddings=[[0.11, 0.22, 0.33], [0.44, 0.55, 0.66]],
    n_results=5
)
Search using text (automatically embedded):
results = collection.query(
    query_texts=["search query text"],
    n_results=10
)
Retrieve specific items by ID, optionally narrowed by a metadata filter:
items = collection.get(
    ids=["doc_001", "doc_002"],
    where={"category": "tech"}
)
Filtering Results
Specify which fields to return:
results = collection.get(
    include=["documents", "metadatas"]
)
results = collection.query(
    query_embeddings=[[0.11, 0.22, 0.33]],
    include=["documents", "distances"]
)
Metadata Filtering
Use where clauses to filter by metadata:
# Equality filter
results = collection.get(where={"category": "tech"})
# Comparison operators
results = collection.get(
    where={"score": {"$gt": 50}}
)
# Logical operators
results = collection.get(
    where={
        "$and": [
            {"category": {"$eq": "tech"}},
            {"year": {"$gte": 2023}}
        ]
    }
)
Supported operators: $eq, $ne, $gt, $gte, $lt, $lte, $in, $nin
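As a rough mental model of how a where clause selects items, the sketch below evaluates a filter against a single metadata dictionary in plain Python. It is not Chroma's implementation. `matches` is a hypothetical helper, the handling of `$or` mirrors `$and`, and the choice that a missing key never matches is an assumption made for simplicity.

```python
from typing import Any, Callable, Dict, Mapping

# Comparison operators from the list above, applied as op(value, target).
_COMPARATORS: Dict[str, Callable[[Any, Any], bool]] = {
    "$eq": lambda value, target: value == target,
    "$ne": lambda value, target: value != target,
    "$gt": lambda value, target: value > target,
    "$gte": lambda value, target: value >= target,
    "$lt": lambda value, target: value < target,
    "$lte": lambda value, target: value <= target,
    "$in": lambda value, target: value in target,
    "$nin": lambda value, target: value not in target,
}


def matches(metadata: Mapping[str, Any], where: Mapping[str, Any]) -> bool:
    """Return True if `metadata` satisfies the `where` filter."""
    for key, condition in where.items():
        if key == "$and":
            if not all(matches(metadata, clause) for clause in condition):
                return False
        elif key == "$or":
            if not any(matches(metadata, clause) for clause in condition):
                return False
        else:
            if key not in metadata:  # assumption: missing keys never match
                return False
            # A bare value such as {"category": "tech"} is shorthand for $eq.
            ops = condition if isinstance(condition, dict) else {"$eq": condition}
            for op, target in ops.items():
                if not _COMPARATORS[op](metadata[key], target):
                    return False
    return True
```

For example, `matches({"category": "tech", "year": 2024}, {"$and": [{"category": {"$eq": "tech"}}, {"year": {"$gte": 2023}}]})` is true, while changing the year to 2020 makes it false.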
Document Content Filtering
# Contains filter
results = collection.get(
    where_document={"$contains": "keyword"}
)
# Not contains
results = collection.get(
    where_document={"$not_contains": "excluded_term"}
)
Updating Data
# Update existing items
collection.update(
    ids=["doc_001"],
    documents=["Updated document text"],
    metadatas=[{"category": "updated"}]
)
# Upsert - update or insert
collection.upsert(
    ids=["doc_003"],
    documents=["New or updated document"],
    metadatas=[{"category": "new"}]
)
Deleting Data
collection.delete(
    ids=["doc_001", "doc_002"],
    where={"category": "obsolete"}
)
Server Deployment
Starting the Server
chroma run --path /data/chroma --host localhost --port 8000
Server Options
- --path: Data storage directory (default: ./chroma_data)
- --host: Server host (default: localhost)
- --port: Server port (default: 8000)
- --log-path: Log file path
Authentication
Basic Authentication
Generate credentials on the server:
htpasswd -Bbn admin_user secure_password > server.htpasswd
Configure server environment:
export CHROMA_SERVER_AUTH_CREDENTIALS_FILE="server.htpasswd"
export CHROMA_SERVER_AUTH_CREDENTIALS_PROVIDER="chromadb.auth.providers.HtpasswdFileServerAuthCredentialsProvider"
export CHROMA_SERVER_AUTH_PROVIDER="chromadb.auth.basic.BasicAuthServerProvider"
Client configuration:
import chromadb
from chromadb.config import Settings
client = chromadb.HttpClient(
    settings=Settings(
        chroma_client_auth_provider="chromadb.auth.basic.BasicAuthClientProvider",
        chroma_client_auth_credentials="admin_user:secure_password"
    )
)
Token Authentication
Server configuration:
export CHROMA_SERVER_AUTH_CREDENTIALS="api_token_value"
export CHROMA_SERVER_AUTH_CREDENTIALS_PROVIDER="chromadb.auth.token.TokenConfigServerAuthCredentialsProvider"
export CHROMA_SERVER_AUTH_PROVIDER="chromadb.auth.token.TokenAuthServerProvider"
Client configuration:
import chromadb
from chromadb.config import Settings
client = chromadb.HttpClient(
    settings=Settings(
        chroma_client_auth_provider="chromadb.auth.token.TokenAuthClientProvider",
        chroma_client_auth_credentials="api_token_value"
    )
)
Embedding Functions
Default Embedding
Chroma uses all-MiniLM-L6-v2 from Sentence Transformers by default:
from chromadb.utils import embedding_functions
default_ef = embedding_functions.DefaultEmbeddingFunction()
embeddings = default_ef(["sample text"])
Sentence Transformers
st_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)
embeddings = st_ef(["text to embed"])
Custom Embedding Function
from chromadb import Documents, EmbeddingFunction, Embeddings
class CustomEmbedder(EmbeddingFunction):
    def __call__(self, input: Documents) -> Embeddings:
        # Custom embedding logic goes here: produce one vector per input document.
        # `computed_embeddings` is a placeholder for that result.
        return computed_embeddings
collection = client.create_collection(
    name="custom_embeds",
    embedding_function=CustomEmbedder()
)
Supported Providers
Chroma provides wrappers for multiple embedding providers including OpenAI, Google Generative AI, Cohere, Hugging Face, and Jina AI.
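Whichever provider is used, an embedding function follows the same contract: a list of documents goes in, and a list of equal-length float vectors comes out, one per document. The dependency-free sketch below illustrates that contract with a toy hashing embedder. `HashingEmbedder` and its `dim` parameter are hypothetical, the vectors carry no semantic meaning, and a real implementation would subclass `EmbeddingFunction` as in the custom example above.

```python
import hashlib
import math
from typing import List


class HashingEmbedder:
    """Toy embedder: counts tokens into a fixed number of hash buckets."""

    def __init__(self, dim: int = 8) -> None:
        self.dim = dim

    def __call__(self, input: List[str]) -> List[List[float]]:
        vectors: List[List[float]] = []
        for text in input:
            vec = [0.0] * self.dim
            for token in text.lower().split():
                # Use the first digest byte to pick a bucket deterministically.
                digest = hashlib.sha256(token.encode("utf-8")).digest()
                vec[digest[0] % self.dim] += 1.0
            # Scale to unit length; an empty document stays the zero vector.
            norm = math.sqrt(sum(x * x for x in vec)) or 1.0
            vectors.append([x / norm for x in vec])
        return vectors
```

Because the output is deterministic and fixed-width, a stand-in like this can be handy in unit tests where downloading a model or calling a hosted API is undesirable.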
Multi-Modal Support
Chroma supports multi-modal collections for handling images and text together.
Setup
from chromadb.utils.embedding_functions import OpenCLIPEmbeddingFunction
from chromadb.utils.data_loaders import ImageLoader
img_loader = ImageLoader()
multi_ef = OpenCLIPEmbeddingFunction()
multi_collection = client.create_collection(
    name='multimodal_data',
    embedding_function=multi_ef,
    data_loader=img_loader
)
Adding Multi-Modal Data
# Add images
multi_collection.add(
    ids=['img_001', 'img_002'],
    images=[image_array_1, image_array_2]
)
# Add text to same collection
multi_collection.add(
    ids=['txt_001', 'txt_002'],
    documents=["Text description one", "Text description two"]
)
# Add via URIs
multi_collection.add(
    ids=['uri_001', 'uri_002'],
    uris=['file:///path/to/image1.jpg', 'file:///path/to/image2.png']
)
Querying Multi-Modal Collections
# Query with image
results = multi_collection.query(
    query_images=[query_image_array]
)
# Query with text
results = multi_collection.query(
    query_texts=["search description"]
)
# Query with URI
results = multi_collection.query(
    query_uris=['file:///path/to/query.jpg']
)
API Quick Reference
Client Methods
client.list_collections()
client.create_collection(name="name")
client.get_collection(name="name")
client.get_or_create_collection(name="name")
client.delete_collection(name="name")
client.reset() # Clears all data; requires allow_reset=True in Settings
client.heartbeat() # Health check
Collection Methods
collection.add(documents=[], ids=[], embeddings=[], metadatas=[])
collection.update(ids=[], documents=[], embeddings=[], metadatas=[])
collection.upsert(ids=[], documents=[], embeddings=[], metadatas=[])
collection.get(ids=[], where={}, where_document={})
collection.query(query_embeddings=[], n_results=10, where={})
collection.delete(ids=[], where={})
collection.count()
collection.peek(limit=5)
collection.modify(name="new_name")