Working with Chroma Vector Database: Installation, Operations, and API Reference

Overview

Chroma is an open-source vector database for building AI applications with embeddings. It provides SDKs for Python and JavaScript/TypeScript, along with a server component. It focuses on developer productivity, with tools to store embeddings and metadata, embed documents and queries, and run similarity searches.

Installation

Install the Python package:

pip install chromadb

For JavaScript projects:

npm install chromadb

Verify the installation by accessing help documentation:

chroma --help
chroma docs

Client Initialization

In-Memory Client

Create a transient in-memory client:

import chromadb

client = chromadb.Client()

Persistent Storage

For data persistence across sessions:

import chromadb

storage_path = "/local/data/vector_store"
client = chromadb.PersistentClient(path=storage_path)

HTTP Client for Server Mode

Connect to a running Chroma server:

import chromadb

client = chromadb.HttpClient(host='localhost', port=8000)

A lightweight client-only package is available for applications that only connect to a remote Chroma server:

pip install chromadb-client

Health Checks

Verify connectivity using the heartbeat method:

timestamp = client.heartbeat()
# Returns nanosecond timestamp
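
The nanosecond value is easier to read once converted to a datetime. The following is a minimal sketch, assuming the heartbeat value counts nanoseconds since the Unix epoch. The helper name heartbeat_to_datetime and the sample value are illustrative and not part of Chroma:

```python
from datetime import datetime, timedelta, timezone


def heartbeat_to_datetime(nanoseconds: int) -> datetime:
    """Convert a nanosecond epoch timestamp to a timezone-aware UTC datetime."""
    seconds, remainder_ns = divmod(nanoseconds, 1_000_000_000)
    # datetime has microsecond resolution, so sub-microsecond digits are dropped.
    return datetime.fromtimestamp(seconds, tz=timezone.utc) + timedelta(
        microseconds=remainder_ns // 1_000
    )


# Hypothetical value standing in for the result of client.heartbeat()
sample = 1_700_000_000_123_456_789
print(heartbeat_to_datetime(sample).isoformat())
```

Integer arithmetic is used instead of dividing by 1e9 so that the conversion does not lose precision to floating-point rounding.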

Collections Management

Collection Naming Rules

Collection names must adhere to specific constraints:

  • Length between 3 and 63 characters
  • Must begin and end with a lowercase letter or digit
  • Can contain dots, dashes, and underscores in between
  • Cannot contain consecutive dots
  • Cannot be a valid IP address
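
The rules above can be checked before calling the server. This is a rough sketch of such a check in plain Python, not Chroma's own validation code. The function name is_valid_collection_name is hypothetical, and two details are assumptions: that characters between the first and last are limited to lowercase letters, digits, dots, dashes, and underscores, and that "IP address" means an IPv4 address:

```python
import ipaddress
import re

# First and last characters: lowercase letter or digit.
# Middle characters (assumed): lowercase letters, digits, dots, dashes, underscores.
_NAME_PATTERN = re.compile(r"[a-z0-9][a-z0-9._-]*[a-z0-9]")


def is_valid_collection_name(name: str) -> bool:
    """Return True if the name satisfies the naming rules listed above."""
    if not 3 <= len(name) <= 63:
        return False
    if not _NAME_PATTERN.fullmatch(name):
        return False
    if ".." in name:
        return False
    try:
        ipaddress.IPv4Address(name)
    except ValueError:
        return True  # Not parseable as an IPv4 address, so the name is allowed
    return False


for candidate in ["documents_store", "ab", "a..b", "192.168.1.1", "-docs"]:
    print(candidate, is_valid_collection_name(candidate))
```

Validating locally gives a clearer error message than waiting for the server to reject the name, but the server remains the source of truth if its rules differ by version.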

Creating Collections

# Create a new collection
my_collection = client.create_collection(name="documents_store")

# Get existing collection
my_collection = client.get_collection(name="documents_store")

# Get or create
my_collection = client.get_or_create_collection(name="documents_store")

# Delete collection
client.delete_collection(name="documents_store")

Distance Metrics

Customize the distance function through metadata:

collection = client.create_collection(
    name="cosine_collection",
    metadata={"hnsw:space": "cosine"}
)

Available distance metrics:

  • l2 (default): Squared L2 distance
  • ip: Inner product distance (1 minus the inner product)
  • cosine: Cosine distance (1 minus the cosine similarity)
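
To make the three options concrete, here is a plain-Python sketch of how each distance can be computed for two vectors. It follows the formulas as the Chroma documentation describes them (squared differences for l2, and one minus the inner product or cosine similarity for the other two). The function names are illustrative and this is not the library's implementation:

```python
import math
from typing import Sequence


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of squared component differences (no square root is taken)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def inner_product_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """One minus the dot product; smaller means more similar."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """One minus the cosine of the angle between the two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


u, v = [1.0, 2.0], [4.0, 6.0]
print(squared_l2(u, v), inner_product_distance(u, v), cosine_distance(u, v))
```

In all three cases a smaller value means a closer match, which is why query results are ordered by ascending distance.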

Collection Operations

# Count items
total_items = collection.count()

# Preview first few items
preview_data = collection.peek(limit=10)

# Rename collection
collection.modify(name="renamed_collection")

Data Operations

Adding Data

Insert documents with automatic embedding:

collection.add(
    documents=["First document text", "Second document text"],
    metadatas=[{"category": "tech"}, {"category": "science"}],
    ids=["doc_001", "doc_002"]
)

Or provide pre-computed embeddings:

collection.add(
    embeddings=[[0.12, 0.34, 0.56], [0.78, 0.90, 0.12]],
    documents=["Document one", "Document two"],
    metadatas=[{"source": "web"}, {"source": "file"}],
    ids=["vec_001", "vec_002"]
)

Querying Data

Search using embeddings:

results = collection.query(
    query_embeddings=[[0.11, 0.22, 0.33], [0.44, 0.55, 0.66]],
    n_results=5
)

Search using text (automatically embedded):

results = collection.query(
    query_texts=["search query text"],
    n_results=10
)

Retrieve specific items by ID, optionally narrowed by a metadata filter:

items = collection.get(
    ids=["doc_001", "doc_002"],
    where={"category": "tech"}
)

Filtering Results

Specify which fields to return:

results = collection.get(
    include=["documents", "metadatas"]
)

results = collection.query(
    query_embeddings=[[0.11, 0.22, 0.33]],
    include=["documents", "distances"]
)
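
Query results come back as a dictionary of nested lists, with one inner list per query, and fields left out of include are not populated. The sketch below flattens such a result into one row per hit. The sample dictionary is hand-written to mimic that shape (it is not real output), the helper name flatten_query_result is hypothetical, and it assumes omitted fields appear as None:

```python
from typing import Any, Dict, List


def flatten_query_result(result: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Turn a nested query result into a flat list of per-hit rows."""
    rows = []
    for query_index, ids in enumerate(result["ids"]):
        for rank, item_id in enumerate(ids):
            row = {"query": query_index, "rank": rank, "id": item_id}
            for field in ("documents", "metadatas", "distances"):
                values = result.get(field)
                if values is not None:  # Skip fields that were not requested
                    row[field] = values[query_index][rank]
            rows.append(row)
    return rows


# Hand-written stand-in for collection.query(..., include=["documents", "distances"])
sample_result = {
    "ids": [["doc_001", "doc_002"]],
    "documents": [["First document text", "Second document text"]],
    "distances": [[0.12, 0.48]],
    "metadatas": None,
}

for row in flatten_query_result(sample_result):
    print(row)
```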

Metadata Filtering

Use where clauses to filter by metadata:

# Equality filter
results = collection.get(where={"category": "tech"})

# Comparison operators
results = collection.get(
    where={"score": {"$gt": 50}}
)

# Logical operators
results = collection.get(
    where={
        "$and": [
            {"category": {"$eq": "tech"}},
            {"year": {"$gte": 2023}}
        ]
    }
)

Supported comparison operators: $eq, $ne, $gt, $gte, $lt, $lte, $in, $nin. Filters can be combined with the logical operators $and and $or.
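
To illustrate how a where clause is interpreted, here is a small plain-Python evaluator that applies a filter to a single metadata dictionary. It is only a sketch of the semantics, not Chroma's implementation. The function name matches is hypothetical, $or is included because Chroma supports it alongside $and, and treating a missing key as a non-match is an assumption:

```python
import operator
from typing import Any, Dict

_COMPARATORS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
    "$in": lambda value, options: value in options,
    "$nin": lambda value, options: value not in options,
}


def matches(metadata: Dict[str, Any], where: Dict[str, Any]) -> bool:
    """Return True if the metadata satisfies every condition in the filter."""
    for key, condition in where.items():
        if key == "$and":
            if not all(matches(metadata, sub) for sub in condition):
                return False
        elif key == "$or":
            if not any(matches(metadata, sub) for sub in condition):
                return False
        else:
            if key not in metadata:
                return False  # Assumption: a missing key never matches
            if not isinstance(condition, dict):
                condition = {"$eq": condition}  # {"k": v} is shorthand for $eq
            for op, operand in condition.items():
                if not _COMPARATORS[op](metadata[key], operand):
                    return False
    return True


records = [
    {"category": "tech", "year": 2024, "score": 80},
    {"category": "tech", "year": 2021, "score": 40},
    {"category": "science", "year": 2023, "score": 65},
]
where = {"$and": [{"category": {"$eq": "tech"}}, {"year": {"$gte": 2023}}]}
print([record for record in records if matches(record, where)])
```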

Document Content Filtering

# Contains filter
results = collection.get(
    where_document={"$contains": "keyword"}
)

# Not contains
results = collection.get(
    where_document={"$not_contains": "excluded_term"}
)

Updating Data

# Update existing items
collection.update(
    ids=["doc_001"],
    documents=["Updated document text"],
    metadatas=[{"category": "updated"}]
)

# Upsert - update or insert
collection.upsert(
    ids=["doc_003"],
    documents=["New or updated document"],
    metadatas=[{"category": "new"}]
)
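
The difference between the two calls can be pictured with a toy in-memory store. This sketch uses a plain dictionary rather than Chroma, and it assumes that update leaves unknown IDs untouched while upsert creates them. The function names mirror the collection methods but are stand-ins written for this example:

```python
from typing import Dict, List


def update(store: Dict[str, str], ids: List[str], documents: List[str]) -> List[str]:
    """Overwrite only IDs that already exist; return the IDs that were skipped."""
    skipped = []
    for item_id, document in zip(ids, documents):
        if item_id in store:
            store[item_id] = document
        else:
            skipped.append(item_id)
    return skipped


def upsert(store: Dict[str, str], ids: List[str], documents: List[str]) -> None:
    """Overwrite existing IDs and insert any that are missing."""
    for item_id, document in zip(ids, documents):
        store[item_id] = document


store = {"doc_001": "Original text"}
print(update(store, ["doc_001", "doc_003"], ["Updated document text", "Ignored"]))
upsert(store, ["doc_003"], ["New or updated document"])
print(store)
```

Upsert is the safer choice for idempotent ingestion jobs, since re-running the job neither fails on existing IDs nor drops new ones.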

Deleting Data

collection.delete(
    ids=["doc_001", "doc_002"],
    where={"category": "obsolete"}
)

Server Deployment

Starting the Server

chroma run --path /data/chroma --host localhost --port 8000

Server Options

  • --path: Data storage directory (default: ./chroma_data)
  • --host: Server host (default: localhost)
  • --port: Server port (default: 8000)
  • --log-path: Log file path

Authentication

Basic Authentication

Generate credentials on the server:

htpasswd -Bbn admin_user secure_password > server.htpasswd

Configure server environment:

export CHROMA_SERVER_AUTH_CREDENTIALS_FILE="server.htpasswd"
export CHROMA_SERVER_AUTH_CREDENTIALS_PROVIDER="chromadb.auth.providers.HtpasswdFileServerAuthCredentialsProvider"
export CHROMA_SERVER_AUTH_PROVIDER="chromadb.auth.basic.BasicAuthServerProvider"

Client configuration:

import chromadb
from chromadb.config import Settings

client = chromadb.HttpClient(
    settings=Settings(
        chroma_client_auth_provider="chromadb.auth.basic.BasicAuthClientProvider",
        chroma_client_auth_credentials="admin_user:secure_password"
    )
)
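
For debugging with tools such as curl, it can help to know what the credentials string turns into on the wire. The sketch below builds a standard HTTP Basic Authorization header from a "user:password" string. It assumes Chroma's basic provider follows the usual HTTP Basic scheme; the helper name basic_auth_header is hypothetical and not part of Chroma:

```python
import base64
from typing import Dict


def basic_auth_header(credentials: str) -> Dict[str, str]:
    """Build an HTTP Basic Authorization header from 'user:password'."""
    encoded = base64.b64encode(credentials.encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {encoded}"}


header = basic_auth_header("admin_user:secure_password")
print(header)
```

Base64 is an encoding, not encryption, so basic authentication should only be used over TLS.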

Token Authentication

Server configuration:

export CHROMA_SERVER_AUTH_CREDENTIALS="api_token_value"
export CHROMA_SERVER_AUTH_CREDENTIALS_PROVIDER="chromadb.auth.token.TokenConfigServerAuthCredentialsProvider"
export CHROMA_SERVER_AUTH_PROVIDER="chromadb.auth.token.TokenAuthServerProvider"

Client configuration:

import chromadb
from chromadb.config import Settings

client = chromadb.HttpClient(
    settings=Settings(
        chroma_client_auth_provider="chromadb.auth.token.TokenAuthClientProvider",
        chroma_client_auth_credentials="api_token_value"
    )
)

Embedding Functions

Default Embedding

Chroma uses all-MiniLM-L6-v2 from Sentence Transformers by default:

from chromadb.utils import embedding_functions

default_ef = embedding_functions.DefaultEmbeddingFunction()
embeddings = default_ef(["sample text"])

Sentence Transformers

st_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)
embeddings = st_ef(["text to embed"])

Custom Embedding Function

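from chromadb import Documents, EmbeddingFunction, Embeddings

class CustomEmbedder(EmbeddingFunction):
    def __call__(self, input: Documents) -> Embeddings:
        # Placeholder logic: replace with a real model call.
        # Must return one vector per input document, all of the same length.
        computed_embeddings = [[float(len(text))] for text in input]
        return computed_embeddings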

collection = client.create_collection(
    name="custom_embeds",
    embedding_function=CustomEmbedder()
)
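
The contract a custom embedder has to satisfy is simple: a list of strings goes in, and a list of equal-length float vectors comes out, one per string. The following toy embedder illustrates that contract without any model or external dependency. It is a sketch for experimentation only: the class name HashingEmbedder and the hashing scheme are invented for this example, it does not subclass Chroma's EmbeddingFunction, and the vectors carry no semantic meaning:

```python
import hashlib
import math
from typing import List


class HashingEmbedder:
    """Toy embedder: hashes each token into a bucket, then L2-normalizes."""

    def __init__(self, dimensions: int = 8) -> None:
        self.dimensions = dimensions

    def __call__(self, input: List[str]) -> List[List[float]]:
        # One vector per input string, all with the same length.
        return [self._embed(text) for text in input]

    def _embed(self, text: str) -> List[float]:
        vector = [0.0] * self.dimensions
        for token in text.lower().split():
            digest = hashlib.sha256(token.encode("utf-8")).digest()
            bucket = int.from_bytes(digest[:4], "big") % self.dimensions
            vector[bucket] += 1.0
        norm = math.sqrt(sum(x * x for x in vector))
        # Leave all-zero vectors (empty text) as they are to avoid dividing by zero.
        return [x / norm for x in vector] if norm else vector


embedder = HashingEmbedder(dimensions=8)
vectors = embedder(["sample text", "another sample"])
print(len(vectors), len(vectors[0]))
```

Because the hashing is deterministic, the same text always maps to the same vector, which is the property a collection relies on when it embeds documents at insert time and queries at search time.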

Supported Providers

Chroma provides wrappers for multiple embedding providers including OpenAI, Google Generative AI, Cohere, Hugging Face, and Jina AI.

Multi-Modal Support

Chroma supports multi-modal collections for handling images and text together.

Setup

from chromadb.utils.embedding_functions import OpenCLIPEmbeddingFunction
from chromadb.utils.data_loaders import ImageLoader

img_loader = ImageLoader()
multi_ef = OpenCLIPEmbeddingFunction()

multi_collection = client.create_collection(
    name='multimodal_data',
    embedding_function=multi_ef,
    data_loader=img_loader
)

Adding Multi-Modal Data

# Add images (NumPy arrays)
multi_collection.add(
    ids=['img_001', 'img_002'],
    images=[image_array_1, image_array_2]
)

# Add text to the same collection
multi_collection.add(
    ids=['txt_001', 'txt_002'],
    documents=["Text description one", "Text description two"]
)

# Add via URIs (loaded through the collection's data loader)
multi_collection.add(
    ids=['uri_001', 'uri_002'],
    uris=['file:///path/to/image1.jpg', 'file:///path/to/image2.png']
)

Querying Multi-Modal Collections

# Query with image
results = multi_collection.query(
    query_images=[query_image_array]
)

# Query with text
results = multi_collection.query(
    query_texts=["search description"]
)

# Query with URI
results = multi_collection.query(
    query_uris=['file:///path/to/query.jpg']
)

API Quick Reference

Client Methods

client.list_collections()
client.create_collection(name="name")
client.get_collection(name="name")
client.get_or_create_collection(name="name")
client.delete_collection(name="name")
client.reset()  # Clears all data (requires allow_reset=True in Settings)
client.heartbeat()  # Health check

Collection Methods

collection.add(documents=[], ids=[], embeddings=[], metadatas=[])
collection.update(ids=[], documents=[], embeddings=[], metadatas=[])
collection.upsert(ids=[], documents=[], embeddings=[], metadatas=[])
collection.get(ids=[], where={}, where_document={})
collection.query(query_embeddings=[], n_results=10, where={})
collection.delete(ids=[], where={})
collection.count()
collection.peek(limit=5)
collection.modify(name="new_name")

Posted on Sun, 17 May 2026 00:53:50 +0000 by amit.patel