When working with high-dimensional embeddings—such as 256-dimensional vectors that lie on a hypersphere after training—it's often useful to project them into 2D or 3D space to inspect cluster structure or class separation.
Two widely used techniques for this purpose are Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE). PCA is a linear method that projects the data onto its directions of greatest variance and so tends to preserve global structure, while t-SNE is non-linear and better at revealing local clusters because it models pairwise similarities between neighboring points.
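To make the contrast concrete, here is a minimal sketch that runs both methods on synthetic data. The three-cluster data set, the `make_clusters` helper, and every parameter value are assumptions chosen for illustration. PCA exposes `explained_variance_ratio_`, which reports the share of total variance each component retains. t-SNE has no equivalent, but scikit-learn records the final KL divergence between the high- and low-dimensional similarity distributions in `kl_divergence_`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE


def make_clusters(n_per_cluster=50, n_features=256, n_clusters=3, seed=0):
    """Hypothetical helper: Gaussian blobs projected onto the unit hypersphere."""
    rng = np.random.default_rng(seed)
    centers = rng.normal(size=(n_clusters, n_features))
    points = np.concatenate([
        center + 0.3 * rng.normal(size=(n_per_cluster, n_features))
        for center in centers
    ])
    labels = np.repeat(np.arange(n_clusters), n_per_cluster)
    # L2-normalize each row so every vector lies on the unit hypersphere
    points /= np.linalg.norm(points, axis=1, keepdims=True)
    return points, labels


embeddings, labels = make_clusters()

# PCA: linear projection; each component reports its share of total variance
pca = PCA(n_components=3)
proj_pca = pca.fit_transform(embeddings)
print("Variance retained by 3 PCs:", pca.explained_variance_ratio_.sum())

# t-SNE: non-linear; perplexity must be smaller than the number of samples
tsne = TSNE(n_components=3, perplexity=30, random_state=42)
proj_tsne = tsne.fit_transform(embeddings)
print("Final KL divergence:", tsne.kl_divergence_)
```

A low retained-variance figure suggests that a linear 3D view discards much of the structure, which is one reason to look at the t-SNE projection as well.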
Below is a Python implementation using scikit-learn and plotly to generate interactive 3D visualizations:
import os

import numpy as np
import plotly.graph_objects as go
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE


def plot_embeddings_3d(embeddings, output_file, group_ids=None):
    """
    Project high-dimensional embeddings into 3D using PCA and t-SNE.

    Args:
        embeddings: numpy array of shape (n_samples, n_features)
        output_file: base path for saving HTML plots
        group_ids: optional labels for coloring points
    """
    base_name = os.path.splitext(output_file)[0]

    # Encode labels (strings or ints) as integer codes, since plotly
    # expects numeric values or color strings for marker colors
    colors = None
    if group_ids is not None:
        _, colors = np.unique(np.asarray(group_ids), return_inverse=True)

    # PCA projection
    print("Running PCA...")
    reducer_pca = PCA(n_components=3)
    proj_pca = reducer_pca.fit_transform(embeddings)

    # t-SNE projection (perplexity must be smaller than n_samples)
    print("Running t-SNE...")
    reducer_tsne = TSNE(n_components=3, perplexity=30, learning_rate=200, random_state=42)
    proj_tsne = reducer_tsne.fit_transform(embeddings)

    # Plot PCA
    fig1 = go.Figure(data=go.Scatter3d(
        x=proj_pca[:, 0],
        y=proj_pca[:, 1],
        z=proj_pca[:, 2],
        mode='markers',
        marker=dict(size=4, color=colors, opacity=0.7)
    ))
    fig1.update_layout(title="PCA Projection", scene=dict(
        xaxis_title="PC1",
        yaxis_title="PC2",
        zaxis_title="PC3"
    ))
    fig1.write_html(f"{base_name}_pca.html")

    # Plot t-SNE
    fig2 = go.Figure(data=go.Scatter3d(
        x=proj_tsne[:, 0],
        y=proj_tsne[:, 1],
        z=proj_tsne[:, 2],
        mode='markers',
        marker=dict(size=4, color=colors, opacity=0.7)
    ))
    fig2.update_layout(title="t-SNE Projection", scene=dict(
        xaxis_title="Dim 1",
        yaxis_title="Dim 2",
        zaxis_title="Dim 3"
    ))
    fig2.write_html(f"{base_name}_tsne.html")
This function accepts an (N, 256) embedding matrix and, optionally, a list of class or cluster labels. It writes two interactive HTML files, one per projection, that can be rotated and zoomed to explore spatial relationships. Note that t-SNE is stochastic, so its results can vary between runs; fixing random_state makes them repeatable.
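The effect of random_state can be checked directly. The sketch below is an illustration under stated assumptions, not part of the function above: it uses random unit-norm vectors as stand-in embeddings, a hypothetical `run_tsne` helper, and `init="random"` so that the seed controls the starting layout (recent scikit-learn releases default to a PCA-based initialization, which depends far less on the seed).

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in data (assumption): 100 random vectors normalized onto the unit hypersphere
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 256))
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)


def run_tsne(seed):
    """Hypothetical helper: one 3D t-SNE run from a seeded random initialization."""
    reducer = TSNE(n_components=3, perplexity=30, init="random", random_state=seed)
    return reducer.fit_transform(embeddings)


same_a = run_tsne(42)
same_b = run_tsne(42)
other = run_tsne(7)

# Compare the layouts: the same seed should reproduce the run, another seed should not
print("Same seed matches:", np.allclose(same_a, same_b, atol=1e-3))
print("Different seed matches:", np.allclose(same_a, other, atol=1e-3))
```

Because layouts from different seeds can differ by rotation, reflection, or cluster placement, it is safer to compare which points end up grouped together than to compare raw coordinates across runs.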