Embedding Specifications

| Property | Value |
| --- | --- |
| Base dimension | 1024 |
| Matryoshka slicing | 1024, 512, 256, 128, 64, 32, 16 |
| Distance metric | Cosine similarity |
| Normalization | L2 normalized |

Slice embeddings client-side by taking the first N elements:
embedding_1024 = response["vector"]  # Full 1024-d
embedding_256 = embedding_1024[:256]  # Slice to 256-d
Slice only to the prefix sizes we explicitly support (16, 32, 64, 128, 256, 512, 1024). Avoid arbitrary cut points and non-prefix subsets of dimensions: the model is trained to make only these prefixes meaningful, so any other subset loses the Matryoshka guarantee and is not comparable across experiments.
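The slicing rule above can be wrapped in a small guard. This is a minimal sketch, not part of any SDK: `slice_embedding` and `SUPPORTED_DIMS` are hypothetical names, and the optional renormalization step is an assumption about what downstream code needs. A prefix of a unit-length vector generally has a norm below 1. Cosine similarity is unaffected by that, but a plain dot product is not.

```python
import math

# Prefix sizes listed in the specification table above.
SUPPORTED_DIMS = (16, 32, 64, 128, 256, 512, 1024)


def slice_embedding(vector, dim, renormalize=True):
    """Return the first `dim` components of a full embedding (hypothetical helper)."""
    if dim not in SUPPORTED_DIMS:
        raise ValueError(f"dim must be one of {SUPPORTED_DIMS}, got {dim}")
    if len(vector) < dim:
        raise ValueError(f"vector has {len(vector)} components, need at least {dim}")

    prefix = list(vector[:dim])
    if not renormalize:
        return prefix

    # Rescale the prefix back to unit length so a dot product
    # between two slices equals their cosine similarity.
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0.0:
        return prefix  # nothing to rescale
    return [x / norm for x in prefix]
```

Rejecting unsupported sizes at the call site keeps an accidental `[:100]` from silently entering an experiment.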
Matryoshka embeddings differ fundamentally from post-hoc dimensionality reduction:

| Approach | How it works | Trade-offs |
| --- | --- | --- |
| Matryoshka | Model is trained to encode the most important information in the earliest dimensions. Prefix slices are semantically valid by design. | No extra compute at inference: just slice the array. |
| PCA | Linear projection fitted on existing embeddings. | Loses non-linear structure. Requires fitting and storing a projection matrix. |
| t-SNE | Non-linear transform optimized for 2D/3D visualization. | Expensive to compute. Not designed for downstream ML tasks. |
| UMAP | Non-linear; generally better suited than t-SNE to downstream ML. | Still requires fitting. New samples need a transform step. |

With Matryoshka, embedding dimension becomes a hyperparameter you can tune at essentially zero marginal cost: no re-embedding, no projection matrices to fit or store, and no distortion from a post-hoc transform. Smaller slices still carry less signal than the full vector, so validate each choice.
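To make the operational difference concrete, here is a rough sketch that reduces the same batch to 64 dimensions both ways. The batch is random synthetic data standing in for real embeddings, so it has no Matryoshka structure and says nothing about quality. It only illustrates what has to be fitted and stored. PCA is done with a plain NumPy SVD here as an assumption, to keep the example dependency-light.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a batch of full embeddings (illustration only).
full = rng.normal(size=(200, 1024))
full /= np.linalg.norm(full, axis=1, keepdims=True)

target_dim = 64

# Matryoshka: nothing to fit or store, the reduction is a prefix slice.
sliced = full[:, :target_dim]

# PCA: fit a mean and a projection matrix on a reference batch,
# keep both, and apply them to every new vector.
mean = full.mean(axis=0)
_, _, vt = np.linalg.svd(full - mean, full_matrices=False)
components = vt[:target_dim]               # (64, 1024) projection matrix
projected = (full - mean) @ components.T   # (200, 64)

# Parameters that must be persisted and versioned for the PCA route.
stored_floats = mean.size + components.size
print(sliced.shape, projected.shape, stored_floats)
# prints: (200, 64) (200, 64) 66560
```

The PCA route also ties every downstream consumer to the batch the projection was fitted on, while a prefix slice depends only on the embedding itself.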

Dimension Selection

| Dimension | Use case | Model type |
| --- | --- | --- |
| 16-64 | Resource-constrained, simple rules | Decision trees, logistic regression |
| 128 | Latency-sensitive, linear models | Online scoring, real-time APIs |
| 256-512 | Balanced performance | XGBoost, LightGBM, CatBoost |
| 1024 | Maximum signal | Deep learning, vector retrieval |

Start with 256-d for tree-based models. Only scale up if validation metrics improve.

Late Fusion Pattern

The recommended integration approach: combine embeddings with your features in a tree-based model.
import pandas as pd
from lightgbm import LGBMClassifier

# Your features
df = pd.DataFrame({
    "revenue": [...],
    "age_months": [...],
    "bureau_score": [...],
    "default": [...]  # target
})

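# Add Avra embeddings (256-d for tree models).
# embeddings_256d is assumed to hold one 256-d vector per row, in the same order as df.
embedding_cols = [f"emb_{i}" for i in range(256)]
emb_df = pd.DataFrame(embeddings_256d, columns=embedding_cols, index=df.index)
# A single concat avoids the DataFrame fragmentation caused by adding 256 columns one at a time
df = pd.concat([df, emb_df], axis=1)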

# Train
X = df.drop("default", axis=1)
y = df["default"]

model = LGBMClassifier()
model.fit(X, y)

Hyperparameter search (embedding dimension)

Treat the embedding dimension as a tunable hyperparameter. Because the embeddings are Matryoshka-sliced, you can evaluate multiple dimensions without re-embedding.
import numpy as np
import optuna
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

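# Same rows and order as df in the Late Fusion example above.
# embeddings_1024d is assumed to hold the full 1024-d vector for each row.
embedding_1024 = np.array(embeddings_1024d)  # (n_samples, 1024)
base_features = df[["revenue", "age_months", "bureau_score"]].values
labels = df["default"].values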

idx = np.arange(len(df))
train_idx, val_idx, y_train, y_val = train_test_split(
    idx, labels, test_size=0.2, random_state=42, stratify=labels
)

X_train_base = base_features[train_idx]
X_val_base = base_features[val_idx]

def objective(trial):
    dim = trial.suggest_categorical("embedding_dim", [16, 32, 64, 128, 256, 512, 1024])

    emb_train = embedding_1024[train_idx, :dim]
    emb_val = embedding_1024[val_idx, :dim]

    X_train = np.hstack([X_train_base, emb_train])
    X_val = np.hstack([X_val_base, emb_val])

    model = lgb.LGBMClassifier(
        n_estimators=trial.suggest_int("n_estimators", 200, 1200),
        learning_rate=trial.suggest_float("learning_rate", 0.01, 0.2, log=True),
        num_leaves=trial.suggest_int("num_leaves", 31, 255),
        max_depth=trial.suggest_int("max_depth", 3, 10),
        subsample=trial.suggest_float("subsample", 0.6, 1.0),
        subsample_freq=1,  # row subsampling is only applied when subsample_freq > 0
        colsample_bytree=trial.suggest_float("colsample_bytree", 0.6, 1.0),
        random_state=42,
        verbosity=-1
    )

    model.fit(
        X_train,
        y_train,
        eval_set=[(X_val, y_val)],
        callbacks=[lgb.early_stopping(50, verbose=False)]
    )

    preds = model.predict_proba(X_val)[:, 1]
    return roc_auc_score(y_val, preds)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50, show_progress_bar=True)

print(f"Best AUC: {study.best_value:.4f}")
print(f"Best dimension: {study.best_params['embedding_dim']}")

Find entities similar to a seed entity:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def find_similar(seed_embedding, candidate_embeddings, top_k=100):
    similarities = cosine_similarity([seed_embedding], candidate_embeddings)[0]
    top_indices = np.argsort(similarities)[-top_k:][::-1]
    return top_indices, similarities[top_indices]

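# Find companies similar to your best customers.
# get_embedding() stands in for your own lookup (API call or cache read).
best_customer_emb = get_embedding("12345678000199")
similar_idx, scores = find_similar(best_customer_emb, all_embeddings, top_k=1000)
# If the seed is also in all_embeddings, it comes back as the top hit; drop it if needed.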
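When the seed is a set of entities rather than a single one, one common option is to rank candidates by cosine similarity to the normalized centroid of the seed vectors. The sketch below assumes that approach. `find_similar_to_seed_set` is a hypothetical name, and all vectors are assumed to be non-zero (true for L2-normalized embeddings). It uses `np.argpartition` so that only the top `top_k` candidates are fully sorted.

```python
import numpy as np


def find_similar_to_seed_set(seed_embeddings, candidate_embeddings, top_k=100):
    """Rank candidates by cosine similarity to the centroid of a seed set."""
    seeds = np.asarray(seed_embeddings, dtype=float)
    candidates = np.asarray(candidate_embeddings, dtype=float)

    top_k = min(top_k, len(candidates))
    if top_k <= 0:
        return np.array([], dtype=int), np.array([], dtype=float)

    # Average the seeds and rescale to unit length.
    centroid = seeds.mean(axis=0)
    centroid /= np.linalg.norm(centroid)

    # Normalize candidate rows so the dot product equals cosine similarity
    # even when the rows are sliced prefixes rather than unit vectors.
    norms = np.linalg.norm(candidates, axis=1, keepdims=True)
    similarities = (candidates / norms) @ centroid

    # Select the top_k without sorting everything, then order just those.
    top = np.argpartition(similarities, -top_k)[-top_k:]
    top = top[np.argsort(similarities[top])[::-1]]
    return top, similarities[top]
```

A centroid works best when the seeds are reasonably homogeneous. For a seed set that spans several distinct segments, querying per seed and merging the results may be a better fit.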

Caching Strategy

Persist embeddings with metadata for reproducibility:
embedding_response = {
    "model_snapshot": "...",
    "generated_at": "..."
}

cache_record = {
    "legal_document": "12345678000199",
    "vector": embedding,
    "model_snapshot": embedding_response["model_snapshot"],
    "generated_at": embedding_response["generated_at"],
    "dimension": 1024
}
Refresh when:
  • A new GFM or RFM snapshot is promoted (webhook notification)
  • A downstream model retrains and feeds signal back into your RFM
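The first refresh condition can be checked mechanically against the cached metadata. The following is a minimal sketch under assumptions: `needs_refresh` is a hypothetical helper, and `current_snapshot` is whatever identifier your webhook handler or latest API response reports for the promoted snapshot. The record layout follows the `cache_record` example above.

```python
def needs_refresh(cache_record, current_snapshot, expected_dim=1024):
    """Return True when a cached embedding should be regenerated."""
    # Nothing cached yet.
    if cache_record is None:
        return True

    # The embedding was produced by a snapshot that is no longer current.
    if cache_record.get("model_snapshot") != current_snapshot:
        return True

    # The stored vector is missing or does not match the expected size.
    vector = cache_record.get("vector")
    if vector is None or len(vector) != expected_dim:
        return True

    # The recorded dimension disagrees with the expected size.
    return cache_record.get("dimension") != expected_dim
```

Comparing snapshot identifiers rather than timestamps avoids refreshing records that are old but still produced by the current snapshot.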

Monitoring

Track embedding quality over time:
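# Monitor distribution drift
import numpy as np
from scipy.stats import ks_2samp

# Full vectors are L2-normalized, so their norms are all ~1.0 and carry no drift signal.
# Compare each batch's cosine similarity to the historical centroid instead.
historical = np.asarray(historical_embeddings)
current = np.asarray(current_embeddings)
centroid = historical.mean(axis=0)
centroid /= np.linalg.norm(centroid)

historical_sims = historical @ centroid  # cosine similarity, since rows are unit-length
current_sims = current @ centroid

stat, pvalue = ks_2samp(historical_sims, current_sims)
if pvalue < 0.05:
    alert("Embedding distribution shift detected")  # alert() is your own notification hook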
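Alongside drift detection, a cheap integrity check on each incoming batch can catch malformed vectors before they reach a model or cache. This is a sketch that assumes full vectors arrive L2-normalized with 1024 dimensions, as the specification table states. `check_batch` is a hypothetical helper and the `norm_tol` tolerance is an assumed value to adjust for your storage precision.

```python
import numpy as np


def check_batch(embeddings, expected_dim=1024, norm_tol=1e-3):
    """Count vectors in a batch that violate the published embedding spec."""
    batch = np.asarray(embeddings, dtype=float)
    if batch.ndim != 2 or batch.shape[1] != expected_dim:
        raise ValueError(f"expected shape (n, {expected_dim}), got {batch.shape}")

    finite_rows = np.isfinite(batch).all(axis=1)
    norms = np.linalg.norm(batch[finite_rows], axis=1)

    return {
        "n_vectors": int(batch.shape[0]),
        # Rows containing NaN or infinity.
        "n_non_finite": int((~finite_rows).sum()),
        # Finite rows whose norm is not within norm_tol of 1.0.
        "n_bad_norm": int((np.abs(norms - 1.0) > norm_tol).sum()),
    }
```

Non-zero counts usually point to a pipeline problem (truncated payloads, accidental slicing, serialization errors) rather than a change in the underlying data.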