Embedding Specifications
| Property | Value |
|---|---|
| Base dimension | 1024 |
| Matryoshka slicing | 1024, 512, 256, 128, 64, 32, 16 |
| Distance metric | Cosine similarity |
| Normalization | L2 normalized |
Why not PCA, t-SNE, or UMAP?
Matryoshka embeddings are fundamentally different from post-hoc dimensionality reduction, as the comparison below shows.
With Matryoshka, dimensionality selection becomes a hyperparameter you can tune at zero marginal cost: no recomputation, no projection matrices, no information loss from post-hoc transforms.
| Approach | How it works | Trade-offs |
|---|---|---|
| Matryoshka | Model is trained to encode the most important information in earlier dimensions. Prefix slices are semantically valid by design. | Zero compute at inference: just slice the array. Re-normalize the slice if downstream code expects unit-length vectors. |
| PCA | Linear projection fitted on existing embeddings. | Loses non-linear structure. Requires fitting and storing projection matrix. |
| t-SNE | Non-linear transform optimized for 2D/3D visualization. | Expensive to compute. Not designed for downstream ML tasks. |
| UMAP | Non-linear, better than t-SNE for ML. | Still requires fitting. New samples need transform step. |
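The "just slice the array" step in the table can be sketched as follows. One caveat the sketch builds in: a prefix of an L2-normalized vector is generally no longer unit length, so the slice is re-normalized before it is used with cosine or dot-product similarity. The function name `slice_embedding` and the random stand-in data are illustrative assumptions, not part of any documented API.

```python
import numpy as np

# Slice sizes listed in the specification table above.
SUPPORTED_DIMS = (1024, 512, 256, 128, 64, 32, 16)

def slice_embedding(emb: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components of each row and re-normalize to unit L2 norm."""
    if dim not in SUPPORTED_DIMS:
        raise ValueError(f"dim must be one of {SUPPORTED_DIMS}, got {dim}")
    prefix = emb[..., :dim]
    norms = np.linalg.norm(prefix, axis=-1, keepdims=True)
    # Guard against an all-zero prefix to avoid division by zero.
    return prefix / np.clip(norms, 1e-12, None)

# Stand-in data: random unit vectors in place of real 1024-d embeddings.
rng = np.random.default_rng(0)
full = rng.normal(size=(4, 1024))
full /= np.linalg.norm(full, axis=1, keepdims=True)

small = slice_embedding(full, 256)  # shape (4, 256), rows are unit length again
```

Tree-based models are insensitive to per-row scaling, so the re-normalization mainly matters for similarity search and linear models.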
Dimension Selection
| Dimension | Use Case | Model Type |
|---|---|---|
| 16-64 | Resource-constrained, simple rules | Decision trees, logistic regression |
| 128 | Latency-sensitive online scoring, real-time APIs | Linear models |
| 256-512 | Balanced performance | XGBoost, LightGBM, CatBoost |
| 1024 | Maximum signal | Deep learning, vector retrieval |
Start with 256-d for tree-based models. Only scale up if validation metrics improve.
Late Fusion Pattern
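As a rough sketch of this pattern, the snippet below concatenates a 256-d prefix slice with a tabular feature matrix, row-aligned by entity, to build the input for a tree-based model. Reading "combine" as plain column-wise concatenation is an assumption, as are the helper name `fuse_features`, the array shapes, and the random stand-in data. The fitting step is left as a comment because it depends on the library you use.

```python
import numpy as np

def fuse_features(embeddings: np.ndarray, tabular: np.ndarray, dim: int = 256) -> np.ndarray:
    """Concatenate tabular features with the first `dim` embedding columns."""
    if embeddings.shape[0] != tabular.shape[0]:
        raise ValueError("embeddings and tabular features must have the same number of rows")
    # Row i of both inputs must describe the same entity.
    return np.hstack([tabular, embeddings[:, :dim]])

# Stand-in data: 100 entities, 1024-d embeddings, 12 hand-built features.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 1024))
tabular = rng.normal(size=(100, 12))

X = fuse_features(embeddings, tabular, dim=256)
# X now has 12 + 256 columns and can be passed to the tree-based model
# of your choice, for example model.fit(X, y).
```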
The recommended integration approach: combine embeddings with your features in a tree-based model.
Hyperparameter search (embedding dimension)
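A minimal sketch of a dimension search: embed once at full width, then score every prefix slice with the same validation routine. The `evaluate` callback is a placeholder for whatever train-and-validate step you already use. The nearest-centroid scorer and the synthetic data below are hypothetical stand-ins so the example runs on its own.

```python
import numpy as np

CANDIDATE_DIMS = (16, 32, 64, 128, 256, 512, 1024)

def search_dimension(X_train, y_train, X_val, y_val, evaluate, dims=CANDIDATE_DIMS):
    """Score each prefix slice with `evaluate`; return (best_dim, scores)."""
    scores = {}
    for d in dims:
        # Slicing is the only per-dimension cost; nothing is re-embedded.
        scores[d] = evaluate(X_train[:, :d], y_train, X_val[:, :d], y_val)
    # With dims in ascending order, ties resolve to the smallest dimension.
    best = max(scores, key=scores.get)
    return best, scores

def nearest_centroid_accuracy(X_train, y_train, X_val, y_val):
    """Hypothetical stand-in scorer: classify by the closest class centroid."""
    classes = np.unique(y_train)
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(X_val[:, None, :] - centroids[None, :, :], axis=2)
    pred = classes[dists.argmin(axis=1)]
    return float((pred == y_val).mean())

# Synthetic data with the class signal placed in the leading dimensions.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=400)
X = rng.normal(size=(400, 1024))
X[:, :8] += 2.0 * y[:, None]

best_dim, scores = search_dimension(
    X[:300], y[:300], X[300:], y[300:], nearest_centroid_accuracy
)
```

Passing the scorer as a callback keeps the search independent of the model library, so the same loop works for XGBoost, LightGBM, or a linear model.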
Treat the embedding dimension as a tunable hyperparameter. Because the embeddings are Matryoshka-sliced, you can evaluate multiple dimensions without re-embedding.
Similarity Search
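Ranking by cosine similarity to the seed centroid is one way to implement this search. The sketch below assumes that approach; the function name `similar_to_seeds` and the random stand-in data are illustrative only, and a production setup would more likely query a vector index than scan a full matrix.

```python
import numpy as np

def similar_to_seeds(embeddings: np.ndarray, seed_idx, top_k: int = 5):
    """Return (indices, similarities) of the top_k non-seed rows closest to the seed centroid."""
    # Normalize defensively so the dot product equals cosine similarity.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    centroid = unit[seed_idx].mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    sims = unit @ centroid
    sims[seed_idx] = -np.inf  # never return the seeds themselves
    order = np.argsort(-sims)[:top_k]
    return order, sims[order]

# Stand-in data: rows 0-4 form a tight cluster, the rest are random.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(50, 64))
cluster = rng.normal(size=64)
embeddings[:5] = cluster + 0.1 * rng.normal(size=(5, 64))

# Rows 3 and 4 share the seeds' cluster, so they rank first.
idx, sims = similar_to_seeds(embeddings, seed_idx=[0, 1, 2], top_k=2)
```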
Use similarity search to find entities similar to a seed set.
Caching Strategy
Persist embeddings with metadata for reproducibility.
Invalidate the cache and re-embed when:
- A new GFM or RFM snapshot is promoted (webhook notification)
- A downstream model retrains and feeds signal back into your RFM
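The caching strategy above can be sketched as a small on-disk record: the embedding matrix saved next to a JSON metadata file that identifies the snapshot that produced it. The file layout, the field names such as `snapshot_id`, and the staleness check are all hypothetical; how you learn the current snapshot identifier (for example from the webhook payload) is outside this sketch.

```python
import json
import tempfile
from pathlib import Path

import numpy as np

def save_embeddings(cache_dir, embeddings, entity_ids, snapshot_id):
    """Write the embedding matrix plus the metadata needed to reproduce it."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    np.save(cache_dir / "embeddings.npy", embeddings)
    meta = {
        "snapshot_id": snapshot_id,  # hypothetical identifier of the model snapshot
        "dim": int(embeddings.shape[1]),
        "entity_ids": list(entity_ids),
    }
    (cache_dir / "meta.json").write_text(json.dumps(meta))

def load_embeddings(cache_dir, current_snapshot_id):
    """Return (embeddings, entity_ids), or None if the cache is missing or stale."""
    cache_dir = Path(cache_dir)
    meta_path = cache_dir / "meta.json"
    if not meta_path.exists():
        return None
    meta = json.loads(meta_path.read_text())
    if meta["snapshot_id"] != current_snapshot_id:
        return None  # a different snapshot is current, so re-embed
    return np.load(cache_dir / "embeddings.npy"), meta["entity_ids"]

with tempfile.TemporaryDirectory() as tmp:
    emb = np.random.default_rng(0).normal(size=(3, 1024)).astype(np.float32)
    save_embeddings(tmp, emb, ["a", "b", "c"], snapshot_id="snapshot-001")
    hit = load_embeddings(tmp, "snapshot-001")   # cache hit
    miss = load_embeddings(tmp, "snapshot-002")  # stale cache, returns None
```

Storing the full-width matrix once is enough, since any smaller dimension can be obtained by slicing the cached array.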