tutorial 2025-03-11 13 min read

Embeddings: The Data Structures Powering Modern AI

Understand embeddings from first principles. Learn how vectors represent meaning, how similarity search works, and how to use embeddings in production ML systems.


What Is an Embedding?

An embedding is a mapping from a discrete object whose natural representation is high-dimensional (a word, a sentence, a user, a product) to a dense, low-dimensional vector of real numbers.

"cat"  → [0.23, -0.81, 0.44, 0.12, ...]   # 384 floats
"dog"  → [0.21, -0.79, 0.41, 0.08, ...]   # similar vector
"car"  → [-0.55, 0.33, -0.12, 0.67, ...]  # different vector

The key property: similar things have similar vectors. This transforms the problem of semantic similarity into geometric distance — something computers can compute efficiently.

Why Vectors Work for Semantics

Before embeddings, the dominant representation was one-hot encoding:

vocabulary = ["cat", "dog", "car", "boat"]

# One-hot: sparse, no relationships captured
cat = [1, 0, 0, 0]
dog = [0, 1, 0, 0]
# cos_similarity(cat, dog) = 0  — same as cat vs. car!

One-hot vectors can't express "cat and dog are more similar to each other than to car." Embeddings learn these relationships from data.
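The zero-similarity problem is easy to verify numerically. A minimal NumPy check on the vectors above:

```python
import numpy as np

def cos(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

cat = np.array([1, 0, 0, 0])
dog = np.array([0, 1, 0, 0])
car = np.array([0, 0, 1, 0])

print(cos(cat, dog))  # 0.0: orthogonal
print(cos(cat, car))  # 0.0: also orthogonal, so no notion of relatedness
```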

How Embeddings Are Learned: Word2Vec Intuition

The original insight: words that appear in similar contexts have similar meanings. Word2Vec operationalizes this:

# Training objective: predict surrounding words
# Given "the cat sat on the ___"
# Predict: "mat" (positive), not "democracy" (negative)

# This forces "cat" and "dog" embeddings to be similar
# because both appear in similar contexts
# ("the ___ sat", "pet ___", "feed the ___")
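The objective sketched above can be made concrete with a toy skip-gram trainer with negative sampling, the core of Word2Vec. This is a didactic sketch on a made-up twelve-word corpus, not a production trainer (real implementations use frequency-based negative sampling, subsampling, and much larger windows and corpora):

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, D, lr = len(vocab), 8, 0.05

W_in = rng.normal(0, 0.1, (V, D))   # target-word embeddings
W_out = rng.normal(0, 0.1, (V, D))  # context-word embeddings

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

for _ in range(200):  # epochs over the tiny corpus
    for pos, word in enumerate(corpus):
        t = idx[word]
        # Context window: up to 2 words on each side
        window = corpus[max(0, pos - 2):pos] + corpus[pos + 1:pos + 3]
        for ctx_word in window:
            c = idx[ctx_word]
            # Positive pair: push sigmoid(score) toward 1
            g = sigmoid(W_in[t] @ W_out[c]) - 1.0
            W_in[t], W_out[c] = W_in[t] - lr * g * W_out[c], W_out[c] - lr * g * W_in[t]
            # Random negative samples: push sigmoid(score) toward 0
            for n in rng.integers(0, V, size=3):
                g = sigmoid(W_in[t] @ W_out[n])
                W_in[t], W_out[n] = W_in[t] - lr * g * W_out[n], W_out[n] - lr * g * W_in[t]

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# "cat" and "dog" share contexts ("the _ sat on"), so they should drift together
print(cos(W_in[idx["cat"]], W_in[idx["dog"]]))
```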

Modern sentence/text embeddings use transformer models (BERT, RoBERTa, E5) trained on much larger datasets with more sophisticated objectives.

Computing Similarity

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Cosine similarity: the cosine of the angle between two vectors
# (direction only; magnitude is ignored)
def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Real example with sentence-transformers
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "The cat sat on the mat",
    "A feline rested on the rug",     # Semantically similar
    "Machine learning is fascinating",  # Different topic
]

embeddings = model.encode(sentences)  # Shape: (3, 384)

# Similarity matrix
sim_matrix = cosine_similarity(embeddings)
print(sim_matrix[0, 1])  # 0.83 — very similar
print(sim_matrix[0, 2])  # 0.12 — very different

Vector Search: Finding Similar Items at Scale

For small datasets, brute-force search over all vectors works:

def find_similar(query_embedding, corpus_embeddings, top_k=5):
    similarities = cosine_similarity([query_embedding], corpus_embeddings)[0]
    top_indices = np.argsort(similarities)[::-1][:top_k]
    return top_indices, similarities[top_indices]
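As a sanity check, a NumPy-only variant of find_similar can be exercised on synthetic vectors, no model or sklearn required (the corpus here is random data with one planted near-duplicate):

```python
import numpy as np

def find_similar(query_embedding, corpus_embeddings, top_k=5):
    # After L2 normalization, cosine similarity reduces to a dot product
    q = query_embedding / np.linalg.norm(query_embedding)
    c = corpus_embeddings / np.linalg.norm(corpus_embeddings, axis=1, keepdims=True)
    similarities = c @ q
    top_indices = np.argsort(similarities)[::-1][:top_k]
    return top_indices, similarities[top_indices]

rng = np.random.default_rng(42)
corpus = rng.normal(size=(1000, 384))
corpus[7] = corpus[0] + rng.normal(scale=0.01, size=384)  # plant a near-duplicate of item 0

indices, scores = find_similar(corpus[0], corpus, top_k=3)
print(indices[:2])  # item 0 itself ranks first, its near-duplicate 7 second
```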

For large datasets (millions of vectors and up), brute-force search becomes too slow, and approximate nearest neighbor (ANN) indexes take over. FAISS offers both exact and approximate indexes behind the same API; the flat index below is exact:

import faiss
import numpy as np

# Build an index: inner product equals cosine similarity
# once the vectors are L2-normalized
dimension = 384
index = faiss.IndexFlatIP(dimension)

corpus_embeddings = model.encode(corpus_texts).astype("float32")
faiss.normalize_L2(corpus_embeddings)  # in-place L2 normalization
index.add(corpus_embeddings)

# Search
query = model.encode(["similar cats"]).astype("float32")
faiss.normalize_L2(query)
distances, indices = index.search(query, k=5)
# Returns the top-5 most similar items; when exact search gets too slow,
# swap in an ANN index such as IndexIVFFlat or IndexHNSWFlat

FAISS (Facebook AI Similarity Search) supports billion-scale vector search. Other options: Pinecone, Weaviate, pgvector (PostgreSQL extension), Qdrant.
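The core idea behind inverted-file (IVF) ANN indexes can be sketched in plain NumPy: partition vectors into coarse cells, then search only the few cells nearest the query. This is an illustrative sketch, not FAISS's actual implementation (real IVF trains centroids with k-means rather than sampling them):

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = rng.normal(size=(5000, 64)).astype(np.float32)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

# Coarse quantizer: sample corpus vectors as centroids (k-means in real IVF)
n_cells = 32
centroids = corpus[rng.choice(len(corpus), n_cells, replace=False)]
assignments = np.argmax(corpus @ centroids.T, axis=1)  # nearest centroid per vector
inverted_lists = {c: np.where(assignments == c)[0] for c in range(n_cells)}

def ann_search(query, n_probe=4, top_k=5):
    # Probe only the n_probe cells whose centroids best match the query
    cells = np.argsort(query @ centroids.T)[::-1][:n_probe]
    candidates = np.concatenate([inverted_lists[int(c)] for c in cells])
    sims = corpus[candidates] @ query
    order = np.argsort(sims)[::-1][:top_k]
    return candidates[order], sims[order]

query = corpus[123]  # query with a known exact match in the corpus
indices, scores = ann_search(query)
print(indices[0], round(float(scores[0]), 3))
```

Probing 4 of 32 cells scans roughly an eighth of the corpus; the trade-off (speed vs. recall of true neighbors) is exactly the one FAISS's nprobe parameter controls.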

Embeddings in Production: Common Patterns

Semantic Search

# Index all documents at index time
docs = load_documents()
doc_embeddings = model.encode([d.text for d in docs])

# At query time
def semantic_search(query: str, top_k: int = 10):
    query_embedding = model.encode([query])[0]
    indices, scores = find_similar(query_embedding, doc_embeddings, top_k)
    return [docs[i] for i in indices], scores

Recommendation via User/Item Embeddings

# Two-tower model: separate encoders for users and items
import torch.nn as nn
import torch.nn.functional as F

class TwoTowerModel(nn.Module):
    def __init__(self, user_feature_dim: int, item_feature_dim: int, embed_dim: int = 128):
        super().__init__()
        self.user_encoder = nn.Sequential(
            nn.Linear(user_feature_dim, 256),
            nn.ReLU(),
            nn.Linear(256, embed_dim),
        )
        self.item_encoder = nn.Sequential(
            nn.Linear(item_feature_dim, 256),
            nn.ReLU(),
            nn.Linear(256, embed_dim),
        )

    def forward(self, user_features, item_features):
        user_emb = F.normalize(self.user_encoder(user_features), dim=-1)
        item_emb = F.normalize(self.item_encoder(item_features), dim=-1)
        return (user_emb * item_emb).sum(dim=-1)  # Dot product = cosine (both normalized)

Train this model, then index all item embeddings in FAISS. At serving time, encode the user and retrieve top-k nearest item embeddings in milliseconds.
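A common training objective for two-tower models is a sampled softmax with in-batch negatives: each user's positive item is scored against every other item in the batch. The loss computation can be sketched in NumPy with synthetic embeddings standing in for the encoder outputs (the temperature value is a made-up hyperparameter):

```python
import numpy as np

rng = np.random.default_rng(0)
batch, dim, temperature = 8, 32, 0.05

# Stand-ins for encoder outputs; item i is user i's positive
user_emb = rng.normal(size=(batch, dim))
item_emb = rng.normal(size=(batch, dim))
user_emb /= np.linalg.norm(user_emb, axis=1, keepdims=True)  # L2-normalize
item_emb /= np.linalg.norm(item_emb, axis=1, keepdims=True)

# Every other item in the batch serves as a negative
logits = (user_emb @ item_emb.T) / temperature  # shape (batch, batch)

# Row-wise cross-entropy where the correct "class" for row i is column i
m = logits.max(axis=1, keepdims=True)  # subtract max for numerical stability
log_probs = logits - (m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True)))
loss = -np.mean(np.diag(log_probs))
print(round(float(loss), 3))
```

In-batch negatives are popular because they reuse the batch's own items as negatives, avoiding a separate negative-sampling pass.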

Duplicate Detection

def find_duplicates(texts: list[str], threshold: float = 0.92) -> list[tuple]:
    embeddings = model.encode(texts)
    sim_matrix = cosine_similarity(embeddings)

    duplicates = []
    for i in range(len(texts)):
        for j in range(i+1, len(texts)):
            if sim_matrix[i, j] > threshold:
                duplicates.append((i, j, sim_matrix[i, j]))
    return duplicates

Clustering

from sklearn.cluster import KMeans

embeddings = model.encode(texts)
kmeans = KMeans(n_clusters=20, random_state=42)
cluster_labels = kmeans.fit_predict(embeddings)

# Inspect clusters
for cluster_id in range(20):
    cluster_texts = [texts[i] for i, l in enumerate(cluster_labels) if l == cluster_id]
    print(f"Cluster {cluster_id}: {cluster_texts[:3]}")

Choosing an Embedding Model

Model                   Size    Dimension  Best for
all-MiniLM-L6-v2        80 MB   384        Fast, good general purpose
all-mpnet-base-v2       420 MB  768        Better quality, slower
text-embedding-3-small  API     1536       OpenAI-hosted, no self-hosting needed
E5-large-v2             1.3 GB  1024       High quality, retrieval-focused
BGE-M3                  2.2 GB  1024       Multilingual, strong retrieval quality

For most production use cases, start with all-MiniLM-L6-v2 (fast, small, good enough). Upgrade if your evaluation metrics show it's the bottleneck.

The Embedding Space Has Structure

Embeddings encode more than similarity — they encode analogical relationships:

king - man + woman ≈ queen
Paris - France + Germany ≈ Berlin

This arithmetic works because the training objective implicitly encodes these relationships as roughly linear directions in the vector space. It holds most cleanly for classic word embeddings such as Word2Vec and GloVe, and only approximately even there. Still, it captures why embeddings are powerful: they learn structure from data without explicit supervision.
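With hand-picked 2-D toy vectors, the arithmetic can be made concrete. These coordinates are invented for illustration (one axis loosely tracks "royalty", the other "gender"); real learned embeddings satisfy this only approximately:

```python
import numpy as np

# Invented 2-D "embeddings" for illustration only
vecs = {
    "man":    np.array([1.0, 0.0]),
    "woman":  np.array([1.0, 1.0]),
    "king":   np.array([2.0, 0.0]),
    "queen":  np.array([2.0, 1.0]),
    "prince": np.array([1.8, 0.1]),
    "car":    np.array([-0.5, 0.3]),
}

target = vecs["king"] - vecs["man"] + vecs["woman"]  # = [2.0, 1.0]

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Nearest word to the target, excluding the three input words
candidates = {w: v for w, v in vecs.items() if w not in {"king", "man", "woman"}}
best = max(candidates, key=lambda w: cos(target, candidates[w]))
print(best)  # queen
```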


Next: see how embeddings power retrieval at scale in production systems in our guide to RAG systems.

Want to Go Deeper?

This article is part of our comprehensive curriculum on building ML systems at scale. Explore our full courses for hands-on learning.