tutorial 2025-03-11 13 min read

Embeddings: The Data Structures Powering Modern AI

Understand embeddings from first principles. Learn how vectors represent meaning, how similarity search works, and how to use embeddings in production ML systems.


What Is an Embedding?

An embedding is a mapping from a discrete object whose natural representation is high-dimensional (a word, a sentence, a user, a product) to a dense, low-dimensional vector of real numbers.

"cat"  → [0.23, -0.81, 0.44, 0.12, ...]   # 384 floats
"dog"  → [0.21, -0.79, 0.41, 0.08, ...]   # similar vector
"car"  → [-0.55, 0.33, -0.12, 0.67, ...]  # different vector

The key property: similar things have similar vectors. This transforms the problem of semantic similarity into geometric distance — something computers can compute efficiently.

Why Vectors Work for Semantics

Before embeddings, the dominant representation was one-hot encoding:

vocabulary = ["cat", "dog", "car", "boat"]

# One-hot: sparse, no relationships captured
cat = [1, 0, 0, 0]
dog = [0, 1, 0, 0]
# cos_similarity(cat, dog) = 0  — same as cat vs. car!

One-hot vectors can't express "cat and dog are more similar to each other than to car." Embeddings learn these relationships from data.
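The zero-similarity problem is easy to verify numerically. A minimal NumPy check on the vectors above:

```python
import numpy as np

def cos(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

cat = np.array([1, 0, 0, 0])
dog = np.array([0, 1, 0, 0])
car = np.array([0, 0, 1, 0])

print(cos(cat, dog))  # 0.0: orthogonal
print(cos(cat, car))  # 0.0: also orthogonal, so no notion of relatedness
```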

How Embeddings Are Learned: Word2Vec Intuition

The original insight: words that appear in similar contexts have similar meanings. Word2Vec operationalizes this:

# Training objective: predict surrounding words
# Given "the cat sat on the ___"
# Predict: "mat" (positive), not "democracy" (negative)

# This forces "cat" and "dog" embeddings to be similar
# because both appear in similar contexts
# ("the ___ sat", "pet ___", "feed the ___")
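The objective sketched above can be made concrete with a toy skip-gram trainer with negative sampling, the core of Word2Vec. This is a didactic sketch on a made-up twelve-word corpus, not a production trainer (real implementations use frequency-based negative sampling, subsampling, and much larger windows and corpora):

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, D, lr = len(vocab), 8, 0.05

W_in = rng.normal(0, 0.1, (V, D))   # target-word embeddings
W_out = rng.normal(0, 0.1, (V, D))  # context-word embeddings

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

for _ in range(200):  # epochs over the tiny corpus
    for pos, word in enumerate(corpus):
        t = idx[word]
        # Context window: up to 2 words on each side
        window = corpus[max(0, pos - 2):pos] + corpus[pos + 1:pos + 3]
        for ctx_word in window:
            c = idx[ctx_word]
            # Positive pair: push sigmoid(score) toward 1
            g = sigmoid(W_in[t] @ W_out[c]) - 1.0
            W_in[t], W_out[c] = W_in[t] - lr * g * W_out[c], W_out[c] - lr * g * W_in[t]
            # Random negative samples: push sigmoid(score) toward 0
            for n in rng.integers(0, V, size=3):
                g = sigmoid(W_in[t] @ W_out[n])
                W_in[t], W_out[n] = W_in[t] - lr * g * W_out[n], W_out[n] - lr * g * W_in[t]

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# "cat" and "dog" share contexts ("the _ sat on"), so they should drift together
print(cos(W_in[idx["cat"]], W_in[idx["dog"]]))
```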

Modern sentence/text embeddings use transformer models (BERT, RoBERTa, E5) trained on much larger datasets with more sophisticated objectives.

Computing Similarity

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Cosine similarity: the cosine of the angle between two vectors
# (direction only; magnitude is ignored)
def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Real example with sentence-transformers
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "The cat sat on the mat",
    "A feline rested on the rug",     # Semantically similar
    "Machine learning is fascinating",  # Different topic
]

embeddings = model.encode(sentences)  # Shape: (3, 384)

# Similarity matrix
sim_matrix = cosine_similarity(embeddings)
print(sim_matrix[0, 1])  # 0.83 — very similar
print(sim_matrix[0, 2])  # 0.12 — very different

Vector Search: Finding Similar Items at Scale

For small datasets, brute-force search over all vectors works:

def find_similar(query_embedding, corpus_embeddings, top_k=5):
    similarities = cosine_similarity([query_embedding], corpus_embeddings)[0]
    top_indices = np.argsort(similarities)[::-1][:top_k]
    return top_indices, similarities[top_indices]
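As a sanity check, a NumPy-only variant of find_similar can be exercised on synthetic vectors, no model or sklearn required (the corpus here is random data with one planted near-duplicate):

```python
import numpy as np

def find_similar(query_embedding, corpus_embeddings, top_k=5):
    # After L2 normalization, cosine similarity reduces to a dot product
    q = query_embedding / np.linalg.norm(query_embedding)
    c = corpus_embeddings / np.linalg.norm(corpus_embeddings, axis=1, keepdims=True)
    similarities = c @ q
    top_indices = np.argsort(similarities)[::-1][:top_k]
    return top_indices, similarities[top_indices]

rng = np.random.default_rng(42)
corpus = rng.normal(size=(1000, 384))
corpus[7] = corpus[0] + rng.normal(scale=0.01, size=384)  # plant a near-duplicate of item 0

indices, scores = find_similar(corpus[0], corpus, top_k=3)
print(indices[:2])  # item 0 itself ranks first, its near-duplicate 7 second
```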

For large datasets (millions of vectors and up), brute-force search becomes too slow, and approximate nearest neighbor (ANN) indexes take over. FAISS offers both exact and approximate indexes behind the same API; the flat index below is exact:

import faiss
import numpy as np

# Build an index: inner product equals cosine similarity
# once the vectors are L2-normalized
dimension = 384
index = faiss.IndexFlatIP(dimension)

corpus_embeddings = model.encode(corpus_texts).astype("float32")
faiss.normalize_L2(corpus_embeddings)  # in-place L2 normalization
index.add(corpus_embeddings)

# Search
query = model.encode(["similar cats"]).astype("float32")
faiss.normalize_L2(query)
distances, indices = index.search(query, k=5)
# Returns the top-5 most similar items; when exact search gets too slow,
# swap in an ANN index such as IndexIVFFlat or IndexHNSWFlat

FAISS (Facebook AI Similarity Search) supports billion-scale vector search. Other options: Pinecone, Weaviate, pgvector (PostgreSQL extension), Qdrant.
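The core idea behind inverted-file (IVF) ANN indexes can be sketched in plain NumPy: partition vectors into coarse cells, then search only the few cells nearest the query. This is an illustrative sketch, not FAISS's actual implementation (real IVF trains centroids with k-means rather than sampling them):

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = rng.normal(size=(5000, 64)).astype(np.float32)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

# Coarse quantizer: sample corpus vectors as centroids (k-means in real IVF)
n_cells = 32
centroids = corpus[rng.choice(len(corpus), n_cells, replace=False)]
assignments = np.argmax(corpus @ centroids.T, axis=1)  # nearest centroid per vector
inverted_lists = {c: np.where(assignments == c)[0] for c in range(n_cells)}

def ann_search(query, n_probe=4, top_k=5):
    # Probe only the n_probe cells whose centroids best match the query
    cells = np.argsort(query @ centroids.T)[::-1][:n_probe]
    candidates = np.concatenate([inverted_lists[int(c)] for c in cells])
    sims = corpus[candidates] @ query
    order = np.argsort(sims)[::-1][:top_k]
    return candidates[order], sims[order]

query = corpus[123]  # query with a known exact match in the corpus
indices, scores = ann_search(query)
print(indices[0], round(float(scores[0]), 3))
```

Probing 4 of 32 cells scans roughly an eighth of the corpus; the trade-off (speed vs. recall of true neighbors) is exactly the one FAISS's nprobe parameter controls.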

Embeddings in Production: Common Patterns

Semantic Search

# Index all documents at index time
docs = load_documents()
doc_embeddings = model.encode([d.text for d in docs])

# At query time
def semantic_search(query: str, top_k: int = 10):
    query_embedding = model.encode([query])[0]
    indices, scores = find_similar(query_embedding, doc_embeddings, top_k)
    return [docs[i] for i in indices], scores

Recommendation via User/Item Embeddings

# Two-tower model: separate encoders for users and items
import torch.nn as nn
import torch.nn.functional as F

class TwoTowerModel(nn.Module):
    def __init__(self, user_feature_dim: int, item_feature_dim: int, embed_dim: int = 128):
        super().__init__()
        self.user_encoder = nn.Sequential(
            nn.Linear(user_feature_dim, 256),
            nn.ReLU(),
            nn.Linear(256, embed_dim),
        )
        self.item_encoder = nn.Sequential(
            nn.Linear(item_feature_dim, 256),
            nn.ReLU(),
            nn.Linear(256, embed_dim),
        )

    def forward(self, user_features, item_features):
        user_emb = F.normalize(self.user_encoder(user_features), dim=-1)
        item_emb = F.normalize(self.item_encoder(item_features), dim=-1)
        return (user_emb * item_emb).sum(dim=-1)  # Dot product = cosine (both normalized)

Train this model, then index all item embeddings in FAISS. At serving time, encode the user and retrieve top-k nearest item embeddings in milliseconds.
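A common training objective for two-tower models is a sampled softmax with in-batch negatives: each user's positive item is scored against every other item in the batch. The loss computation can be sketched in NumPy with synthetic embeddings standing in for the encoder outputs (the temperature value is a made-up hyperparameter):

```python
import numpy as np

rng = np.random.default_rng(0)
batch, dim, temperature = 8, 32, 0.05

# Stand-ins for encoder outputs; item i is user i's positive
user_emb = rng.normal(size=(batch, dim))
item_emb = rng.normal(size=(batch, dim))
user_emb /= np.linalg.norm(user_emb, axis=1, keepdims=True)  # L2-normalize
item_emb /= np.linalg.norm(item_emb, axis=1, keepdims=True)

# Every other item in the batch serves as a negative
logits = (user_emb @ item_emb.T) / temperature  # shape (batch, batch)

# Row-wise cross-entropy where the correct "class" for row i is column i
m = logits.max(axis=1, keepdims=True)  # subtract max for numerical stability
log_probs = logits - (m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True)))
loss = -np.mean(np.diag(log_probs))
print(round(float(loss), 3))
```

In-batch negatives are popular because they reuse the batch's own items as negatives, avoiding a separate negative-sampling pass.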

Duplicate Detection

def find_duplicates(texts: list[str], threshold: float = 0.92) -> list[tuple]:
    embeddings = model.encode(texts)
    sim_matrix = cosine_similarity(embeddings)

    duplicates = []
    for i in range(len(texts)):
        for j in range(i+1, len(texts)):
            if sim_matrix[i, j] > threshold:
                duplicates.append((i, j, sim_matrix[i, j]))
    return duplicates

Clustering

from sklearn.cluster import KMeans

embeddings = model.encode(texts)
kmeans = KMeans(n_clusters=20, random_state=42)
cluster_labels = kmeans.fit_predict(embeddings)

# Inspect clusters
for cluster_id in range(20):
    cluster_texts = [texts[i] for i, l in enumerate(cluster_labels) if l == cluster_id]
    print(f"Cluster {cluster_id}: {cluster_texts[:3]}")

Choosing an Embedding Model

Model                   Size    Dimension  Best for
all-MiniLM-L6-v2        80 MB   384        Fast, good general purpose
all-mpnet-base-v2       420 MB  768        Better quality, slower
text-embedding-3-small  API     1536       OpenAI-hosted, no self-hosting needed
E5-large-v2             1.3 GB  1024       High quality, retrieval-focused
BGE-M3                  2.2 GB  1024       Multilingual, strong retrieval quality

For most production use cases, start with all-MiniLM-L6-v2 (fast, small, good enough). Upgrade if your evaluation metrics show it's the bottleneck.

The Embedding Space Has Structure

Embeddings encode more than similarity — they encode analogical relationships:

king - man + woman ≈ queen
Paris - France + Germany ≈ Berlin

This arithmetic works because the training objective implicitly encodes these relationships as roughly linear directions in the vector space. It holds most cleanly for classic word embeddings such as Word2Vec and GloVe, and only approximately even there. Still, it captures why embeddings are powerful: they learn structure from data without explicit supervision.
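With hand-picked 2-D toy vectors, the arithmetic can be made concrete. These coordinates are invented for illustration (one axis loosely tracks "royalty", the other "gender"); real learned embeddings satisfy this only approximately:

```python
import numpy as np

# Invented 2-D "embeddings" for illustration only
vecs = {
    "man":    np.array([1.0, 0.0]),
    "woman":  np.array([1.0, 1.0]),
    "king":   np.array([2.0, 0.0]),
    "queen":  np.array([2.0, 1.0]),
    "prince": np.array([1.8, 0.1]),
    "car":    np.array([-0.5, 0.3]),
}

target = vecs["king"] - vecs["man"] + vecs["woman"]  # = [2.0, 1.0]

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Nearest word to the target, excluding the three input words
candidates = {w: v for w, v in vecs.items() if w not in {"king", "man", "woman"}}
best = max(candidates, key=lambda w: cos(target, candidates[w]))
print(best)  # queen
```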


Next: see how embeddings power retrieval at scale in production systems in our guide to RAG systems.

Want to Go Deeper?

This article is part of our comprehensive curriculum on building ML systems at scale. Explore our full courses for hands-on learning.