What Is an Embedding?
An embedding is a mapping from a discrete, high-dimensional object (a word, a sentence, a user, a product) to a dense, low-dimensional vector of real numbers.
"cat" → [0.23, -0.81, 0.44, 0.12, ...] # 384 floats
"dog" → [0.21, -0.79, 0.41, 0.08, ...] # similar vector
"car" → [-0.55, 0.33, -0.12, 0.67, ...] # different vector
The key property: similar things have similar vectors. This transforms the problem of semantic similarity into geometric distance — something computers can compute efficiently.
Why Vectors Work for Semantics
Before embeddings, the dominant representation was one-hot encoding:
vocabulary = ["cat", "dog", "car", "boat"]
# One-hot: sparse, no relationships captured
cat = [1, 0, 0, 0]
dog = [0, 1, 0, 0]
# cos_similarity(cat, dog) = 0 — same as cat vs. car!
One-hot vectors can't express "cat and dog are more similar to each other than to car." Embeddings learn these relationships from data.
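The zero-similarity problem is easy to verify directly. A quick NumPy check on the one-hot vectors above:

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of the norms
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

cat = np.array([1, 0, 0, 0])
dog = np.array([0, 1, 0, 0])
car = np.array([0, 0, 1, 0])

print(cosine(cat, dog))  # 0.0: orthogonal vectors
print(cosine(cat, car))  # 0.0: same score for an unrelated pair
```

Every pair of distinct one-hot vectors is orthogonal, so cosine similarity is 0 regardless of meaning.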
How Embeddings Are Learned: Word2Vec Intuition
The original insight: words that appear in similar contexts have similar meanings. Word2Vec operationalizes this:
# Training objective: predict surrounding words
# Given "the cat sat on the ___"
# Predict: "mat" (positive), not "democracy" (negative)
# This forces "cat" and "dog" embeddings to be similar
# because both appear in similar contexts
# ("the ___ sat", "pet ___", "feed the ___")
Modern sentence/text embeddings use transformer models (BERT, RoBERTa, E5) trained on much larger datasets with more sophisticated objectives.
Computing Similarity
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
def embed(text: str) -> np.ndarray:
    """Placeholder — use real model below."""
    pass
# Cosine similarity: angle between vectors (ignores magnitude)
def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
# Real example with sentence-transformers
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [
    "The cat sat on the mat",
    "A feline rested on the rug",       # Semantically similar
    "Machine learning is fascinating",  # Different topic
]
embeddings = model.encode(sentences) # Shape: (3, 384)
# Similarity matrix
sim_matrix = cosine_similarity(embeddings)
print(sim_matrix[0, 1]) # 0.83 — very similar
print(sim_matrix[0, 2]) # 0.12 — very different
Vector Search: Finding Similar Items at Scale
For small datasets, brute-force search over all vectors works:
def find_similar(query_embedding, corpus_embeddings, top_k=5):
    similarities = cosine_similarity([query_embedding], corpus_embeddings)[0]
    top_indices = np.argsort(similarities)[::-1][:top_k]
    return top_indices, similarities[top_indices]
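A NumPy-only variant of this brute-force search (same logic, no scikit-learn dependency), exercised on toy 2-D vectors so the ranking is easy to follow:

```python
import numpy as np

def find_similar_np(query, corpus, top_k=5):
    """NumPy-only cosine search: normalize, dot, sort descending, slice."""
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    similarities = c @ q  # cosine similarity via normalized dot products
    top_indices = np.argsort(similarities)[::-1][:top_k]
    return top_indices, similarities[top_indices]

corpus = np.array([
    [1.0, 0.0],   # aligned with the query
    [0.9, 0.1],   # nearly aligned
    [0.0, 1.0],   # orthogonal
])
indices, scores = find_similar_np(np.array([1.0, 0.0]), corpus, top_k=2)
print(indices)  # [0 1]
```

This is O(n) per query in the corpus size, which is exactly why ANN indexes become necessary at scale.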
For large datasets (millions of vectors or more), approximate nearest neighbor (ANN) libraries are required:
import faiss
import numpy as np
# Build an index (IndexFlatIP is exact brute force; swap in IndexIVFFlat
# or IndexHNSWFlat for true approximate search at scale)
dimension = 384
index = faiss.IndexFlatIP(dimension)  # Inner product (= cosine if normalized)
# Normalize embeddings so inner product equals cosine similarity
corpus_embeddings = model.encode(corpus_texts).astype(np.float32)
faiss.normalize_L2(corpus_embeddings)  # in-place L2 normalization
index.add(corpus_embeddings)
# Search
query = model.encode(["similar cats"]).astype(np.float32)
faiss.normalize_L2(query)
distances, indices = index.search(query, 5)
# Returns the top-5 most similar items in milliseconds
FAISS (Facebook AI Similarity Search) supports billion-scale vector search. Other options: Pinecone, Weaviate, pgvector (PostgreSQL extension), Qdrant.
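The "inner product = cosine if normalized" comment is worth verifying: cosine similarity divides the dot product by the two norms, so once both vectors have unit length the division is a no-op. A quick NumPy check on random vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=384), rng.normal(size=384)

# Cosine similarity computed directly
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Plain inner product after L2 normalization
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
inner = np.dot(a_unit, b_unit)

print(np.isclose(cosine, inner))  # True
```

This equivalence is why a pure inner-product index like IndexFlatIP can serve cosine-similarity queries.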
Embeddings in Production: Common Patterns
Semantic Search
# Embed all documents at index time
docs = load_documents()
doc_embeddings = model.encode([d.text for d in docs])

# At query time
def semantic_search(query: str, top_k: int = 10):
    query_embedding = model.encode([query])[0]
    indices, scores = find_similar(query_embedding, doc_embeddings, top_k)
    return [docs[i] for i in indices], scores
Recommendation via User/Item Embeddings
# Two-tower model: separate encoders for users and items
import torch.nn.functional as F
from torch import nn

class TwoTowerModel(nn.Module):
    def __init__(self, user_feature_dim: int, item_feature_dim: int):
        super().__init__()
        self.user_encoder = nn.Sequential(
            nn.Linear(user_feature_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 128),
        )
        self.item_encoder = nn.Sequential(
            nn.Linear(item_feature_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 128),
        )

    def forward(self, user_features, item_features):
        user_emb = F.normalize(self.user_encoder(user_features), dim=-1)
        item_emb = F.normalize(self.item_encoder(item_features), dim=-1)
        return (user_emb * item_emb).sum(dim=-1)  # Dot product similarity
Train this model, then index all item embeddings in FAISS. At serving time, encode the user and retrieve top-k nearest item embeddings in milliseconds.
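Serving-time retrieval then reduces to a normalized matrix-vector product plus a top-k selection. A NumPy sketch with illustrative shapes and random stand-in embeddings (a real system would use the trained encoders and a FAISS index):

```python
import numpy as np

rng = np.random.default_rng(42)

# Pretend these came from the trained item encoder, L2-normalized
item_embs = rng.normal(size=(10_000, 128))
item_embs /= np.linalg.norm(item_embs, axis=1, keepdims=True)

# Pretend this came from the trained user encoder
user_emb = rng.normal(size=128)
user_emb /= np.linalg.norm(user_emb)

scores = item_embs @ user_emb          # dot product = cosine (both unit length)
top_k = np.argsort(scores)[::-1][:5]   # indices of the 5 best-matching items
```

The matrix product is one BLAS call, which is why even the brute-force version of this step is fast for modest catalogs.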
Duplicate Detection
def find_duplicates(texts: list[str], threshold: float = 0.92) -> list[tuple]:
    embeddings = model.encode(texts)
    sim_matrix = cosine_similarity(embeddings)
    duplicates = []
    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            if sim_matrix[i, j] > threshold:
                duplicates.append((i, j, sim_matrix[i, j]))
    return duplicates
Clustering
from sklearn.cluster import KMeans

embeddings = model.encode(texts)
kmeans = KMeans(n_clusters=20, random_state=42)
cluster_labels = kmeans.fit_predict(embeddings)

# Inspect clusters
for cluster_id in range(20):
    cluster_texts = [texts[i] for i, l in enumerate(cluster_labels) if l == cluster_id]
    print(f"Cluster {cluster_id}: {cluster_texts[:3]}")
Choosing an Embedding Model
| Model | Size | Dimension | Best for |
|---|---|---|---|
| all-MiniLM-L6-v2 | 80MB | 384 | Fast, good general purpose |
| all-mpnet-base-v2 | 420MB | 768 | Better quality, slower |
| text-embedding-3-small | API | 1536 | OpenAI, no hosting needed |
| E5-large-v2 | 1.3GB | 1024 | High quality, retrieval-focused |
| BGE-M3 | 2.2GB | 1024 | Multilingual, state-of-the-art |
For most production use cases, start with all-MiniLM-L6-v2 (fast, small, good enough). Upgrade if your evaluation metrics show it's the bottleneck.
The Embedding Space Has Structure
Embeddings encode more than similarity — they encode analogical relationships:
king - man + woman ≈ queen
Paris - France + Germany ≈ Berlin
This arithmetic works because the training objective implicitly encodes these relationships (the classic demonstrations come from word-vector models like Word2Vec and GloVe; sentence embeddings reproduce them less reliably). It's why embeddings are powerful: they learn structure from data without explicit supervision.
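The analogy can be made concrete with toy 2-D vectors, hand-picked so the arithmetic is exact (real learned embeddings only approximate this, and the query words are usually excluded from the nearest-neighbor search):

```python
import numpy as np

# Toy "embeddings": axis 0 encodes gender, axis 1 encodes royalty.
vecs = {
    "king":  np.array([ 1.0, 1.0]),
    "man":   np.array([ 1.0, 0.0]),
    "woman": np.array([-1.0, 0.0]),
    "queen": np.array([-1.0, 1.0]),
}

result = vecs["king"] - vecs["man"] + vecs["woman"]  # -> [-1, 1]
nearest = min(vecs, key=lambda w: np.linalg.norm(vecs[w] - result))
print(nearest)  # queen
```

Subtracting "man" removes the male direction, adding "woman" adds the female direction, and the royalty component carries through unchanged.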
Next: see how embeddings power retrieval at scale in our guide to RAG (retrieval-augmented generation) systems.