tutorial 2025-04-01 14 min read

Vector Databases in Production: Choosing and Scaling Pinecone, Weaviate, and pgvector

A practical guide to selecting and operating vector databases at scale. Compare Pinecone, Weaviate, Qdrant, and pgvector for production RAG and semantic search systems.

vector database Pinecone Weaviate pgvector RAG semantic search embeddings

Why Vector Databases Matter

As LLM-powered applications move to production, vector databases have become the backbone of retrieval-augmented generation (RAG), semantic search, and recommendation systems. Choosing the wrong one—or misconfiguring the right one—can turn a working prototype into an unscalable mess.

This guide covers the four options you'll actually encounter in production: Pinecone, Weaviate, Qdrant, and pgvector.


The Core Problem: ANN at Scale

All vector databases solve the same fundamental problem: given a query vector, find the k most similar vectors in a collection of millions or billions of records efficiently.

Exact nearest-neighbor search is O(n·d) per query—too slow at scale. Every serious vector database uses an approximate nearest neighbor (ANN) index. The two dominant families are:

  • HNSW (Hierarchical Navigable Small World): Graph-based, high recall, fast queries, expensive to build and store
  • IVF (Inverted File Index): Cluster-based, faster to build and lighter on memory, but typically lower recall at the same query latency

Understanding this distinction explains most of the performance tradeoffs you'll encounter.
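
To make the tradeoff concrete, here is a minimal pure-Python sketch that compares brute-force search with a toy IVF index. It is an illustration under simplifying assumptions, not how any of the databases below implement their indexes: the centroids are sampled at random instead of being trained with k-means, and all function names (exact_top_k, build_ivf, ivf_top_k, recall_at_k) are hypothetical.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def exact_top_k(query, vectors, k):
    """Brute force: score all n vectors, so cost grows as O(n*d) per query."""
    order = sorted(range(len(vectors)), key=lambda i: cosine(query, vectors[i]), reverse=True)
    return order[:k]


def build_ivf(vectors, n_lists, seed=0):
    """Toy IVF build: sample centroids, then assign each vector to its nearest one.

    Real implementations train centroids with k-means; sampling keeps this short.
    """
    rng = random.Random(seed)
    centroids = [vectors[i] for i in rng.sample(range(len(vectors)), n_lists)]
    lists = [[] for _ in range(n_lists)]
    for i, vec in enumerate(vectors):
        nearest = max(range(n_lists), key=lambda c: cosine(vec, centroids[c]))
        lists[nearest].append(i)
    return centroids, lists


def ivf_top_k(query, vectors, centroids, lists, k, nprobe):
    """Scan only the nprobe lists whose centroids are most similar to the query."""
    probed = sorted(
        range(len(centroids)), key=lambda c: cosine(query, centroids[c]), reverse=True
    )[:nprobe]
    candidates = [i for c in probed for i in lists[c]]
    candidates.sort(key=lambda i: cosine(query, vectors[i]), reverse=True)
    return candidates[:k]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true top-k that the approximate search returned."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


if __name__ == "__main__":
    rng = random.Random(42)
    data = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(2000)]
    query = [rng.gauss(0, 1) for _ in range(16)]
    truth = exact_top_k(query, data, k=10)
    centroids, lists = build_ivf(data, n_lists=20)
    for nprobe in (1, 5, 20):
        found = ivf_top_k(query, data, centroids, lists, k=10, nprobe=nprobe)
        print(nprobe, recall_at_k(found, truth))
```

Raising nprobe scans more lists, trading latency for recall. When nprobe equals the number of lists, every vector is scanned and the search degenerates into the brute-force case.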


Option 1: Pinecone

Pinecone is a fully managed, serverless vector database. You don't manage infrastructure—you create an index and call an API.

When to use Pinecone

  • Startup or team without dedicated infrastructure engineers
  • Need to be in production within a day
  • Budget isn't the primary constraint

Key limits to know

Serverless (recommended tier):
  - Max vector dimensions: 20,000
  - Max metadata per vector: 40 KB
  - Max index size: limited by spend, not hard ceiling

Pod-based (legacy):
  - p1.x1: 1M vectors of dim-768 per pod
  - s1.x1: 5M vectors of dim-768 per pod

Upsert and query pattern

from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="YOUR_KEY")

# Create index
pc.create_index(
    name="documents",
    dimension=1536,  # text-embedding-3-small
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)

index = pc.Index("documents")

# Upsert vectors with metadata
# (embedding_1 and embedding_2 are placeholders for 1536-float lists computed elsewhere)
vectors = [
    ("doc-001", embedding_1, {"text": "...", "source": "wiki"}),
    ("doc-002", embedding_2, {"text": "...", "source": "internal"}),
]
index.upsert(vectors=vectors, namespace="v1")

# Query
results = index.query(
    vector=query_embedding,
    top_k=10,
    include_metadata=True,
    filter={"source": {"$eq": "wiki"}},
    namespace="v1",
)

Production gotchas

  • Namespace isolation is critical for multi-tenant apps—don't skip it
  • Metadata filtering can tank recall or latency when filters are highly selective, so benchmark with the filters you will actually run in production
  • Batch your upserts: sending around 100 vectors per request performs dramatically better than one request per vector
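
The batching advice can be sketched as a small helper. This is one possible shape rather than an official Pinecone utility: chunked and upsert_in_batches are hypothetical names, and the only assumption about the index object is that it exposes upsert(vectors=..., namespace=...) as in the example above.

```python
from itertools import islice


def chunked(items, size=100):
    """Yield successive lists of at most `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def upsert_in_batches(index, vectors, namespace, batch_size=100):
    """Send vectors in batches and return how many were sent.

    `index` is any object exposing upsert(vectors=..., namespace=...).
    """
    sent = 0
    for batch in chunked(vectors, batch_size):
        index.upsert(vectors=batch, namespace=namespace)
        sent += len(batch)
    return sent
```

Passing a per-tenant namespace into the same helper keeps the isolation and batching concerns in one place.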

Option 2: Weaviate

Weaviate is an open-source vector database with native multi-modal support and a GraphQL API. It can run self-hosted or on Weaviate Cloud.

Architecture overview

Weaviate organizes data into collections (called classes in older versions), which play a role similar to tables. Each collection can have its own vectorizer module (e.g., OpenAI, Cohere, a local model) or accept pre-computed vectors.

import weaviate
from weaviate.classes.config import Configure, Property, DataType

client = weaviate.connect_to_local()

# Create collection
client.collections.create(
    name="Document",
    vectorizer_config=Configure.Vectorizer.none(),  # we'll provide vectors
    properties=[
        Property(name="text", data_type=DataType.TEXT),
        Property(name="source", data_type=DataType.TEXT),
        Property(name="created_at", data_type=DataType.DATE),
    ],
)

collection = client.collections.get("Document")

# Batch insert
with collection.batch.dynamic() as batch:
    for doc in documents:
        batch.add_object(
            properties={"text": doc.text, "source": doc.source},
            vector=doc.embedding,
        )

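# Hybrid search (BM25 + vector)
# With Vectorizer.none(), Weaviate cannot embed the query text itself,
# so pass a precomputed query vector (query_embedding is a placeholder).
results = collection.query.hybrid(
    query="machine learning infrastructure",
    vector=query_embedding,
    alpha=0.75,  # 0 = pure BM25, 1 = pure vector
    limit=10,
)

client.close()  # release the connection when done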

Why Weaviate stands out

Hybrid search is first-class. Combining sparse (BM25) and dense (vector) retrieval in a single query often beats pure vector search on knowledge-base workloads, typically by 5–15% on NDCG@10, although the margin depends on the corpus and query mix.
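
To show what the alpha weight does, here is a simplified score-fusion sketch. It is an assumption-laden illustration, not Weaviate's actual fusion code: each score set is min-max normalized and then blended, and the names normalize and hybrid_rank are hypothetical.

```python
def normalize(scores):
    """Min-max scale a {doc_id: score} map into the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_rank(bm25_scores, vector_scores, alpha=0.75, limit=10):
    """Blend keyword and vector scores: alpha=0 is pure BM25, alpha=1 is pure vector."""
    bm25 = normalize(bm25_scores)
    dense = normalize(vector_scores)
    fused = {
        # A document missing from one result set contributes 0 for that side.
        doc: alpha * dense.get(doc, 0.0) + (1 - alpha) * bm25.get(doc, 0.0)
        for doc in set(bm25) | set(dense)
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)[:limit]
```

A document that ranks well in only one of the two result sets can still surface, which is where hybrid search tends to help on queries with rare keywords or product names.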

When to use Weaviate

  • Need hybrid search out of the box
  • Multi-modal data (text + images)
  • Want self-hosted with a managed option available

Option 3: Qdrant

Qdrant is a high-performance, open-source vector search engine written in Rust. It's gaining adoption for latency-critical applications.

Payload filtering done right

Qdrant's killer feature is payload-filtered search: filters are applied during ANN traversal, not after. This means filtered queries are nearly as fast as unfiltered ones.

from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams, PointStruct,
    Filter, FieldCondition, MatchValue,
)

client = QdrantClient(url="http://localhost:6333")

client.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

# Upsert
client.upsert(
    collection_name="docs",
    points=[
        PointStruct(id=1, vector=embedding, payload={"source": "wiki", "year": 2024}),
    ],
)

# Filtered search: the filter is applied DURING HNSW traversal
# (recent qdrant-client releases deprecate search() in favor of query_points())
results = client.search(
    collection_name="docs",
    query_vector=query_embedding,
    query_filter=Filter(
        must=[FieldCondition(key="source", match=MatchValue(value="wiki"))]
    ),
    limit=10,
)
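
Why filtering during traversal matters can be seen in a toy comparison. The sketch below uses a pre-scored list as a stand-in for an ANN index, so it says nothing about Qdrant's internals. The function names are hypothetical, and the second function models only the outcome of filter-aware search, not the graph traversal itself.

```python
def post_filter_search(scored, matches, k):
    """Take the global top-k first, then drop items that fail the filter."""
    top = sorted(scored, key=lambda item: item[1], reverse=True)[:k]
    return [doc for doc, _ in top if matches(doc)]


def filtered_search(scored, matches, k):
    """Apply the filter while selecting candidates, so up to k matches come back."""
    eligible = [item for item in scored if matches(item[0])]
    eligible.sort(key=lambda item: item[1], reverse=True)
    return [doc for doc, _ in eligible[:k]]


if __name__ == "__main__":
    # 100 documents with descending scores; only 5% of them pass the filter.
    scored = [(i, 1.0 - i / 100) for i in range(100)]
    is_match = lambda doc_id: doc_id % 20 == 19
    print(post_filter_search(scored, is_match, k=10))
    print(filtered_search(scored, is_match, k=10))
```

With a filter this selective, the post-filter variant returns nothing because none of the global top 10 match, while the filter-aware variant still returns every matching document.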

When to use Qdrant

  • Highly selective metadata filters on large collections
  • Latency-critical workloads (p99 under 20ms)
  • On-prem or air-gapped environments

Option 4: pgvector

pgvector is a PostgreSQL extension. If you're already on Postgres, it's the lowest-friction path to vector search.

-- Enable the extension (pgvector must already be installed on the server)
CREATE EXTENSION IF NOT EXISTS vector;

-- Create table
CREATE TABLE documents (
  id BIGSERIAL PRIMARY KEY,
  content TEXT,
  source TEXT,
  embedding VECTOR(1536)
);

-- HNSW index (faster queries, more memory)
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

-- Query
SELECT id, content, 1 - (embedding <=> $1::vector) AS similarity
FROM documents
WHERE source = 'wiki'
ORDER BY embedding <=> $1::vector
LIMIT 10;

The transactional advantage

pgvector shines when you need ACID semantics across vector and relational data. Updating a document and its embedding atomically is trivial in Postgres. In a dedicated vector DB, you'd need to coordinate two separate writes.
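
The atomic-update idea can be sketched without a Postgres server. The example below is an assumption-heavy stand-in: it uses Python's built-in sqlite3 module and stores the embedding as JSON text, purely so the sketch runs anywhere. With pgvector the same pattern would apply to a VECTOR column through a Postgres driver. The names update_document and EMBEDDING_DIM are hypothetical.

```python
import json
import sqlite3

EMBEDDING_DIM = 4  # tiny for the sketch; the table above uses 1536


def update_document(conn, doc_id, content, embedding):
    """Update a document's text and embedding in one transaction."""
    with conn:  # commits on success, rolls back if the block raises
        conn.execute("UPDATE documents SET content = ? WHERE id = ?", (content, doc_id))
        if len(embedding) != EMBEDDING_DIM:
            raise ValueError("embedding has the wrong dimension")
        conn.execute(
            "UPDATE documents SET embedding = ? WHERE id = ?",
            (json.dumps(embedding), doc_id),
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, content TEXT, embedding TEXT)")
conn.execute(
    "INSERT INTO documents VALUES (1, 'old text', ?)",
    (json.dumps([0.0] * EMBEDDING_DIM),),
)
conn.commit()

try:
    update_document(conn, 1, "new text", [0.1, 0.2])  # wrong dimension on purpose
except ValueError:
    pass

# The content change was rolled back together with the failed embedding write.
print(conn.execute("SELECT content FROM documents WHERE id = 1").fetchone())
```

With a separate vector database, the equivalent failure would leave the relational row updated and the vector stale unless the application adds its own compensation logic.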

Performance ceiling

pgvector's HNSW index tops out around 5–10M vectors before query latency degrades noticeably; the exact point depends on vector dimensionality and how much of the index fits in memory. Beyond that, you're looking at partitioning or a dedicated vector database.

When to use pgvector

  • Existing Postgres infrastructure
  • < 5M vectors
  • Need transactional consistency between vector and relational data
  • Minimize operational complexity

Head-to-Head Comparison

Dimension              | Pinecone              | Weaviate      | Qdrant                 | pgvector
Operational overhead   | None (managed)        | Low–Medium    | Low                    | None (if on Postgres)
Filtered search        | Post-ANN              | Post-ANN      | During ANN             | Post-ANN
Hybrid search          | No                    | Yes (native)  | Yes                    | Via RRF manually
Scale ceiling          | Effectively unlimited | ~100M+        | ~100M+                 | ~5–10M
Latency (p99 @1M vecs) | ~30ms                 | ~20ms         | ~10ms                  | ~15ms
Open source            | No                    | Yes           | Yes                    | Yes
Best fit               | Managed simplicity    | Hybrid search | Filter-heavy workloads | Postgres shops

Production Configuration Checklist

Regardless of which database you choose:

  • Set ef_search (HNSW) or nprobe (IVF): Higher values improve recall but increase latency. Start at ef_search=100 and tune against measured recall
  • Filter selectivity: If > 50% of vectors match a filter, post-ANN filtering is fine; below 10%, you need Qdrant-style during-traversal filtering
  • Dimension reduction: If using OpenAI's text-embedding-3-large (3072 dims), consider truncating to 1536 or 512 dims for 2x or 6x storage savings respectively, usually with minimal recall loss
  • Monitoring: Track p99 query latency and recall@k over time, since both degrade as the index grows
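
The dimension-reduction arithmetic can be checked with a short sketch. It assumes float32 storage (4 bytes per value) and counts only the raw vectors, not index overhead. Truncating by slicing is only reasonable for embedding models trained to support it, and the shortened vector should be rescaled to unit length before cosine or dot-product search. The function names are hypothetical.

```python
import math


def truncate_embedding(vector, dims):
    """Keep the first `dims` components and rescale the result to unit length."""
    if not 0 < dims <= len(vector):
        raise ValueError("dims must be between 1 and len(vector)")
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero vector")
    return [x / norm for x in head]


def raw_storage_bytes(n_vectors, dims, bytes_per_value=4):
    """Raw float32 storage for the vectors alone, excluding index overhead."""
    return n_vectors * dims * bytes_per_value


if __name__ == "__main__":
    for dims in (3072, 1536, 512):
        gigabytes = raw_storage_bytes(1_000_000, dims) / 1e9
        print(dims, "dims:", round(gigabytes, 2), "GB per million vectors")
```

Whatever target size you pick, re-measure recall@k on your own queries after truncating, because the acceptable loss depends on the workload.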

Building a RAG system on top of your vector store? Read our guide on RAG Systems at Scale.
