
Recommendation Systems: Which Database Architecture Should You Choose?
Graph DB vs Vector DB comparison for recommendation systems. Real-world experience with PostgreSQL and pgvector on an eco-friendly fashion aggregator.
Introduction
Your users browse your catalog without finding anything. They scroll, click, leave. Meanwhile, your competitors display "Customers who bought X also liked Y" and convert. Recommendation systems are no longer a luxury: at leading e-commerce sites, they drive an average of 24% of orders, and Netflix estimates that 80% of the content watched on its platform is discovered through recommendations.
But behind these statistics hides a technical question rarely addressed: how to store and query the data that feeds these recommendations? Graph databases, vector databases, hybrids — each approach has its advantages and limitations.
In this article, we start from a concrete case — Tossée, an eco-responsible fashion aggregator — to explore the database architectures behind a recommendation system. The challenge: recommending ecological alternatives from poorly structured textual data, sourced from dozens of different vendors. We started with a graph approach, then evolved toward vector-based storage. But why not a hybrid approach? What do the major platforms do, and what can we learn from them at our scale?
Use Cases
A recommendation system becomes relevant when several of these signals appear:
- Your catalog exceeds a few hundred references. Below that, manual navigation is enough. Beyond that, your users get lost.
- Your conversion rate stagnates despite traffic. Visitors come but don't find what they're looking for — or don't know they're looking for it.
- Your product data is rich. Detailed descriptions, structured attributes, multiple categories: the more material you have, the more relevant recommendations will be.
- You have user history. Purchases, clicks, abandoned carts — these interactions feed collaborative filtering.
- Your teams spend time manually "pushing" products. An automated system frees up this time and personalizes at scale.
On the Tossée project, several of these signals were present: a catalog of tens of thousands of references from multiple vendors, rich but poorly structured product data, and a need for automated recommendations to guide consumers toward eco-responsible alternatives.
Why Database Choice Matters
The Scalability Problem
A recommendation system can work in two ways. The first: pre-compute all recommendations in advance, store them, and serve them directly. Simple, fast to read, but costly to maintain. Each new product or new user interaction requires a recalculation.
The second: calculate on the fly. The user arrives, the system analyzes their profile in real-time, queries the database, and returns personalized suggestions. More flexible, but performance depends directly on the database's ability to respond in a few milliseconds.
The choice of storage architecture determines which of these approaches is realistic for your use case.
The Maintenance Problem
An e-commerce catalog evolves constantly. New products, removed items, price changes, description updates. The recommendation system must integrate these changes without requiring a complete rebuild.
Some architectures handle incremental addition well. Others require a global recalculation with each significant modification. On a catalog of 50,000 references, this difference can represent hours of daily processing — or a few seconds.
Content-based vs Collaborative Filtering
Before talking about databases, let's recall the two main families of recommendation algorithms.
Content-based filtering analyzes product characteristics: category, brand, description, attributes. If you bought a blue linen shirt, the system suggests other linen shirts or other blue clothes.
Collaborative filtering analyzes the behaviors of similar users. If users who bought the same items as you also bought a specific pair of pants, the system suggests it — even if those pants have no common attribute with your previous purchases.
These two approaches have different implications for storage. Content-based manipulates feature vectors. Collaborative manipulates relationships between entities (users, products, interactions). The database choice depends on the preferred approach — or the ability to combine both.
In the case of Tossée, the data was primarily textual (product descriptions, materials, categories) and user history was virtually non-existent at launch. This naturally pointed toward a content-based approach — with direct consequences on storage choice.
The Two Main Storage Families
Graph Databases
A graph database stores information as nodes and edges. Nodes represent entities (users, products, categories). Edges represent relationships between these entities (purchased, belongs to, is similar to).
This structure excels for collaborative filtering. Finding "products purchased by similar users" translates into a graph traversal: start from the user, follow their purchases, go back to other buyers of these products, follow their other purchases. This is exactly what graph databases are optimized for.
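To make this traversal concrete, here is a minimal sketch in plain Python, using dictionaries as a stand-in for a graph database. The data and the `recommend` helper are hypothetical; a real graph database would execute the same two-hop pattern natively over stored edges.

```python
from collections import Counter

# Hypothetical toy data: user -> set of purchased product ids (the "purchased" edges).
purchases = {
    "alice": {"shirt", "pants"},
    "bob": {"shirt", "scarf"},
    "carol": {"pants", "scarf", "hat"},
}

def recommend(user, purchases, k=2):
    """Two-hop traversal: user -> their products -> co-buyers -> the co-buyers' other products."""
    mine = purchases[user]
    scores = Counter()
    for other, theirs in purchases.items():
        if other == user:
            continue
        overlap = len(mine & theirs)  # shared purchases weight the co-buyer
        if overlap:
            for product in theirs - mine:  # only products the user doesn't own yet
                scores[product] += overlap
    return [product for product, _ in scores.most_common(k)]

print(recommend("alice", purchases))  # ['scarf', 'hat']
```

Both bob and carol share a purchase with alice, so the scarf they both bought outranks carol's hat.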
Advantages: recommendations are explainable. You can say "Recommended because customers like you also bought this item". Multi-hop traversals allow sophisticated recommendations (friends of friends, complementary products of similar products).
Limitations: relationships must be explicitly calculated and stored. Adding a new product to the catalog requires calculating its relationships with all existing products. On a large catalog, this recalculation can become prohibitive.

This is exactly the problem we encountered on Tossée: with tens of thousands of references and regular additions, relationship recalculation became a bottleneck. The graph approach did have an advantage: explicit relationships made it possible to audit recommendations and understand why a given product was suggested over another. But we were maintaining two databases in parallel (graph for relationships, document store for products), a constant source of bugs and operational complexity.
Vector Databases
A vector database stores each entity as a vector in a high-dimensional space. A product becomes a point in a space of several hundred dimensions (768 is a common size for text embeddings). Similarity between products is measured by the distance between their vectors, usually cosine similarity.
To recommend similar products, we take the embedding of the viewed product, then search for the k nearest vectors (k-NN, k-nearest neighbors). Indexing algorithms like HNSW (Hierarchical Navigable Small World) answer this search in a few milliseconds, even on millions of vectors.
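As a sketch of what happens under the hood, here is a brute-force cosine-distance k-NN in plain Python. The catalog, the 3-dimensional embeddings, and the function names are illustrative; real systems use embeddings of hundreds of dimensions and an approximate index like HNSW instead of a full scan.

```python
import math

def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity (what pgvector's <=> operator computes)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def knn(query, catalog, k=2):
    """Brute-force k nearest neighbors: score every item, keep the k closest."""
    return sorted(catalog, key=lambda name: cosine_distance(query, catalog[name]))[:k]

# Hypothetical tiny catalog with 3-d embeddings for readability.
catalog = {
    "navy cotton t-shirt": [0.9, 0.1, 0.0],
    "blue cotton top":     [0.8, 0.2, 0.1],
    "leather boots":       [0.0, 0.1, 0.9],
}

print(knn([1.0, 0.0, 0.0], catalog))  # ['navy cotton t-shirt', 'blue cotton top']
```

The two semantically close garments rank first; the boots, pointing in a different direction of the space, are far away. HNSW trades this exhaustive O(n) scan for an approximate graph search with logarithmic-like behavior.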
Advantages: calculation is done on the fly. A new product is immediately available for recommendation as soon as its vector is calculated — no need to recalculate its relationships with the entire catalog. Embeddings capture semantics: two products with similar descriptions will have close vectors, even without explicit relationship.
Limitations: explainability is more difficult. Why are two vectors close? The 768-dimensional space isn't interpretable by a human. Relationship-based recommendations (bought together, viewed together) require a complementary approach.
For Tossée, it was this vector approach that prevailed. Product additions were very regular, and the graph model required recalculation with each addition. With the vector approach, a new product is immediately comparable to others as soon as its embedding is calculated — no global recalculation needed.
In Practice: pgvector and PostgreSQL
Within the vector family, one solution stands out for its ease of adoption.
pgvector is a PostgreSQL extension that adds support for vectors and similarity search. It was at version 0.8.1 at the time of writing, with over 19,000 stars on GitHub, a sign of broad adoption.
Why pgvector rather than a dedicated vector database like Pinecone or Weaviate? The answer is one word: simplicity. If your infrastructure already relies on PostgreSQL, pgvector is added as an extension. No new database to operate, no synchronization between systems, no new technology to learn. And an often overlooked advantage: fewer services to run means fewer resources consumed.
Here's what a recommendation query looks like with pgvector:
```sql
SELECT id, name, description
FROM products
ORDER BY embedding <=> '[0.12, -0.34, 0.56, ...]'::vector
LIMIT 10;
```

The `<=>` operator calculates cosine distance. The query returns the 10 products whose embeddings are closest to the given vector. With an HNSW index, this query executes in a few milliseconds on hundreds of thousands of products.
This is the solution we chose for Tossée. To generate embeddings, we use CamemBERT — specifically sentence-camembert-large (336 million parameters), a model trained on French text and optimized for semantic similarity. The choice of a French model matters: clothing descriptions contain domain-specific vocabulary (jersey, viscose, ribbed knit) that a general multilingual model captured poorly.
The result: two products described as "navy blue organic cotton t-shirt" and "blue organic cotton top" have very close embeddings, without needing to explicitly define this relationship. A new product is available for recommendation in seconds — embedding calculation, PostgreSQL insertion, that's it.
In practice, comparing a vector against the entire catalog isn't always necessary. On Tossée, we apply pre-filtering before the vector search: filtering by gender, dominant color, or clothing category significantly reduces the number of comparisons. The SQL query then combines a standard WHERE clause with the vector distance operator — pgvector handles this combination natively.
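Here is a minimal Python simulation of that pre-filter-then-rank pattern. The rows, attribute names, and the `recommend` helper are hypothetical; in production the same logic is a single SQL query.

```python
import math

def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

# Hypothetical rows; in PostgreSQL, these are ordinary columns next to the vector column.
products = [
    {"name": "blue linen shirt",  "gender": "men",   "embedding": [0.9, 0.1]},
    {"name": "blue cotton shirt", "gender": "men",   "embedding": [0.8, 0.3]},
    {"name": "blue summer dress", "gender": "women", "embedding": [0.7, 0.2]},
]

def recommend(query_vec, gender, products, k=5):
    """WHERE-style pre-filter first, then rank only the survivors by vector distance."""
    candidates = [p for p in products if p["gender"] == gender]
    candidates.sort(key=lambda p: cosine_distance(query_vec, p["embedding"]))
    return [p["name"] for p in candidates[:k]]

print(recommend([1.0, 0.0], "men", products))  # ['blue linen shirt', 'blue cotton shirt']
```

In SQL, the equivalent is a standard `WHERE gender = 'men'` clause combined with `ORDER BY embedding <=> $1 LIMIT 5`: pgvector evaluates the filter and the distance ranking within the same query plan, so the vector comparison only touches the filtered subset.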
And a collateral benefit: by consolidating everything in PostgreSQL and calculating embeddings locally, we avoid network round-trips to third-party APIs. For a platform whose mission is to reduce environmental impact, less infrastructure also means less energy consumed.
Beyond the Binary: Hybrid Approaches
In reality, graph and vectors aren't the only options. Between these two families, a whole range of hybrid approaches exists.
A hybrid architecture combines both paradigms: the graph captures explicit relationships (purchases, categories, links between users), vectors capture semantic similarity (descriptions, textual attributes).
Concretely, this can take several forms. The simplest: use the graph for collaborative filtering ("users like you liked") and vectors for content-based ("products similar to this one"). A more sophisticated approach enriches embeddings with information from the graph — this is the principle of Graph Neural Networks (GNN), where a node's vector representation integrates the characteristics of its neighbors.
The GraphRAG approach pushes this logic further by combining knowledge graphs, vector search, and language models. The graph structures information, embeddings enable semantic search, and the LLM generates contextualized responses.
These architectures offer the best of both worlds — at the price of increased complexity. They're justified when data is rich and heterogeneous, or when quality requirements are very high. For Tossée, we didn't need this complexity — but these hybrid principles inspire how we design our systems: start simple, keep the option to enrich the architecture as needs evolve.
What the Giants Do
The previous sections present graph, vector, and hybrid approaches as distinct options. In practice, major platforms don't limit themselves to one paradigm: they build complete systems that combine multiple approaches, backed by data volumes and engineering teams far beyond the ordinary. An overview of these architectures helps us understand where the field is converging, and why these solutions remain out of reach for most projects.
Netflix presented at its PRS 2025 workshop its "Hydra" architecture: a multi-task learning system that consolidates different ranking models into a single shared model. The goal: simplify an infrastructure that had become too complex, where each "row" of the homepage had its own pipeline. Netflix is also developing a central Foundation Model that learns user preferences and content characteristics from all available data — an approach that combines collaborative and content-based filtering at massive scale, exactly the hybrid described above, but with incomparable resources.
Spotify published at RecSys 2025 several significant advances. Their AudioBoost system uses synthetic queries generated by LLM to solve the cold-start problem on audiobooks — a typically content-based problem, solved here through vector generation. Their Text2Tracks research (April 2025) explores using generative models for recommendations based on natural language prompts.
The 2025-2026 trend is toward convergence. Microsoft Research evolved GraphRAG with LazyGraphRAG (June 2025), an approach that reduces indexing costs by 99.9% compared to GraphRAG while maintaining comparable quality. The principle: combine best-first and breadth-first search iteratively, deferring LLM usage to maximize efficiency.
These architectures are impressive — and disproportionate for most projects. But they're a source of inspiration: the underlying principles (combining multiple signals, iterating on recommendation quality, simplifying infrastructure) apply at more modest scales. On Tossée, we adopted this logic of progressive simplification rather than maximum sophistication.
Summary: Graph vs Vector
Here's what we observed concretely on the Tossée project:
| Criterion | Graph database | Vector database (pgvector) |
|---|---|---|
| Semantic similarity | Low (explicit relationships only) | Strong (embeddings capture meaning) |
| Explainability | Excellent (named relationships) | Limited (vector distance) |
| Adding products | Recalculation necessary | Instantaneous |
| Maintenance | Complex (multi-database sync) | Simple (single database) |
| Tech stack | Graph database + document database | PostgreSQL alone |
The loss of traceability was an accepted tradeoff. With the graph, explicit relationships made recommendation auditing straightforward. With vectors, auditing relies on analyzing distances and common attributes — less immediate, but sufficient for our needs.
Decision Guide
Which architecture to choose for your project? Here are our recommendations based on common use cases.
Choose a Graph Database When...
- Explicit relationships are at the heart of the business. Social network (friends of friends), referral system, complex category trees.
- Explainability is a regulatory requirement. Some sectors (finance, healthcare) require being able to explain why a recommendation was made. Graph databases excel in this area.
- Multi-hop traversals are frequent. "Products bought by customers who also bought what I bought" — this type of query is natural in graph, complex in vector.
Choose a Vector Database When...
- Data is primarily textual. Product descriptions, articles, editorial content. Embeddings capture semantics without structuring effort.
- The catalog evolves frequently. E-commerce with daily additions, marketplace, content catalog. Instant addition is a decisive advantage.
- You already use PostgreSQL. pgvector is added without revolutionizing your stack. No new database to operate, no skills to acquire.
- Budget is constrained. A single database to maintain, less operational complexity, lower infrastructure costs.
Consider a Hybrid Approach When...
- Both types of data coexist. User-product relationships (graph) + semantic similarity between products (vectors).
- You're integrating an LLM. The GraphRAG approach combines knowledge graphs and vector search to feed language models.
- Requirements evolve. Start simple (vector), enrich with explicit relationships if the need is confirmed.
Conclusion
The choice between graph database and vector database isn't a theoretical debate. It's an architectural decision that impacts maintainability, performance, and costs of your recommendation system.
Our experience on the Tossée project taught us that operational simplicity matters as much as raw performance. An architecture you know how to maintain is worth more than an optimal architecture you struggle to operate.
pgvector and PostgreSQL aren't the answer to all use cases. But for a content-based recommendation system on textual data, with a catalog that evolves frequently, it's a pragmatic and proven choice.
Need help with your recommendation system? Contact our experts to discuss your project.
Sources and Documentation
Vector Databases
- pgvector - Open-source vector similarity search for PostgreSQL
- AWS - pgvector 0.8.0: performance and benchmarks
- PostgreSQL vs Vector Databases - DBA perspective
- Pinecone - HNSW Explained
Hybrid Approaches and Graph Neural Networks
- Microsoft Research - LazyGraphRAG
- ACM Computing Surveys - Graph Neural Networks in Recommender Systems
- GNN Book - Graph Neural Networks in Modern Recommender Systems
- Towards Data Science - GNN Architectures for Recommendation Systems
Industrial Use Cases
- Netflix TechBlog - Foundation Model for Personalized Recommendation
- Netflix PRS Workshop 2025 - Hydra Architecture
- Spotify Research - Beyond the Next Track (RecSys 2025)
- Spotify Research - Text2Tracks