Building a Sovereign Data Platform: Architecture and Lessons Learned

In brief: A complete data platform (collection, storage, transformation, reporting) hosted on your servers or with a European host. Zero dependence on American cloud, predictable costs, controlled carbon footprint. Stack: MinIO, PostgreSQL, Airflow, K3s.

Introduction

Your company's business data is scattered across multiple systems. Your teams spend time consolidating manual exports. Your Power BI dashboards are waiting to be fed. And all of this runs on AWS or Azure, with rising costs, questions about data location that remain unanswered, and a carbon footprint that no one really knows how to measure.

At DataKhi, we built a data processing platform entirely hosted on our client's infrastructure, without any dependence on public cloud. This article explains why and how.

Why Do Without Public Cloud

The Legal Problem

The American CLOUD Act, passed in 2018, authorizes American authorities to access data stored by American companies — including when this data is physically hosted in Europe. If your company uses AWS, Azure, or Google Cloud, your data can be subject to an American requisition without you being informed.

Many companies think they're protected because their data is in a European datacenter. This confuses data residency with data sovereignty. What matters is whether the provider is immune to extraterritorial laws. The hyperscalers' "sovereign cloud" initiatives don't solve this problem: the parent company remains subject to American law.

The Economic Problem

Public cloud billing is usage-based: compute, storage, network transfer, API requests. These costs are difficult to predict and can explode as volumes grow. The real advantage of private cloud isn't necessarily gross savings; it's predictability: you know exactly what your infrastructure costs.

The Environmental Problem

We often forget: the network consumes enormous amounts of energy. Every request to a distant datacenter, every data transfer between your infrastructure and the cloud, every API call — all of this has a carbon cost. Hyperscalers communicate on the energy efficiency of their datacenters (PUE close to 1.1), but they stay silent about the energy consumed by the network to reach them. Moving data across thousands of kilometers has an irreducible cost.

Does a local server consume more than a VM in an optimized datacenter? Probably, watt for watt. But when your data stays on-premises, you eliminate network round-trips. And above all: you pay attention to it. The consumption of a physical server on your premises is visible, measurable, concrete. You can optimize it, monitor it, and consciously decide to turn it off at night once processing is done.

In the cloud, this consumption is abstract, buried in a monthly invoice. The hyperscalers' economic model pushes toward overconsumption: it's easier to over-provision "just in case" than to optimize finely. The result: allocated resources sitting idle, forgotten instances, dev environments left running over the weekend. On your own infrastructure, every watt counts, and that changes how you design your systems.

The Target Architecture

We implemented a layered architecture:

Sources → Data Lake (MinIO) → Data Warehouse (PostgreSQL) → Analytics (Power BI)

This separation makes it possible to preserve raw data (if a business rule changes, we recompute from the source), to trace each transformation (we know where every number comes from), and to add new sources without disturbing existing ones.

The whole runs on a lightweight Kubernetes cluster (K3s), deployed via Ansible, with a private Docker registry. No external dependencies: even if the Internet connection goes down, the pipeline keeps running.

The Technical Building Blocks

Object Storage: MinIO

MinIO plays the role of data lake. It's an object storage system compatible with the Amazon S3 API, but that you install on your own servers.

Why MinIO: it's currently the most complete S3-compatible implementation on the market. Stable, proven in production for years, with an active community. The organization of data into buckets and prefixes makes it possible to use wildcard reads with Spark or other processing engines, which considerably simplifies data pipelines.
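
As an illustration, here is a minimal sketch of how such a prefix layout might be built. The bucket name, key layout, and the "acme"/"crm" identifiers are hypothetical, and the boto3 upload shown in the comment is one possible way to talk to a MinIO endpoint, not the project's actual code:

```python
from datetime import date

def object_key(client: str, source: str, day: date) -> str:
    """Build a data-lake key like raw/acme/crm/2025/01/15.parquet.

    Partitioning keys by client/source/date lets engines such as Spark
    read whole slices with a wildcard, e.g. raw/acme/crm/2025/*.
    """
    return f"raw/{client}/{source}/{day:%Y/%m/%d}.parquet"

# Uploading then boils down to a single S3-compatible PUT, for example
# with boto3 pointed at the MinIO endpoint (illustrative endpoint/bucket):
#   s3 = boto3.client("s3", endpoint_url="https://minio.internal:9000")
#   s3.upload_file("crm.parquet", "datalake",
#                  object_key("acme", "crm", date.today()))

print(object_key("acme", "crm", date(2025, 1, 15)))
```

The point is that the partitioning convention, not the storage engine, is what keeps downstream reads simple.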

In late December 2025, MinIO announced the transition of its community edition to maintenance mode. Concretely, this means a stable and mature base to rely on, with maintained security patches. For a production infrastructure, it's rather reassuring: no breaking changes to manage, no race for new features.

Data Warehouse: PostgreSQL

PostgreSQL serves as the data warehouse. It's a choice that may surprise in a world where specialized analytical engines multiply. Our reasoning follows the philosophy of the "Choose Boring Technology" manifesto: each technology has a learning and maintenance cost, and this cost is paid over time. Each additional service is also one more process running, RAM mobilized, CPU cycles consumed. Consolidating on PostgreSQL means doing more with less.

PostgreSQL is universal. Everyone knows it, everyone knows how to operate it, everyone knows how to query it. The documentation is exhaustive, common problems have been documented for years, and skills are easy to find on the market.

The extension ecosystem broadens its capabilities: PostGIS for geospatial data, pg_cron for scheduling, TimescaleDB for time series. And if performance one day hits a plateau, you can add a Redis cache or bolt on an analytical engine for reads. PostgreSQL remains the source of truth; the rest is layered on top.

The data schema follows a star schema, classic in Business Intelligence: dimension tables linked to fact tables, partitioned by client. This model optimizes analytical query performance and simplifies report creation in Power BI.
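
To make the dimension/fact split concrete, here is a self-contained sketch. Table and column names are illustrative, not the client's actual schema, and sqlite stands in for PostgreSQL so the example runs anywhere; in PostgreSQL the fact table would additionally use declarative partitioning by client:

```python
import sqlite3

# Minimal star schema: one dimension table, one fact table.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE dim_client (client_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE fact_sales (
        client_id INTEGER REFERENCES dim_client(client_id),
        day       TEXT,
        amount    REAL
    );
    INSERT INTO dim_client VALUES (1, 'acme');
    INSERT INTO fact_sales VALUES (1, '2025-01-15', 120.0),
                                  (1, '2025-01-16', 80.0);
""")

# A typical analytical query: join facts to dimensions, then aggregate.
total = db.execute("""
    SELECT d.name, SUM(f.amount)
    FROM fact_sales f JOIN dim_client d USING (client_id)
    GROUP BY d.name
""").fetchone()
print(total)  # ('acme', 200.0)
```

Reporting tools like Power BI generate exactly this join-then-aggregate shape, which is why the star layout pays off.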

Orchestration: Apache Airflow on K3s

A data pipeline isn't just scripts. It's a sequence of tasks with dependencies: deduplication can't start before collection is complete, warehouse loading depends on staging.

Apache Airflow orchestrates this sequence. Each pipeline is defined as a task graph (DAG) in Python. Airflow schedules executions, respects dependencies between tasks, automatically restarts failed tasks, and alerts in case of problems. The web interface allows visualizing pipeline status in real-time.
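
As a sketch, a DAG expressing the dependencies described above might look like the following. This assumes Airflow 2.4+ (for the `schedule` parameter); the DAG id, schedule, and task bodies are illustrative placeholders, not the production pipeline:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def collect(): ...      # pull from source systems into the data lake
def deduplicate(): ...  # keep only the latest version of each record
def load(): ...         # idempotent load into the PostgreSQL warehouse

with DAG(
    dag_id="daily_reporting",
    schedule="0 6 * * *",            # every morning at 06:00
    start_date=datetime(2025, 1, 1),
    catchup=False,
    default_args={"retries": 2},     # failed tasks are retried automatically
) as dag:
    t_collect = PythonOperator(task_id="collect", python_callable=collect)
    t_dedup = PythonOperator(task_id="deduplicate", python_callable=deduplicate)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # Dependencies: dedup can't start before collection is complete,
    # and warehouse loading depends on the cleaned output.
    t_collect >> t_dedup >> t_load
```

The `>>` chaining is how Airflow encodes the task graph; the scheduler takes care of ordering, retries, and alerting from there.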

Airflow runs on K3s, a Kubernetes distribution developed by Rancher Labs (now part of SUSE). K3s isn't "lightweight" in the sense of "limited": it's a more understandable, more maintainable version that's easier to install and configure. A single binary under 100 MB, one-command installation, and a memory footprint well below that of standard Kubernetes. Where a standard K8s cluster consumes several gigabytes of RAM for the control plane alone, K3s runs comfortably with 512 MB. Paired with Rancher, it immediately provides a powerful monitoring interface for visualizing cluster state, pods, and logs.

Each pipeline task executes in its own Docker container, isolated from others. If a task crashes or consumes too much memory, it doesn't affect the others. Multiple clients can be processed simultaneously, each in its container.

Collection: Adapting to Existing Sources

Data sources in companies are rarely homogeneous. REST APIs, web interfaces without export, Excel files on a network share, collaborative Google Sheets, legacy databases — each system has its logic.

The chosen approach: one connector per source, encapsulated in an Airflow task. Whether the data comes from an API, web scraping with Playwright, an SFTP file, or a Google Sheet, the principle remains the same: extract, convert to Parquet, deposit in the data lake. The DAG orchestrates the whole and manages dependencies.

This modularity allows adding a new source without touching the rest of the pipeline. Just write the connector and integrate it into the task graph.
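
The "one connector per source, same contract" idea can be sketched as follows. The class and field names are hypothetical, and the Parquet/upload steps are only indicated in a comment, since they depend on the lake client in use:

```python
import csv
import io
import json
from typing import Iterable, Protocol

class Connector(Protocol):
    """Every source implements the same contract: yield plain dict rows.

    Downstream steps (Parquet conversion, upload to the data lake) are
    then identical whatever the source.
    """
    def extract(self) -> Iterable[dict]: ...

class CsvConnector:
    """E.g. a file retrieved over SFTP or from a network share."""
    def __init__(self, text: str):
        self.text = text
    def extract(self) -> Iterable[dict]:
        return list(csv.DictReader(io.StringIO(self.text)))

class JsonApiConnector:
    """E.g. the body of a REST API response (the HTTP call is omitted)."""
    def __init__(self, payload: str):
        self.payload = payload
    def extract(self) -> Iterable[dict]:
        return json.loads(self.payload)

def run(connector: Connector) -> list[dict]:
    rows = list(connector.extract())
    # Next steps (not shown): pyarrow.Table.from_pylist(rows) -> Parquet,
    # then a single S3-compatible PUT into the MinIO data lake.
    return rows

print(run(CsvConnector("id,name\n1,acme\n")))
```

Adding a source then really is just one new class plus one new node in the DAG.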

Infrastructure as Code: Ansible

The infrastructure isn't installed manually. Everything is described in configuration files, versioned in Git, and automatically deployed with Ansible.

An Ansible playbook describes the desired state of the infrastructure: which services to install, which configuration to apply, which secrets to inject. This approach brings three major benefits.

Resilience first: in case of major incident (hardware failure, corruption), the complete environment can be rebuilt in one command on a new machine. No obsolete documentation to follow, no "I don't remember how it was configured".

Reproducibility next: we can duplicate the infrastructure identically for a test environment, development, or for a new client. Same configuration, guaranteed same behavior.

Scalability finally: adding a node to the cluster, deploying to a new site, replicating the infrastructure at another host — everything is done by changing a few variables and rerunning the playbook. The infrastructure becomes code, with all the advantages that implies: versioning, code review, rollback.

The Pipeline in Practice

Every morning, the pipeline triggers automatically. Here's what happens, illustrated by our client case.

The DAG starts by retrieving the list of clients and their connection parameters to source systems. For each client, collection tasks execute in parallel: one Kubernetes pod per client, per source. Raw data is deposited in the data lake in Parquet format.

Then comes the deduplication phase. Extractions can overlap from one day to the next. The pipeline identifies duplicates and keeps only the most recent version. Output: one clean file per day and per client.
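
The keep-the-most-recent-version logic can be sketched in a few lines. The `id` and `extracted_at` field names are illustrative, standing in for whatever business key and extraction timestamp the real records carry:

```python
def deduplicate(rows: list[dict]) -> list[dict]:
    """Keep only the most recent version of each record.

    Rows are sorted by extraction timestamp, so later extractions
    overwrite earlier ones in the `latest` map.
    """
    latest: dict = {}
    for row in sorted(rows, key=lambda r: r["extracted_at"]):
        latest[row["id"]] = row
    return list(latest.values())

rows = [
    {"id": 1, "status": "open",   "extracted_at": "2025-01-14"},
    {"id": 1, "status": "closed", "extracted_at": "2025-01-15"},  # newer
    {"id": 2, "status": "open",   "extracted_at": "2025-01-15"},
]
print(deduplicate(rows))  # id 1 keeps its "closed" (newest) version
```

At scale the same logic would run as a windowed query in Spark or SQL, but the invariant is identical: one clean row per key.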

Transformation applies business rules. Technical statuses become categories understandable by end users. These rules are explicitly coded, versioned, testable. They exactly reproduce the logic that was previously buried in Excel formulas or Power Query queries.

Warehouse loading feeds the PostgreSQL tables. Loading is idempotent: rerunning the pipeline for an already processed date doesn't create duplicates. Finally, pre-calculated aggregations accelerate dashboards.
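
Idempotence typically comes from loading with an upsert rather than a plain insert. The sketch below uses sqlite (3.24+) as a stand-in for PostgreSQL, since both accept the same `INSERT ... ON CONFLICT` syntax; the table and column names are illustrative:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE fact_daily (
        client_id INTEGER,
        day       TEXT,
        amount    REAL,
        PRIMARY KEY (client_id, day)   -- the natural key of one load unit
    )
""")

def load(rows):
    """Upsert: rerunning the same batch updates rows instead of duplicating."""
    db.executemany(
        """INSERT INTO fact_daily (client_id, day, amount)
           VALUES (?, ?, ?)
           ON CONFLICT (client_id, day) DO UPDATE SET amount = excluded.amount""",
        rows,
    )

batch = [(1, "2025-01-15", 120.0)]
load(batch)
load(batch)  # rerunning the pipeline for an already processed date
count = db.execute("SELECT COUNT(*) FROM fact_daily").fetchone()[0]
print(count)  # 1 -- no duplicates
```

Because the primary key matches the pipeline's unit of work (client and day), replaying any date is always safe.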

The whole run takes a few minutes. Dashboards are fed with fresh data. In case of anomaly, an alert is sent. The client team can follow execution via the Airflow or Rancher interface.

Possible Extensions

The described architecture covers daily batch reporting. Depending on needs, it can evolve in several directions.

Real-time. The current pipeline runs once a day. For streaming or event-driven use cases, we would add Apache Kafka or Redpanda as a message broker, with consumers feeding the warehouse continuously. The layered architecture stays the same; only the frequency changes.

Machine Learning. The stack is optimized for BI reporting. To industrialize ML models, we would add MLflow for experiment tracking and model registry, potentially a feature store to share features between teams. PostgreSQL remains the source of truth, models consume data via views or exports.

High availability. The current architecture runs on a single node. For high availability, K3s natively supports multi-node mode with PostgreSQL or etcd as backend. MinIO can be deployed in distributed mode across multiple servers. Infrastructure as Code facilitates this scaling.

Advanced governance. For cataloging, lineage, and data quality needs, tools like Apache Atlas, DataHub, or Great Expectations integrate naturally with this stack.

Sovereign reporting. Since the architecture exposes standard connectors, you can plug any reporting or analytics tool into it. If you want to take the sovereign approach all the way, open-source, self-hosted solutions exist for reporting and data visualization: an alternative to American cloud tools, hosted on the same infrastructure as your data.

Who This Approach Is For

This architecture is for organizations that need to control their data: location, access, costs, environmental impact. It's particularly suitable for sensitive sectors (healthcare, finance, public sector), companies with predictable volumes, those who prioritize technological independence, and organizations engaged in a CSR approach who want to measure and reduce the carbon footprint of their digital infrastructure.

It requires skills to operate the infrastructure — skills that can be internalized progressively or outsourced via a support contract. The advantage of these skills: they're transferable. PostgreSQL, Airflow, Kubernetes apply to all contexts, unlike expertise on a proprietary service.

Conclusion

Building a sovereign data platform in 2025 isn't a step backward. MinIO, PostgreSQL, Airflow, Kubernetes are no longer experimental projects: they power some of the world's largest infrastructures. The difference is that you can deploy them on your machines, in your datacenter, under your control.

The principle remains the same as with cloud giants: collect, store, transform, distribute. But on an infrastructure you control, with predictable costs, a carbon footprint you can measure and optimize, and without dependence on providers subject to jurisdictions that aren't yours.

Need help building your sovereign data platform? Contact our experts to discuss your project.
