Every morning I get an email with the ten most relevant articles from my RSS feeds, summarised in French, sorted by topic, with a clickable link for each one. Five minutes of reading. No paid subscription. No data sent to any US tech giant.
Here’s how I built it — and why digital sovereignty was a non-negotiable requirement.
The problem: too much information, not enough signal
I follow about a hundred RSS feeds: tech blogs, Red Hat and OpenShift news, home automation, AI tracking… That’s between 100 and 300 new articles every day. Nobody reads that.
The usual solutions didn’t work for me:
- Feedly, Inoreader: cloud services that host your reading data, sell AI as a premium feature, and on which you are entirely dependent
- Newsletters: human editorial curation, with editorial bias and a fixed schedule you don’t control
- Reading FreshRSS directly: effective but time-consuming, with no automatic prioritisation
What I wanted was a tool that knows my interests, reads for me, gives me a summary of what matters — and runs on my own infrastructure.
The sovereignty requirement
I work daily with internal technical data: architectures, configurations, projects. Sending any of that to OpenAI or any other US cloud service subject to the CLOUD Act was out of the question.
My constraints:
- Self-hosted for critical components (database, orchestration)
- AI provided by a European vendor, with data staying in Europe
- No vendor lock-in on the embedding model
My answer: n8n + PostgreSQL/pgvector + Infomaniak AI.
The architecture
FreshRSS (self-hosted)
↓ GReader API
n8n (self-hosted) — Phase 1: Ingestion
↓ HTTP Request → Infomaniak AI (embeddings)
↓ SQL INSERT
PostgreSQL + pgvector (self-hosted)
↓
n8n — Phase 2: Digest (triggered automatically)
↓ SQL SELECT (cosine similarity)
↓ HTTP Request → Infomaniak AI (LLM summary)
↓ SMTP
Daily email
Everything except the AI calls runs on my own infrastructure. The only external calls go to Infomaniak, a Swiss hosting provider with infrastructure in Europe and native GDPR compliance.
The components
FreshRSS — the sovereign aggregator
FreshRSS is an open source, self-hostable RSS aggregator. It exposes a Google Reader-compatible API (GReader), making it easy to drive programmatically.
I use it to centralise my subscriptions into categories: IT News, Home Automation, etc. It’s my single entry point.
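As a rough illustration of what driving the GReader API looks like, here is a minimal Python sketch. It assumes the usual Google Reader conventions (a ClientLogin response made of key=value lines, a GoogleLogin authorization header, and the ot parameter as the oldest accepted timestamp). The function names and the example base URL are hypothetical, and the real workflow does this with n8n HTTP Request nodes rather than Python.

```python
import time
from typing import Optional
from urllib.parse import quote, urlencode


def parse_client_login(body: str) -> str:
    """Extract the Auth token from a ClientLogin response (key=value lines)."""
    for line in body.splitlines():
        key, _, value = line.partition("=")
        if key.strip() == "Auth":
            return value.strip()
    raise ValueError("no Auth line in ClientLogin response")


def auth_header(token: str) -> dict:
    """Header expected on subsequent GReader calls (assumed convention)."""
    return {"Authorization": f"GoogleLogin auth={token}"}


def stream_url(base_url: str, category: str, since_hours: int = 24,
               now: Optional[float] = None, limit: int = 200) -> str:
    """Build the URL listing a category's items newer than `since_hours`."""
    now = time.time() if now is None else now
    oldest = int(now - since_hours * 3600)
    # Categories are addressed as labels in the Google Reader scheme.
    stream_id = quote(f"user/-/label/{category}", safe="/-")
    query = urlencode({"output": "json", "n": limit, "ot": oldest})
    return f"{base_url.rstrip('/')}/reader/api/0/stream/contents/{stream_id}?{query}"
```

The base URL would typically end in /api/greader.php on a FreshRSS instance, and the category name must match the one defined in FreshRSS.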
n8n — the orchestrator
n8n is an open source, self-hostable workflow automation platform. Think Zapier or Make, but without the privacy limitations of a SaaS product.
The workflow runs in two phases that chain together automatically:
Phase 1 — Ingestion (7 AM)
- Authenticate with FreshRSS via the GReader API
- Fetch articles from the last 24 hours, by category
- Strip HTML, detect language (FR/EN by stopword counting)
- In-memory deduplication
- Generate one embedding per article via Infomaniak AI
- Insert into PostgreSQL with database-level deduplication (WHERE NOT EXISTS)
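The text-cleaning steps of Phase 1 can be sketched in a few lines of Python. This is only an illustration of the idea, not the workflow's actual code: the stopword lists are deliberately tiny, the function names are hypothetical, and using the URL as the deduplication key is an assumption.

```python
import hashlib
import re
from html import unescape

# Tiny illustrative stopword lists; a real setup would use longer ones.
FR_STOPWORDS = {"le", "la", "les", "des", "un", "une", "et", "est",
                "dans", "pour", "que", "qui", "sur", "avec", "du"}
EN_STOPWORDS = {"the", "and", "is", "of", "to", "in", "for", "that",
                "with", "this", "are", "it"}


def strip_html(raw: str) -> str:
    """Remove tags, decode entities, and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", raw)
    text = unescape(text)
    return re.sub(r"\s+", " ", text).strip()


def detect_language(text: str) -> str:
    """Return 'fr' or 'en' depending on which stopword list matches more."""
    words = re.findall(r"[a-zàâçéèêëîïôûùüÿœ]+", text.lower())
    fr = sum(word in FR_STOPWORDS for word in words)
    en = sum(word in EN_STOPWORDS for word in words)
    return "fr" if fr > en else "en"


def deduplicate(articles: list) -> list:
    """Keep the first article seen for each URL (or text hash if no URL)."""
    seen = set()
    unique = []
    for article in articles:
        key = article.get("url") or hashlib.sha256(
            article["text"].encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(article)
    return unique
```

Stopword counting is crude, but it is cheap and good enough to tell two languages apart on article-length text.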
Phase 2 — Digest (chained)
- A list of interest topics defined in a Set node
- Each interest is embedded separately
- Cosine similarity SQL query per interest (LIMIT 5 each)
- Aggregate, deduplicate, keep top 10
- HTML summary generated by the LLM
- Sent by SMTP
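The "aggregate, deduplicate, keep top 10" step could look roughly like the following Python sketch. It assumes each per-interest query returns rows with url, title and distance fields (these field names are hypothetical) and that an article matched by several interests should keep its best distance.

```python
def merge_results(per_interest: dict, top_n: int = 10) -> list:
    """Merge per-interest result lists into one ranked, deduplicated list.

    per_interest maps an interest label to a list of rows, each row being
    a dict with at least 'url' and 'distance' keys (assumed shape).
    """
    best = {}
    for interest, rows in per_interest.items():
        for row in rows:
            current = best.get(row["url"])
            # Keep the smallest distance when an article matches twice.
            if current is None or row["distance"] < current["distance"]:
                best[row["url"]] = {**row, "interest": interest}
    # Smaller cosine distance means more relevant.
    return sorted(best.values(), key=lambda r: r["distance"])[:top_n]
```

Tagging each surviving row with the interest that matched it best also gives the LLM a hint for grouping articles by topic.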
PostgreSQL + pgvector — the vector memory
pgvector is a PostgreSQL extension that adds a VECTOR type and distance operators. It turns Postgres into a vector database with no additional infrastructure.
The table is minimal:
CREATE TABLE rss_articles (
id UUID DEFAULT gen_random_uuid() PRIMARY KEY,
text TEXT NOT NULL,
metadata JSONB DEFAULT '{}',
embedding VECTOR(3584)
);
The metadata column stores the title, URL, source, language, and Unix timestamp. The embedding column stores the 3584-dimensional vector.
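The database-level deduplication mentioned in Phase 1 could be written against this table roughly as follows. This Python sketch only builds the statement and its parameters. The metadata key names (url in particular) are assumptions, the placeholders follow the psycopg named style, and the helper names are hypothetical.

```python
import json

# Insert only if no row with the same URL exists (the 'url' key is assumed).
INSERT_SQL = """
INSERT INTO rss_articles (text, metadata, embedding)
SELECT %(text)s, %(metadata)s::jsonb, %(embedding)s::vector
WHERE NOT EXISTS (
    SELECT 1 FROM rss_articles WHERE metadata->>'url' = %(url)s
);
"""


def to_vector_literal(embedding) -> str:
    """Format a list of numbers as a pgvector text literal, e.g. [0.5,-1.0]."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


def build_insert(text: str, metadata: dict, embedding) -> tuple:
    """Return (sql, params) ready for a cursor.execute(sql, params) call."""
    params = {
        "text": text,
        "metadata": json.dumps(metadata),
        "embedding": to_vector_literal(embedding),
        "url": metadata["url"],
    }
    return INSERT_SQL, params
```

With a PostgreSQL driver such as psycopg, the returned pair can be passed directly to cursor.execute. In the real table the embedding must have exactly 3584 values.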
Infomaniak AI — the brain, in Europe
Infomaniak is a Swiss hosting provider offering an OpenAI-compatible AI API. Two models are used here:
- BGE-Multilingual-Gemma2 (9B params, 3584 dims) for embeddings — natively multilingual FR/EN
- Gemma 4 31B for generating the HTML summary
Everything stays in Europe. No data transits through US servers.
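Because the API is OpenAI-compatible, an embedding call boils down to one JSON POST. The Python sketch below builds such a request and reads the vector out of the response, using only the standard library. The base URL and token are placeholders, and the exact model identifier must be taken from the provider's documentation; none of these values come from the actual workflow.

```python
import json
from urllib.request import Request


def build_embedding_request(base_url: str, api_token: str,
                            model: str, text: str) -> Request:
    """Build a POST request for an OpenAI-compatible /embeddings endpoint."""
    payload = {"model": model, "input": text}
    return Request(
        url=f"{base_url.rstrip('/')}/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def parse_embedding(response_body: bytes) -> list:
    """Extract the first embedding vector from the JSON response."""
    document = json.loads(response_body)
    return document["data"][0]["embedding"]


# Sending the request (not executed here):
#   from urllib.request import urlopen
#   with urlopen(request) as response:
#       vector = parse_embedding(response.read())
```

In n8n the same call is a single HTTP Request node, with the token stored as a credential.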
The core: similarity search
This is where the magic happens. Every article is stored as a vector of 3584 numbers — a mathematical representation of its meaning. Each interest topic is also converted into a vector.
The search finds the articles whose vectors are closest to the interest’s vector. I use cosine distance (the <=> operator in pgvector), which is 1 minus the cosine of the angle between two vectors. It ignores vector length, which makes it the usual choice for text embeddings.
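-- Metadata key names other than published_at are assumed here.
SELECT metadata->>'title' AS title,
       metadata->>'url' AS url,
       (embedding <=> '[...]'::vector)
         * CASE WHEN metadata->>'language' = 'fr' THEN 0.75 ELSE 1.0 END AS distance
FROM rss_articles
WHERE to_timestamp((metadata->>'published_at')::bigint) > NOW() - INTERVAL '30 days'
ORDER BY distance ASC
LIMIT 5;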
The 0.75 multiplier shrinks the distance of French articles by 25%, so they rank slightly higher. It is a soft relevance bonus to compensate for any model bias toward English.
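To make the ranking concrete, here is a small Python sketch of the same computation outside the database: cosine distance as 1 minus cosine similarity, followed by the language multiplier. The function names are hypothetical and the vectors are toy values, not real embeddings.

```python
import math


def cosine_distance(a, b) -> float:
    """1 - cosine similarity, the quantity pgvector's <=> operator returns."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def weighted_distance(distance: float, language: str,
                      fr_factor: float = 0.75) -> float:
    """Apply the soft bonus: French articles get their distance reduced."""
    return distance * fr_factor if language == "fr" else distance


# Orthogonal vectors are at distance 1, parallel vectors at distance ~0,
# regardless of their lengths.
unrelated = cosine_distance([1.0, 0.0], [0.0, 1.0])
same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])
```

With this weighting, a French article at raw distance 0.40 ends up around 0.30 and outranks an English article at 0.32.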
Why one embedding per interest?
I could have concatenated all my interests into a single string and generated one vector. Simpler, but far less effective.
Consider: “electric car” and “software supply chain” are radically different topics. An average vector represents neither correctly — it lands somewhere in the middle of nowhere. By embedding each interest separately and running N queries, each topic finds its own relevant articles.
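A toy example in Python shows the effect. The three-dimensional vectors below are invented for illustration (real embeddings have 3584 dimensions), and the article names are hypothetical: two on-topic articles and one vague article that sits between the two topics.

```python
import math


def cosine_distance(a, b) -> float:
    """1 - cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.sqrt(sum(x * x for x in a))
                        * math.sqrt(sum(y * y for y in b)))


# Toy "embeddings": axis 0 ~ electric cars, axis 1 ~ software supply chain.
ARTICLES = {
    "ev_review": [1.0, 0.1, 0.0],
    "sbom_guide": [0.1, 1.0, 0.0],
    "generic_tech_roundup": [0.7, 0.7, 0.3],
}
INTEREST_EV = [1.0, 0.0, 0.0]
INTEREST_SUPPLY_CHAIN = [0.0, 1.0, 0.0]


def nearest(query) -> str:
    """Name of the article with the smallest cosine distance to the query."""
    return min(ARTICLES, key=lambda name: cosine_distance(query, ARTICLES[name]))


# One averaged vector for both interests.
averaged = [(x + y) / 2 for x, y in zip(INTEREST_EV, INTEREST_SUPPLY_CHAIN)]

best_for_average = nearest(averaged)
best_per_interest = [nearest(INTEREST_EV), nearest(INTEREST_SUPPLY_CHAIN)]
```

In this toy setup the averaged query prefers the vague roundup, while the two separate queries each surface their own on-topic article.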
Why BGE-Multilingual-Gemma2?
The choice of embedding model is critical. I started with all-MiniLM-L12-v2: an English-only model with 384 dimensions, fast and popular. The result: my French queries produced incoherent vectors, and the articles surfaced had no connection to my actual interests.
BGE-Multilingual-Gemma2 (BAAI) changes everything: 9 billion parameters, 3584 dimensions, natively trained on French and English. Results are immediately relevant.
⚠️ The model must be identical between ingestion and search. Switching models means recomputing all existing embeddings.
The output
Every morning around 7:30 AM I receive a structured email:
- IT News section: summaries of Red Hat, OpenShift, DevSecOps, AI articles…
- Home Automation section: Home Assistant, new devices…
- For each article: 2-sentence summary in French + clickable link
- A final recommendation from the LLM
All rendered as clean HTML, readable in any email client.
Lessons learned
Cosine distance, not Euclidean. The <-> operator in pgvector measures Euclidean (L2) distance, which is sensitive to vector length and can rank results differently unless the embeddings are normalised. <=> measures the angle, which is what matters for text embeddings.
LangChain can betray you. I originally used n8n’s native LangChain nodes for ingestion. The document loader treated JSON as raw text and fragmented each article into a dozen separate rows. Bypassing LangChain entirely — direct HTTP calls + raw SQL — fixed it completely.
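Timestamps in the database. FreshRSS returns dates as Unix timestamps (seconds), and the ->> operator hands them back from the JSONB metadata as text. PostgreSQL has no cast from integer to timestamptz, so the value has to go through to_timestamp(). The correct conversion: to_timestamp((metadata->>'published_at')::bigint).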
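For checking values by hand, the same conversion can be reproduced in Python. This is only a sketch for sanity checks outside the database. It assumes the published_at key holds seconds since the epoch, possibly as a string, and the helper names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def published_at(metadata: dict) -> datetime:
    """Convert the stored Unix timestamp (seconds) to an aware UTC datetime."""
    return datetime.fromtimestamp(int(metadata["published_at"]), tz=timezone.utc)


def is_recent(metadata: dict, now: datetime, days: int = 30) -> bool:
    """Mirror of the SQL filter: published within the last `days` days."""
    return published_at(metadata) > now - timedelta(days=days)
```

For instance, the timestamp 1700000000 corresponds to 2023-11-14 22:13:20 UTC.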
pgvector and high dimensions. The ivfflat and hnsw indexes only support the vector type up to 2,000 dimensions. BGE produces 3584 dims, so no index is possible without the halfvec type (pgvector ≥ 0.7.0), which can be indexed up to 4,000 dimensions at half precision. For a few hundred articles, a sequential scan is perfectly fine.
Going further
The workflow is available on GitHub. To adapt it:
- Replace FreshRSS with any source: n8n’s native RSS node, an HTTP call to another aggregator
- Change the AI provider: any OpenAI-compatible API works (Mistral, local Ollama…) — the key is to keep the same model between ingestion and search
- Adjust interests: a plain text field in the workflow, changeable without touching any code
The full stack runs on a modest VPS. It’s the tech monitoring setup I wish I’d had for the past ten years.