High-volume multi-source data pipeline

The problem

The client needed to ingest event listings from 200+ disparate sources (municipal calendars, library systems, museum sites, third-party feeds), normalize them into a single structured format, deduplicate across sources, and emit clean records ready for downstream consumption.

The approach

We designed a production pipeline that treats source heterogeneity as a first-class concern: per-source adapters, a unified intermediate schema, an LLM-assisted normalization layer for messy fields, and a deduplication pass that handles fuzzy matches across name, time, and venue. Observability is built in: every source has a freshness signal and a quality score.
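As a rough illustration of the adapter-plus-shared-schema idea, here is a minimal Python sketch. Everything named in it (EventRecord, SourceAdapter, InMemoryAdapter, validate, run, and the field names) is hypothetical and chosen for the example; it is not the client's actual schema or code. The quality score here is assumed to be the share of a source's records that pass validation, which is only one possible definition.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Protocol


@dataclass(frozen=True)
class EventRecord:
    """Hypothetical unified intermediate schema shared by all adapters."""
    source_id: str
    name: str
    start: datetime
    venue: str


class SourceAdapter(Protocol):
    """Interface every per-source adapter is assumed to implement."""
    source_id: str

    def fetch(self) -> Iterable[EventRecord]: ...


class InMemoryAdapter:
    """Stand-in adapter; a real one would scrape, poll an API, or parse ICS."""

    def __init__(self, source_id, rows):
        self.source_id = source_id
        self._rows = rows

    def fetch(self):
        # Map source-specific field names onto the shared schema.
        for row in self._rows:
            yield EventRecord(
                source_id=self.source_id,
                name=row["title"].strip(),
                start=datetime.fromisoformat(row["starts_at"]),
                venue=row.get("location", "").strip(),
            )


def validate(record: EventRecord) -> list[str]:
    """Return schema problems; an empty list means the record passes."""
    problems = []
    if not record.name:
        problems.append("missing name")
    if record.start.tzinfo is None:
        problems.append("start time has no timezone")
    if not record.venue:
        problems.append("missing venue")
    return problems


def run(adapters):
    """Collect valid records plus an assumed per-source quality score."""
    clean, quality = [], {}
    for adapter in adapters:
        records = list(adapter.fetch())
        valid = [r for r in records if not validate(r)]
        quality[adapter.source_id] = len(valid) / len(records) if records else 0.0
        clean.extend(valid)
    return clean, quality
```

Under this shape, adding a source means writing one new adapter class; validation, scoring, and everything downstream stay unchanged.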

Architecture

[ Architecture sketch placeholder, replace with diagram when ready. ]

Source adapters: One per source format, covering scraping (Playwright), API polling, ICS feed parsing, and vendor-specific exports.
Normalization layer: LLM-assisted for messy free-text fields. Schema-checked output.
Deduplication: Fuzzy matching on name + time window + venue.
Output: Clean structured records emitted to the client’s primary database.
Monitoring: Per-source freshness, per-source quality score, alerting on drift.
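The deduplication step could be sketched roughly as follows. This is an illustration only: it uses Python's standard-library difflib.SequenceMatcher as a stand-in for whatever matcher the production system uses, and the thresholds, the 30-minute window, and the function names are assumptions made for the example.

```python
from datetime import datetime, timedelta
from difflib import SequenceMatcher


def _norm(text):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def similarity(a, b):
    """String similarity in [0, 1] based on matching character blocks."""
    return SequenceMatcher(None, _norm(a), _norm(b)).ratio()


def is_duplicate(a, b, name_threshold=0.85, venue_threshold=0.8,
                 window=timedelta(minutes=30)):
    """Compare two records (dicts with 'name', 'start', 'venue').

    Thresholds and window are illustrative assumptions, not tuned values.
    """
    # Cheap check first: start times must fall inside the window.
    if abs(a["start"] - b["start"]) > window:
        return False
    return (similarity(a["name"], b["name"]) >= name_threshold
            and similarity(a["venue"], b["venue"]) >= venue_threshold)


def deduplicate(records):
    """Keep the first record of each duplicate group (quadratic; sketch only)."""
    kept = []
    for rec in records:
        if not any(is_duplicate(rec, k) for k in kept):
            kept.append(rec)
    return kept
```

Checking the time window before the string comparisons keeps most pairs cheap to reject. At production volume, a pairwise scan like this would normally be replaced by blocking, for example comparing only records that share a date and venue.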

Results

100,000+ records/month ingested, normalized, and emitted.
200+ sources in production, with new sources added in under a day.
Client’s editorial team reclaimed approximately 30 hours/week previously spent on manual review.

Stack

Python, Playwright, LLM-assisted normalization, PostgreSQL, scheduled workers on AWS.

Bring us the hard problem.