High-volume multi-source data pipeline
A national event-aggregator platform
The problem
The client needed to ingest event listings from 200+ disparate sources (municipal calendars, library systems, museum sites, third-party feeds), normalize them into a single structured format, deduplicate across sources, and emit clean records ready for downstream consumption.
The approach
We designed a production pipeline that treats source heterogeneity as a first-class concern: per-source adapters, a unified intermediate schema, an LLM-assisted normalization layer for messy fields, and a deduplication pass that handles fuzzy matches across name, time, and venue. Observability is built in: every source has a freshness signal and a quality score.
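As a rough illustration of how per-source adapters can feed a unified intermediate schema, here is a minimal sketch. It is not the production code: the names EventRecord, SourceAdapter, DictFeedAdapter, and collect, along with every field name, are hypothetical and chosen only for this example.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, Protocol


@dataclass(frozen=True)
class EventRecord:
    """Hypothetical unified intermediate schema; field names are assumptions."""
    source_id: str
    title: str
    starts_at: datetime  # timezone-aware, converted to UTC by the adapter
    venue: str
    url: str | None = None


class SourceAdapter(Protocol):
    """Interface every source format implements (hypothetical)."""
    source_id: str

    def fetch(self) -> Iterable[EventRecord]: ...


class DictFeedAdapter:
    """Toy adapter over already-parsed rows, standing in for an API or ICS source."""

    def __init__(self, source_id: str, rows: list[dict[str, str]]) -> None:
        self.source_id = source_id
        self._rows = rows

    def fetch(self) -> Iterable[EventRecord]:
        for row in self._rows:
            # Map source-specific keys onto the shared schema.
            yield EventRecord(
                source_id=self.source_id,
                title=row["name"].strip(),
                starts_at=datetime.fromisoformat(row["start"]).astimezone(timezone.utc),
                venue=row["location"].strip(),
                url=row.get("link"),
            )


def collect(adapters: Iterable[SourceAdapter]) -> list[EventRecord]:
    """Run every adapter and pool the results into one list."""
    records: list[EventRecord] = []
    for adapter in adapters:
        records.extend(adapter.fetch())
    return records
```

In a layout like this, everything downstream of collect sees only EventRecord, so adding a source would mean writing one new adapter class and nothing else.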
Architecture
- Source adapters: One per source format, covering scraping (Playwright), API polling, ICS feed parsing, and vendor-specific exports.
- Normalization layer: LLM-assisted for messy free-text fields, with output validated against the unified schema.
- Deduplication: Fuzzy matching on name + time window + venue.
- Output: Clean structured records emitted to the client’s primary database.
- Monitoring: Per-source freshness, per-source quality score, alerting on drift.
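The deduplication step listed above can be sketched with the standard library alone. This is an assumption-laden illustration, not the pipeline's actual matcher: it swaps in difflib.SequenceMatcher as the similarity measure, and the thresholds, the 30-minute window, and the names Event, is_duplicate, and deduplicate are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Event:
    """Minimal record for the example; field names are assumptions."""
    name: str
    starts_at: datetime
    venue: str


def _similar(a: str, b: str) -> float:
    """Case-insensitive, whitespace-trimmed similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def is_duplicate(
    a: Event,
    b: Event,
    name_threshold: float = 0.85,   # assumed cutoff, would need tuning
    venue_threshold: float = 0.80,  # assumed cutoff, would need tuning
    window: timedelta = timedelta(minutes=30),  # assumed time window
) -> bool:
    """Two events match if they start within the window and both name and venue are similar."""
    if abs(a.starts_at - b.starts_at) > window:
        return False
    return (
        _similar(a.name, b.name) >= name_threshold
        and _similar(a.venue, b.venue) >= venue_threshold
    )


def deduplicate(events: list[Event]) -> list[Event]:
    """Keep the first occurrence of each event; drop later fuzzy matches."""
    kept: list[Event] = []
    for event in events:
        if not any(is_duplicate(event, existing) for existing in kept):
            kept.append(event)
    return kept
```

This version compares every event against every kept event, which is quadratic. At the volumes described here, a real implementation would likely first group candidates by date or venue and only run the fuzzy comparison within each group.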
Results
- 100,000+ records/month ingested, normalized, and emitted.
- 200+ sources in production, with new sources added in under a day.
- Client’s editorial team reclaimed approximately 30 hours/week previously spent on manual review.
Stack
Python, Playwright, LLM-assisted normalization, PostgreSQL, scheduled workers on AWS.