Data · Engineering · AI
Vendor intelligence feed
An automated pipeline that tracks feature launches, pricing changes, partnerships, and architectural shifts from 15 data integration companies — extracted daily by Claude and surfaced in a live filterable feed.
The data integration space moves fast. Fivetran ships a new connector, dbt Labs drops a pricing tier, Airbyte announces a partnership — and that information is scattered across blog posts, GitHub release notes, and press pages with no easy way to track it at scale. This pipeline changes that.
Every morning, a GitHub Actions job scrapes the blog and GitHub releases pages for 15 companies, sends the raw content to Claude Haiku for structured extraction, and upserts the results into Neon Postgres. A Next.js page reads from Postgres and renders the feed — filterable by company and entity type, with state reflected in the URL.
How it works
The pipeline runs in three stages:
- Scrape: getSourceUrls() assembles a list of URLs per company, preferring RSS feeds where available and falling back to the blog root. For companies with GitHub release notes, the releases page is also included. Each page is fetched and content-hashed (SHA-256). If the hash matches what's already in Postgres, the page is skipped, avoiding unnecessary extraction.
- Extract: Pending pages are passed to Claude Haiku via a single tool_use call that returns all four entity types at once: feature launches, pricing changes, partnerships, and architectural shifts. The tool schema enforces date format, required fields, and company attribution. Missing or malformed fields default gracefully rather than abort the run.
- Upsert: Extracted entities are written to four typed tables (vf_feature_launches, vf_pricing_changes, vf_partnerships, vf_architectural_shifts). Duplicate source URLs are resolved by updating only when the content hash changes.
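The three stages can be sketched end to end as follows. This is a minimal illustration, not the pipeline's actual code: an in-memory Map stands in for Postgres, the extractor is a stub in place of the Claude Haiku call, and every name here (hashContent, normalizeEntity, processPage, the Entity shape, the "(untitled)" default) is a hypothetical chosen for the example.

```typescript
import { createHash } from "node:crypto";

// Hypothetical shapes; the real table columns and tool schema are not shown in the write-up.
type EntityType = "feature_launch" | "pricing_change" | "partnership" | "architectural_shift";

interface Entity {
  type: EntityType;
  company: string;
  title: string;
  date: string | null; // YYYY-MM-DD, or null when missing or malformed
  sourceUrl: string;
}

interface StoredPage {
  contentHash: string;
  entities: Entity[];
}

// Stand-in for the model call: returns loosely typed records per entity type.
type Extractor = (body: string) => Array<{ type: EntityType; fields: Record<string, unknown> }>;

type Outcome = "skipped" | "inserted" | "updated";

// Stage 1: SHA-256 content hash, hex-encoded.
function hashContent(body: string): string {
  return createHash("sha256").update(body, "utf8").digest("hex");
}

// Stage 2: default missing or malformed fields instead of throwing,
// so one bad record cannot abort the run.
function normalizeEntity(
  fields: Record<string, unknown>,
  type: EntityType,
  company: string,
  sourceUrl: string,
): Entity {
  const rawDate = fields["date"];
  const rawTitle = fields["title"];
  const date = typeof rawDate === "string" && /^\d{4}-\d{2}-\d{2}$/.test(rawDate) ? rawDate : null;
  const title = typeof rawTitle === "string" && rawTitle.trim() !== "" ? rawTitle.trim() : "(untitled)";
  return { type, company, title, date, sourceUrl };
}

// Stage 3: keyed by source URL; extract and write only when the hash differs.
function processPage(
  store: Map<string, StoredPage>,
  company: string,
  url: string,
  body: string,
  extract: Extractor,
): Outcome {
  const contentHash = hashContent(body);
  const existing = store.get(url);
  if (existing !== undefined && existing.contentHash === contentHash) {
    return "skipped"; // unchanged page: no extraction call, no write
  }
  const entities = extract(body).map((r) => normalizeEntity(r.fields, r.type, company, url));
  store.set(url, { contentHash, entities });
  return existing === undefined ? "inserted" : "updated";
}
```

Calling processPage twice with the same body for the same URL invokes the extractor only once, which is the point of hashing before extraction: unchanged pages cost a fetch and a hash, not a model call.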
A separate weekly cron job resets any failed pages to pending and re-runs extraction, so transient scrape failures are recovered without manual intervention.
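The reset-and-retry behaviour might look roughly like the sketch below. It works on in-memory records rather than Postgres rows, and the PageRecord shape, the status values, the attempts counter, and both function names are assumptions made for illustration.

```typescript
// Hypothetical page status model; the real column names are not given in the write-up.
type PageStatus = "pending" | "done" | "failed";

interface PageRecord {
  url: string;
  status: PageStatus;
  attempts: number;
}

// Returns a new list in which failed pages are pending again; other pages are untouched.
function resetFailedPages(pages: PageRecord[]): PageRecord[] {
  return pages.map((p): PageRecord => (p.status === "failed" ? { ...p, status: "pending" } : p));
}

// Resets failures, then re-runs extraction for every pending page.
// tryExtract is a stand-in that reports whether extraction succeeded for a URL.
function retryFailed(pages: PageRecord[], tryExtract: (url: string) => boolean): PageRecord[] {
  return resetFailedPages(pages).map((p): PageRecord =>
    p.status === "pending"
      ? { ...p, status: tryExtract(p.url) ? "done" : "failed", attempts: p.attempts + 1 }
      : p,
  );
}
```

Pages that fail again simply stay failed and are picked up by the next weekly run.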
The feed
The live feed shows everything the pipeline has extracted. Filters for company and entity type are multi-select and URL-reflected — so filtered views are shareable. The company list is derived from actual data, so companies with no extracted entities don't appear as empty filter options.
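One way to keep multi-select filters in the URL is to repeat a query parameter per selected value, as in the sketch below. The parameter names company and type, the FeedFilters shape, and the helper names are assumptions for this example; the page's real parameter scheme is not shown here.

```typescript
// Hypothetical filter state for the feed page.
interface FeedFilters {
  companies: string[];
  types: string[];
}

// Reads repeated query parameters, e.g. "?company=Airbyte&company=Fivetran&type=pricing_change".
function parseFilters(search: string): FeedFilters {
  const params = new URLSearchParams(search);
  return { companies: params.getAll("company"), types: params.getAll("type") };
}

// Writes filters back to a query string. Values are sorted so the same
// selection always produces the same shareable URL.
function serializeFilters(filters: FeedFilters): string {
  const params = new URLSearchParams();
  for (const company of filters.companies.slice().sort()) params.append("company", company);
  for (const type of filters.types.slice().sort()) params.append("type", type);
  return params.toString();
}

// Derives filter options from the rows themselves, so a company with
// no extracted entities never shows up as an empty option.
function companyOptions(rows: Array<{ company: string }>): string[] {
  return Array.from(new Set(rows.map((r) => r.company))).sort();
}
```

Sorting before serializing is a design choice rather than a requirement: it means two users who pick the same filters in a different order end up with identical links.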
The first pipeline run returned 321 entities across 12 companies. The retry worker recovered all 3 initially failed pages on its first pass.
What's next
The pipeline is accumulating data daily. A few things are planned for phase 2:
- Payload cap — the feed query is currently uncapped; a LIMIT + total count label will keep the RSC payload bounded as the dataset grows.
- Pipeline status card — the existing Pipeline Dashboard will get a vendor feed card showing last run time, entity count, and health.
- Validations — record count floors, staleness checks, and deduplication audits to surface data quality issues rather than silently pass bad data.
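Since the validations are still planned, the following is only a sketch of what the three checks could look like. The FeedRow shape, the thresholds, the choice of source URL plus title as the duplicate key, and the function name validateFeed are all assumptions, not part of the existing pipeline.

```typescript
// Hypothetical row shape for the feed; real columns may differ.
interface FeedRow {
  company: string;
  sourceUrl: string;
  title: string;
  extractedAt: Date;
}

interface ValidationReport {
  belowFloor: boolean;     // fewer rows than the configured minimum
  stale: boolean;          // newest row is older than the allowed age
  duplicateKeys: string[]; // keys that appear more than once
}

function validateFeed(
  rows: FeedRow[],
  opts: { minRows: number; maxAgeHours: number; now: Date },
): ValidationReport {
  // Staleness: age of the most recent extraction. An empty feed counts as stale.
  const newest = rows.reduce((max, r) => Math.max(max, r.extractedAt.getTime()), -Infinity);
  const ageHours = (opts.now.getTime() - newest) / (60 * 60 * 1000);

  // Deduplication audit: the same source URL and title appearing twice.
  const seen = new Set<string>();
  const dupes = new Set<string>();
  for (const r of rows) {
    const key = r.sourceUrl + "|" + r.title;
    if (seen.has(key)) dupes.add(key);
    else seen.add(key);
  }

  return {
    belowFloor: rows.length < opts.minRows,
    stale: ageHours > opts.maxAgeHours,
    duplicateKeys: Array.from(dupes),
  };
}
```

Returning a report instead of throwing lets the caller decide whether a failed check should block a run or only raise a warning on the dashboard.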