Data · Engineering · AI
Vendor intelligence feed
An automated pipeline that tracks feature launches, pricing changes, partnerships, and architectural shifts from 15 data integration companies — extracted daily by Claude and surfaced in a live filterable feed.
The data integration space moves fast. Fivetran ships a new connector, dbt Labs drops a pricing tier, Airbyte announces a partnership — and that information is scattered across blog posts, GitHub release notes, and press pages with no easy way to track it at scale. This pipeline changes that.
Every morning, a GitHub Actions job scrapes the blog and GitHub releases pages for 15 companies, sends the raw content to Claude Haiku for structured extraction, and upserts the results into Neon Postgres. A Next.js page reads from Postgres and renders the feed — filterable by company and entity type, with state reflected in the URL.
How it works
The pipeline runs in three stages:
- Scrape: getSourceUrls() assembles a list of URLs per company, preferring RSS feeds where available and falling back to the blog root. For companies with GitHub release notes, the releases page is also included. Each page is fetched and content-hashed (SHA-256). If the hash matches what's already in Postgres, the page is skipped, so no unnecessary extraction runs.
- Extract: Pending pages are passed to Claude Haiku via a single tool_use call that returns all four entity types at once: feature launches, pricing changes, partnerships, and architectural shifts. The tool schema enforces date format, required fields, and company attribution. Missing or malformed fields default gracefully rather than abort the run.
- Upsert: Extracted entities are written to four typed tables (vf_feature_launches, vf_pricing_changes, vf_partnerships, vf_architectural_shifts). Duplicate source URLs are resolved by updating only when the content hash changes.
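The hash check that drives both the scrape skip and the upsert rule can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: contentHash, planPage, and the "skip"/"extract" action names are hypothetical, and the stored hash is passed in directly rather than read from Postgres.

```typescript
import { createHash } from "node:crypto";

// Hypothetical action names; the real pipeline may label these differently.
type PageAction = "skip" | "extract";

// Hex-encoded SHA-256 of the fetched page body.
function contentHash(body: string): string {
  return createHash("sha256").update(body, "utf8").digest("hex");
}

// storedHash is the hash already saved for this URL, or undefined for a new URL.
// A page is only sent on for extraction when its content has changed.
function planPage(
  body: string,
  storedHash: string | undefined
): { action: PageAction; hash: string } {
  const hash = contentHash(body);
  return { action: hash === storedHash ? "skip" : "extract", hash };
}
```

A caller would fetch the page, look up the stored hash for its URL, and only queue the page for extraction when planPage returns "extract". The returned hash is what would be written back alongside the page.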
A separate weekly cron resets any failed pages back to pending and re-runs extraction — so transient scrape failures are recovered automatically without manual intervention.
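As a rough sketch of the reset step, the status transition might look like the following. The RawPage shape, the status values other than "failed" and "pending", and the resetFailedPages name are assumptions for illustration. In the real pipeline this would presumably be a single UPDATE against Postgres rather than an in-memory map.

```typescript
// Hypothetical status set; only "failed" and "pending" are named in the text.
type PageStatus = "pending" | "extracted" | "failed";

interface RawPage {
  url: string;
  status: PageStatus;
}

// Returns a new list where every failed page is marked pending again,
// plus the number of pages that were reset. The input is not mutated.
function resetFailedPages(pages: RawPage[]): { pages: RawPage[]; resetCount: number } {
  let resetCount = 0;
  const next = pages.map((page): RawPage => {
    if (page.status !== "failed") return page;
    resetCount += 1;
    return { ...page, status: "pending" };
  });
  return { pages: next, resetCount };
}
```

After the reset, the normal extraction step picks the pages up again because it only looks at pending rows, so no separate retry code path is needed.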
The feed
The live feed shows everything the pipeline has extracted. Filters for company and entity type are multi-select and URL-reflected — so filtered views are shareable. The company list is derived from actual data, so companies with no extracted entities don't appear as empty filter options.
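One way to keep multi-select filters in the URL is to use repeated query parameters. The sketch below assumes parameter names company and type, and the FeedFilters shape and helper names are hypothetical. The page's real parameter names are not stated in this write-up.

```typescript
interface FeedFilters {
  companies: string[];
  entityTypes: string[];
}

// Reads repeated params, e.g. "?company=fivetran&company=airbyte&type=partnership".
function parseFilters(search: string): FeedFilters {
  const params = new URLSearchParams(search);
  return {
    companies: params.getAll("company"),
    entityTypes: params.getAll("type"),
  };
}

// Serialises filters in sorted order so the same selection always yields the same URL.
function toQueryString(filters: FeedFilters): string {
  const params = new URLSearchParams();
  for (const company of [...filters.companies].sort()) params.append("company", company);
  for (const entityType of [...filters.entityTypes].sort()) params.append("type", entityType);
  return params.toString();
}

// Builds the company filter options from rows that actually exist,
// so companies with no extracted entities never show up as empty options.
function companiesWithData(entities: { company: string }[]): string[] {
  return [...new Set(entities.map((e) => e.company))].sort();
}
```

Sorting before serialising is a design choice, not a requirement: it makes two users who pick the same filters in a different order end up with an identical, shareable URL.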
The first pipeline run returned 321 entities across 12 companies. The retry worker recovered all 3 initially failed pages on its first pass.
Stack
What's next
Phase 2 — Article-level crawling
The current pipeline fetches a single blog page per company (a blog index or RSS feed), which means Claude extracts entities from excerpt blurbs rather than full article text. The result is missing dates, partial descriptions, and lower confidence on partnerships and architectural shifts. Phase 2 replaces the index-page scrape with full article ingestion.
Design decisions, resolved:
- Discovery strategy: RSS-first. Companies with rss_url configured get article-level crawling; companies without RSS fall back to the existing single-page blog index scrape until a feed is found. GitHub releases pages stay as single-page scrapes, since they already render full release content inline.
- Recency window: Only articles published within the last 90 days are fetched. Older items are filtered during RSS parsing, before any HTTP requests are made.
- Pipeline shape: Discovery folds into the scrape step. The external pipeline shape (scrape → extract → upsert) stays unchanged; scrapeCompany() internally parses the RSS feed, collects article URLs, and fetches them in the same pass.
- Concurrency: Article fetches run with a global cap of 5 concurrent requests to avoid overloading vendor servers or tripping their rate limits. Steady-state daily volume (10–20 new articles across all companies) makes this mostly a first-run concern.
- Index pages dropped: Blog index pages are no longer stored in vf_raw_pages. They're used transiently to discover article URLs and discarded. One row per article, keyed on article URL.
- Clean migration: Existing entity data was extracted from index blurbs and is lower quality. A dedicated migrate-vendor-feed-reset.ts script truncates all entity tables and clears vf_raw_pages before the first article-crawl run. Everything re-extracts from full article text.
- Model: Stays on Claude Haiku. The quality improvement from full articles outweighs any gain from a larger model on the same structured extraction task.
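Two of the decisions above, the 90-day recency window and the global cap of 5 concurrent fetches, lend themselves to a short sketch. Everything named here (FeedItem, withinWindow, mapWithLimit) is hypothetical and only illustrates one common way to do it. The write-up does not describe how scrapeCompany() is actually implemented.

```typescript
// Hypothetical shape of an item parsed from an RSS feed.
interface FeedItem {
  url: string;
  publishedAt: Date;
}

const RECENCY_DAYS = 90;
const MAX_CONCURRENT = 5;

// Drops items older than the window before any article is fetched.
function withinWindow(items: FeedItem[], now: Date, days: number = RECENCY_DAYS): FeedItem[] {
  const cutoff = now.getTime() - days * 24 * 60 * 60 * 1000;
  return items.filter((item) => item.publishedAt.getTime() >= cutoff);
}

// Runs fn over items with at most `limit` calls in flight at once.
// Results keep the order of the input array.
async function mapWithLimit<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0;
  async function worker(): Promise<void> {
    while (next < items.length) {
      const index = next++; // claim the next item; safe because JS is single-threaded
      results[index] = await fn(items[index]);
    }
  }
  const workers = Array.from({ length: Math.min(limit, items.length) }, () => worker());
  await Promise.all(workers);
  return results;
}
```

In use, the filtered list would be passed to mapWithLimit with MAX_CONCURRENT and a fetch function. Because the cap is meant to be global, a single call would need to cover articles from all companies rather than one call per company. Note that this sketch rejects on the first failed fetch; a real run would more likely catch per-article errors and mark those pages as failed.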
Also planned:
- Payload cap — the feed query is currently uncapped; a LIMIT + total count label will keep the RSC payload bounded as the dataset grows.
- Pipeline status card — the existing Pipeline Dashboard will get a vendor feed card showing last run time, entity count, and health.
- Validations — record count floors, staleness checks, and deduplication audits to surface data quality issues rather than silently pass bad data.
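The planned validations could take roughly the following shape. This is only a sketch under assumed names: FeedEntity, AuditOptions, auditFeed, and the choice of source URL plus title as the duplicate key are all hypothetical, and the thresholds would need tuning against real data.

```typescript
// Hypothetical minimal view of an extracted entity.
interface FeedEntity {
  company: string;
  sourceUrl: string;
  title: string;
  extractedAt: Date;
}

interface AuditOptions {
  minCount: number;    // record count floor
  maxAgeHours: number; // staleness threshold for the newest entity
  now: Date;
}

// Returns a list of human-readable issues; an empty list means the feed looks healthy.
function auditFeed(entities: FeedEntity[], opts: AuditOptions): string[] {
  const issues: string[] = [];

  if (entities.length < opts.minCount) {
    issues.push(`count ${entities.length} is below floor ${opts.minCount}`);
  }

  if (entities.length > 0) {
    const newest = Math.max(...entities.map((e) => e.extractedAt.getTime()));
    const ageHours = (opts.now.getTime() - newest) / (60 * 60 * 1000);
    if (ageHours > opts.maxAgeHours) {
      issues.push(`stale: newest entity is ${Math.floor(ageHours)}h old`);
    }
  }

  // Assumed duplicate key: same source URL and same normalised title.
  const seen = new Set<string>();
  let duplicates = 0;
  for (const e of entities) {
    const key = `${e.sourceUrl}|${e.title.trim().toLowerCase()}`;
    if (seen.has(key)) duplicates += 1;
    else seen.add(key);
  }
  if (duplicates > 0) issues.push(`duplicates: ${duplicates}`);

  return issues;
}
```

Returning issues as data rather than throwing lets the same function feed both a failing CI check and a status card, which fits the goal of surfacing problems instead of silently passing bad data.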