Data · Engineering · AI

Vendor intelligence feed

An automated pipeline that tracks feature launches, pricing changes, partnerships, and architectural shifts from 15 data integration companies — extracted daily by Claude and surfaced in a live filterable feed.

2026 · data · engineering · ai · self-initiated

The data integration space moves fast. Fivetran ships a new connector, dbt Labs drops a pricing tier, Airbyte announces a partnership — and that information is scattered across blog posts, GitHub release notes, and press pages with no easy way to track it at scale. This pipeline gathers those announcements into a single structured feed.

Every morning, a GitHub Actions job scrapes the blog and GitHub releases pages for 15 companies, sends the raw content to Claude Haiku for structured extraction, and upserts the results into Neon Postgres. A Next.js page reads from Postgres and renders the feed — filterable by company and entity type, with state reflected in the URL.

Recent from the feed

Feature Launch · Fivetran · Agents Schema · Jun 29, 2026
Architecture · Fivetran · Proprietary cloud data warehouses and raw data lakes → Open Data Infrastructure (lakehouse architecture with open storage, file formats, table metadata, and flexible compute) · Jun 24, 2026
Architecture · Fivetran · Architecture shift · Jun 22, 2026
Partnership · Fivetran · dbt Labs · Jun 22, 2026
Partnership · Fivetran · Snowflake · Jun 22, 2026
Feature Launch · ClickHouse · Postgres Managed by ClickHouse - RBAC, Terraform, ClickPipes, extensions · Jun 18, 2026
Feature Launch · dbt Labs · dbt-core v2.0.0-alpha.2 · Jun 18, 2026

How it works

The pipeline runs in three stages:

  1. Scrape — getSourceUrls() assembles a list of URLs per company, preferring RSS feeds where available and falling back to the blog root. For companies with GitHub release notes, the releases page is also included. Each page is fetched and content-hashed (SHA-256). If the hash matches what's already in Postgres, the page is skipped — no unnecessary extraction.

  2. Extract — Pending pages are passed to Claude Haiku via a single tool_use call that returns all four entity types at once: feature launches, pricing changes, partnerships, and architectural shifts. The tool schema enforces date format, required fields, and company attribution. Missing or malformed fields fall back to defaults rather than aborting the run.

  3. Upsert — Extracted entities are written to four typed tables (vf_feature_launches, vf_pricing_changes, vf_partnerships, vf_architectural_shifts). Duplicate source URLs are resolved by updating only when the content hash changes.
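The skip-or-update decision in the scrape and upsert stages can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the `StoredPage` shape and the `hashContent` and `decide` helpers are hypothetical names, and the real table columns are not shown in this write-up.

```typescript
import { createHash } from "node:crypto";

// Hypothetical shape of a previously stored page row (assumption).
interface StoredPage {
  url: string;
  contentHash: string;
}

type Decision = "insert" | "update" | "skip";

// SHA-256 of the fetched body, hex-encoded.
function hashContent(body: string): string {
  return createHash("sha256").update(body, "utf8").digest("hex");
}

// Compare a freshly fetched body against what is already stored:
// unseen URL -> insert, same hash -> skip, different hash -> update.
function decide(stored: StoredPage | undefined, body: string): Decision {
  if (stored === undefined) return "insert";
  return stored.contentHash === hashContent(body) ? "skip" : "update";
}
```

Only pages that come back as "insert" or "update" would need to be sent on for extraction, which is what keeps unchanged pages from costing a model call.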

A separate weekly cron resets any failed pages back to pending and re-runs extraction — so transient scrape failures are recovered automatically without manual intervention.
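As a rough sketch of the reset step, the function below flips failed pages back to pending in memory. The `RawPage` shape and the "extracted" status are assumptions made for illustration; in the real pipeline this would presumably be a single update against Postgres.

```typescript
// Status values: "pending" and "failed" come from the description above;
// "extracted" is an assumed name for successfully processed pages.
type PageStatus = "pending" | "extracted" | "failed";

interface RawPage {
  url: string;
  status: PageStatus;
}

// Return a new list where every failed page is pending again,
// plus the number of pages that were reset.
function resetFailedPages(pages: RawPage[]): { pages: RawPage[]; resetCount: number } {
  let resetCount = 0;
  const next: RawPage[] = pages.map((page) => {
    if (page.status !== "failed") return page;
    resetCount += 1;
    return { ...page, status: "pending" };
  });
  return { pages: next, resetCount };
}
```

The input array is left untouched, so a caller can log the before and after states of a retry pass.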

The feed

The live feed shows everything the pipeline has extracted. Filters for company and entity type are multi-select and URL-reflected — so filtered views are shareable. The company list is derived from actual data, so companies with no extracted entities don't appear as empty filter options.
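One way to keep multi-select filters in the URL is to treat the query string as the source of truth. The sketch below assumes repeated `company` and `type` parameters; those parameter names and the helper functions are illustrative, not taken from the actual page.

```typescript
interface FeedFilters {
  companies: string[];
  types: string[];
}

// Read multi-select filters from a query string such as
// "company=Fivetran&company=dbt+Labs&type=partnership".
function parseFilters(search: string): FeedFilters {
  const params = new URLSearchParams(search);
  return {
    companies: params.getAll("company"),
    types: params.getAll("type"),
  };
}

// Write filters back to a query string. Values are sorted so the same
// selection always produces the same shareable URL.
function serializeFilters(filters: FeedFilters): string {
  const params = new URLSearchParams();
  for (const company of [...filters.companies].sort()) params.append("company", company);
  for (const type of [...filters.types].sort()) params.append("type", type);
  return params.toString();
}

// Build the company filter options from rows that actually exist,
// so companies without entities never show up as empty options.
function companyOptions(rows: { company: string }[]): string[] {
  return Array.from(new Set(rows.map((row) => row.company))).sort();
}
```

Sorting before serializing is a design choice: two users who pick the same filters in a different order end up with identical links.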

The first pipeline run returned 321 entities across 12 companies. The retry worker recovered all 3 initially failed pages on its first pass.

Stack

GitHub Actions · Daily cron at 7am UTC — scrape, extract, upsert pipeline · DAILY
Claude Haiku · Structured entity extraction — single tool_use call per page extracts all 4 entity types · DAILY
Neon Postgres · Entity storage — 5 tables with vf_ prefix, SHA-256 deduplication · DAILY
Next.js · App Router server component reads from Postgres; client component handles filters · DAILY
Zod · Company config validation and pipeline schema enforcement · DAILY
Tailwind CSS · Feed UI — uses site design tokens throughout · DAILY
Legend: daily driver · regular use · specific cases

What's next

Phase 2 — Article-level crawling

The current pipeline fetches one page per company — a blog index or RSS feed — which means Claude extracts entities from excerpt blurbs rather than full article text. The result is missing dates, partial descriptions, and lower confidence on partnerships and architectural shifts. Phase 2 replaces the index-page scrape with full article ingestion.

Design decisions, resolved:

  • Discovery strategy — RSS-first. Companies with rss_url configured get article-level crawling; companies without RSS fall back to the existing single-page blog index scrape until a feed is found. GitHub releases pages stay as single-page scrapes — they already render full release content inline.
  • Recency window — Only articles published within the last 90 days are fetched. Older items are filtered during RSS parsing, before any HTTP requests are made.
  • Pipeline shape — Discovery folds into the scrape step. The external pipeline shape (scrape → extract → upsert) stays unchanged; scrapeCompany() internally parses the RSS feed, collects article URLs, and fetches them in the same pass.
  • Concurrency — Article fetches run with a global cap of 5 concurrent requests to avoid rate-limiting vendor servers. Steady-state daily volume (10–20 new articles across all companies) makes this mostly a first-run concern.
  • Index pages dropped — Blog index pages are no longer stored in vf_raw_pages. They're used transiently to discover article URLs and discarded. One row per article, keyed on article URL.
  • Clean migration — Existing entity data was extracted from index blurbs and is lower quality. A dedicated migrate-vendor-feed-reset.ts script truncates all entity tables and clears vf_raw_pages before the first article-crawl run. Everything re-extracts from full article text.
  • Model — Stays on Claude Haiku. The quality improvement from full articles outweighs any gain from a larger model on the same structured extraction task.
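The recency window and the concurrency cap from the decisions above might look roughly like this. It is a sketch under assumptions: `FeedItem`, `withinWindow`, and `mapWithLimit` are hypothetical names, the worker stands in for whatever fetch function Phase 2 ends up using, and error handling is left out.

```typescript
// Hypothetical shape of an item parsed from an RSS feed (assumption).
interface FeedItem {
  url: string;
  publishedAt: Date;
}

const RECENCY_WINDOW_DAYS = 90;
const MAX_CONCURRENT_FETCHES = 5;

// Drop items older than the window before any HTTP request is made.
function withinWindow(items: FeedItem[], now: Date, days: number = RECENCY_WINDOW_DAYS): FeedItem[] {
  const cutoff = now.getTime() - days * 24 * 60 * 60 * 1000;
  return items.filter((item) => item.publishedAt.getTime() >= cutoff);
}

// Run `worker` over all items with at most `limit` calls in flight.
// Results keep the order of the input array. Assumes limit >= 1.
async function mapWithLimit<T, R>(
  items: T[],
  limit: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;

  // Each runner pulls the next unclaimed index until none are left.
  async function runner(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index]);
    }
  }

  const runners = Array.from({ length: Math.min(limit, items.length) }, () => runner());
  await Promise.all(runners);
  return results;
}
```

A call such as `mapWithLimit(withinWindow(items, new Date()), MAX_CONCURRENT_FETCHES, fetchArticle)` would then apply both rules in one pass, where `fetchArticle` is a placeholder for the real fetcher. Because the cap lives in one shared pool, it is global across companies rather than per vendor.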

Also planned:

  • Payload cap — the feed query is currently uncapped; a LIMIT + total count label will keep the RSC payload bounded as the dataset grows.
  • Pipeline status card — the existing Pipeline Dashboard will get a vendor feed card showing last run time, entity count, and health.
  • Validations — record count floors, staleness checks, and deduplication audits to surface data quality issues rather than silently pass bad data.
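The planned validations could start from something like the sketch below, which checks a record count floor, staleness of the newest row, and duplicate keys. The `EntityRow` shape, the thresholds, and the choice of company, title, and source URL as the duplicate key are all assumptions for illustration.

```typescript
// Hypothetical entity row; the real table columns may differ.
interface EntityRow {
  company: string;
  title: string;
  sourceUrl: string;
  extractedAt: Date;
}

interface ValidationOptions {
  minRows: number; // record count floor
  maxAgeHours: number; // how old the newest row may be
}

// Return a list of human-readable issues; an empty list means the checks passed.
function validateEntities(rows: EntityRow[], now: Date, opts: ValidationOptions): string[] {
  const issues: string[] = [];

  // Record count floor.
  if (rows.length < opts.minRows) {
    issues.push(`row count ${rows.length} is below floor ${opts.minRows}`);
  }

  // Staleness: age of the most recently extracted row.
  if (rows.length > 0) {
    let newest = 0;
    for (const row of rows) newest = Math.max(newest, row.extractedAt.getTime());
    const ageHours = (now.getTime() - newest) / 3_600_000;
    if (ageHours > opts.maxAgeHours) {
      issues.push(`newest row is ${ageHours.toFixed(1)}h old, limit is ${opts.maxAgeHours}h`);
    }
  }

  // Deduplication audit: the same company, title, and source URL seen twice.
  const seen = new Set<string>();
  const duplicates = new Set<string>();
  for (const row of rows) {
    const key = `${row.company}|${row.title}|${row.sourceUrl}`;
    if (seen.has(key)) duplicates.add(key);
    seen.add(key);
  }
  if (duplicates.size > 0) {
    issues.push(`${duplicates.size} duplicated entity key(s)`);
  }

  return issues;
}
```

Returning issues instead of throwing lets a run report every problem it finds and leaves the pass or fail decision to the caller.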

Open the feed →