Data · Engineering · AI

Vendor intelligence feed

An automated pipeline that tracks feature launches, pricing changes, partnerships, and architectural shifts from 15 data integration companies — extracted daily by Claude and surfaced in a live filterable feed.

2026 · data · engineering · ai · self-initiated

The data integration space moves fast. Fivetran ships a new connector, dbt Labs drops a pricing tier, Airbyte announces a partnership — and that information is scattered across blog posts, GitHub release notes, and press pages with no easy way to track it at scale. This pipeline changes that.

Every morning, a GitHub Actions job scrapes the blog and GitHub releases pages for 15 companies, sends the raw content to Claude Haiku for structured extraction, and upserts the results into Neon Postgres. A Next.js page reads from Postgres and renders the feed — filterable by company and entity type, with state reflected in the URL.

Recent from the feed

Partnership · Airbyte · ChatGPT · May 19, 2026
Feature Launch · Airbyte · Unified MCP Gateway · May 19, 2026
Feature Launch · Airbyte · Airbyte Agents in ChatGPT · May 19, 2026
Feature Launch · dbt Labs · dbt Agent Skills · May 18, 2026
Feature Launch · dbt Labs · dbt Agent Skills · May 18, 2026
Partnership · Prefect · Snowflake · May 15, 2026
Partnership · Prefect · Snowflake · May 15, 2026

How it works

The pipeline runs in three stages:

  1. Scrape: getSourceUrls() assembles a list of URLs per company, preferring RSS feeds where available and falling back to the blog root. For companies with GitHub release notes, the releases page is also included. Each page is fetched and content-hashed (SHA-256). If the hash matches what's already in Postgres, the page is skipped, so unchanged content is never re-extracted.

  2. Extract: pending pages are passed to Claude Haiku via a single tool_use call that returns all four entity types at once (feature launches, pricing changes, partnerships, and architectural shifts). The tool schema enforces date format, required fields, and company attribution. Missing or malformed fields fall back to defaults rather than aborting the run.

  3. Upsert: extracted entities are written to four typed tables (vf_feature_launches, vf_pricing_changes, vf_partnerships, vf_architectural_shifts). A source URL that already exists is updated only when its content hash has changed.
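Stages 1 and 3 both hinge on the content hash. A minimal sketch of that check, assuming a stored row that carries the URL and its last hash; StoredPage, hashContent, and planUpsert are hypothetical names, not the pipeline's actual code:

```typescript
import { createHash } from "node:crypto";

// Hypothetical row shape for a tracked page; the real schema is not shown in the write-up.
interface StoredPage {
  url: string;
  contentHash: string;
}

type UpsertAction = "insert" | "update" | "skip";

// Hex-encoded SHA-256 of the fetched page body.
function hashContent(body: string): string {
  return createHash("sha256").update(body, "utf8").digest("hex");
}

// Compare a fresh fetch against the stored row (undefined when the URL is new).
// "skip" means the page is unchanged and needs no extraction or write.
function planUpsert(body: string, stored: StoredPage | undefined): UpsertAction {
  if (stored === undefined) return "insert";
  return stored.contentHash === hashContent(body) ? "skip" : "update";
}
```

Hashing the body rather than comparing timestamps means a page that is re-published without changes costs one fetch and no model call.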

A separate weekly cron resets any failed pages back to pending and re-runs extraction, so transient scrape failures are recovered without manual intervention.
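The reset step can be pictured as a pure function over page records. This is a sketch under assumptions: the status values and the RetryPage shape are illustrative, and the real job presumably does the same thing with a single UPDATE against Postgres.

```typescript
// Hypothetical status values; only "failed" and "pending" are named in the write-up.
type RetryStatus = "pending" | "done" | "failed";

interface RetryPage {
  url: string;
  status: RetryStatus;
}

// Flip every failed page back to pending and report how many were reset.
// Pages in any other state are returned untouched.
function resetFailedPages(pages: RetryPage[]): { pages: RetryPage[]; resetCount: number } {
  let resetCount = 0;
  const next = pages.map((page): RetryPage => {
    if (page.status !== "failed") return page;
    resetCount += 1;
    return { ...page, status: "pending" };
  });
  return { pages: next, resetCount };
}
```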

The feed

The live feed shows everything the pipeline has extracted. Filters for company and entity type are multi-select and URL-reflected — so filtered views are shareable. The company list is derived from actual data, so companies with no extracted entities don't appear as empty filter options.
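A rough sketch of how URL-reflected multi-select filters and a data-derived company list can work, using the standard URLSearchParams API. The parameter names company and type, and all function names, are assumptions for illustration:

```typescript
interface FeedFilters {
  companies: string[];
  types: string[];
}

// Read repeated query parameters, e.g. "?company=Airbyte&company=Prefect".
function parseFilters(search: string): FeedFilters {
  const params = new URLSearchParams(search);
  return { companies: params.getAll("company"), types: params.getAll("type") };
}

// Serialize filters back into a query string. Values are sorted so the
// same selection always produces the same shareable URL.
function toQueryString(filters: FeedFilters): string {
  const params = new URLSearchParams();
  for (const company of [...filters.companies].sort()) params.append("company", company);
  for (const type of [...filters.types].sort()) params.append("type", type);
  return params.toString();
}

// Build the company filter options from rows that actually exist,
// so companies with no extracted entities never show up.
function companiesWithData(rows: { company: string }[]): string[] {
  return Array.from(new Set(rows.map((row) => row.company))).sort();
}
```

For example, selecting dbt Labs and Airbyte with the partnership type would serialize to `company=Airbyte&company=dbt+Labs&type=partnership`, and parsing that string returns the same selection.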

The first pipeline run returned 321 entities across 12 companies. The retry worker recovered all 3 initially failed pages on its first pass.

Stack

GitHub Actions · Daily cron at 7am UTC: scrape, extract, upsert pipeline · DAILY
Claude Haiku · Structured entity extraction: single tool_use call per page extracts all 4 entity types · DAILY
Neon Postgres · Entity storage: 5 tables with vf_ prefix, SHA-256 deduplication · DAILY
Next.js · App Router server component reads from Postgres; client component handles filters · DAILY
Zod · Company config validation and pipeline schema enforcement · DAILY
Tailwind CSS · Feed UI: uses site design tokens throughout · DAILY
daily driver · regular use · specific cases
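To make the single tool_use call more concrete, here is one possible shape for the tool definition and for reading its result. The definition follows the Messages API tool format (name, description, input_schema), but the tool name record_entities, the field names, and the fallback rules (page company when attribution is missing, null for a malformed date, drop entries without a title) are all assumptions rather than the pipeline's real schema:

```typescript
// Keys mirror the four entity types named in the write-up; field names are assumed.
const ENTITY_KEYS = ["feature_launches", "pricing_changes", "partnerships", "architectural_shifts"] as const;
type EntityKey = (typeof ENTITY_KEYS)[number];

interface ExtractedEntity { company: string; title: string; date: string | null; }
type Extraction = Record<EntityKey, ExtractedEntity[]>;

const entityList = {
  type: "array",
  items: {
    type: "object",
    properties: {
      company: { type: "string" },
      title: { type: "string" },
      date: { type: "string", description: "ISO date, YYYY-MM-DD" },
    },
    required: ["company", "title"],
  },
};

// Hypothetical tool definition: one call returns all four entity lists.
const extractTool = {
  name: "record_entities",
  description: "Record every entity found on the page, grouped by type.",
  input_schema: {
    type: "object",
    properties: {
      feature_launches: entityList,
      pricing_changes: entityList,
      partnerships: entityList,
      architectural_shifts: entityList,
    },
    required: [...ENTITY_KEYS],
  },
};

// Minimal view of a response content block; only tool_use blocks matter here.
interface ResponseBlock { type: string; name?: string; input?: unknown; }

const ISO_DATE = /^\d{4}-\d{2}-\d{2}$/;

// Apply fallbacks instead of throwing, so one bad field never aborts the run.
function normalizeEntity(raw: unknown, pageCompany: string): ExtractedEntity | null {
  if (typeof raw !== "object" || raw === null) return null;
  const r = raw as Record<string, unknown>;
  if (typeof r.title !== "string" || r.title.trim() === "") return null;
  return {
    company: typeof r.company === "string" && r.company !== "" ? r.company : pageCompany,
    title: r.title.trim(),
    date: typeof r.date === "string" && ISO_DATE.test(r.date) ? r.date : null,
  };
}

// Pull the tool input out of the response content and normalize every list.
function readExtraction(content: ResponseBlock[], pageCompany: string): Extraction {
  const block = content.find((b) => b.type === "tool_use" && b.name === extractTool.name);
  const input = (block?.input ?? {}) as Record<string, unknown>;
  const out: Extraction = { feature_launches: [], pricing_changes: [], partnerships: [], architectural_shifts: [] };
  for (const key of ENTITY_KEYS) {
    const items = Array.isArray(input[key]) ? (input[key] as unknown[]) : [];
    for (const item of items) {
      const entity = normalizeEntity(item, pageCompany);
      if (entity !== null) out[key].push(entity);
    }
  }
  return out;
}
```

In a real request, extractTool would go in the tools array, and the call could be forced to use it by setting tool_choice to that tool's name.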

What's next

The pipeline is accumulating data daily. A few things are planned for phase 2:

  • Payload cap — the feed query is currently uncapped; a LIMIT + total count label will keep the RSC payload bounded as the dataset grows.
  • Pipeline status card — the existing Pipeline Dashboard will get a vendor feed card showing last run time, entity count, and health.
  • Validations — record count floors, staleness checks, and deduplication audits to surface data quality issues rather than silently pass bad data.
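Since the validations are still planned, the following is only a sketch of what the three checks might look like. The row shape, the thresholds, and the duplicate key (company, source URL, title) are all hypothetical:

```typescript
interface FeedRow {
  company: string;
  sourceUrl: string;
  title: string;
  extractedAt: Date;
}

interface ValidationLimits {
  minRows: number;      // record count floor
  maxAgeHours: number;  // staleness threshold for the newest row
}

interface ValidationIssue {
  check: "count_floor" | "staleness" | "duplicates";
  detail: string;
}

// Return a list of issues instead of throwing, so a status card can render them.
function validateFeed(rows: FeedRow[], now: Date, limits: ValidationLimits): ValidationIssue[] {
  const issues: ValidationIssue[] = [];

  // Record count floor: too few rows suggests a broken scrape or extraction.
  if (rows.length < limits.minRows) {
    issues.push({ check: "count_floor", detail: `${rows.length} rows, expected at least ${limits.minRows}` });
  }

  // Staleness: age of the newest row. An empty table counts as stale.
  const newestMs = rows.reduce((max, row) => Math.max(max, row.extractedAt.getTime()), -Infinity);
  const ageHours = (now.getTime() - newestMs) / (60 * 60 * 1000);
  if (ageHours > limits.maxAgeHours) {
    const detail = rows.length === 0 ? "no rows at all" : `newest row is ${ageHours.toFixed(1)}h old`;
    issues.push({ check: "staleness", detail });
  }

  // Deduplication audit: count rows that repeat the same assumed key.
  const seen = new Set<string>();
  let duplicates = 0;
  for (const row of rows) {
    const key = [row.company, row.sourceUrl, row.title].join("\u0000");
    if (seen.has(key)) duplicates += 1;
    else seen.add(key);
  }
  if (duplicates > 0) {
    issues.push({ check: "duplicates", detail: `${duplicates} duplicate rows` });
  }

  return issues;
}
```

An empty result means all three checks passed; anything else can be logged or shown on the planned status card.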

Open the feed →