Always-live data fleet
Joseph 2026-05-09: "We need to figure out how we keep all of this live — all the time, including local data sources."
The fleet now consists of three stacked layers. Each layer is independent and bash-only (zero Claude tokens); if any one layer fails, the others compensate.
Layer 1 · Cron floor — 15-min extractor cadence
Every chrome_bridge extractor fires every 15 minutes via launchd StartInterval=900. This is the passive freshness floor — no signal needed, just runs.
| Job | Cadence | What it does |
|---|---|---|
| com.skyrun.track-metrics | every 15 min | extract_track_metrics.py → pwa/track_metrics.json |
| com.skyrun.keydata-metrics | every 15 min | extract_keydata_metrics.py → pwa/keydata_metrics.json |
| com.skyrun.smartlead-metrics | every 15 min | extract_smartlead_metrics.py → pwa/outbound.json |
| com.skyrun.session-keep-alive | every 10 min | session_keep_alive.py → pwa/source_health.json |
| com.skyrun.pwa-periodic-refresh | every 30 min | build_pwa.py + deploy_pwa.sh |
(Previous plist versions are kept with a .bak.20260509T112728 suffix for rollback.)
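A minimal sketch of what one of these launchd agents looks like, assuming a standard StartInterval plist shape — the script path is a placeholder, not the real install location:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.skyrun.track-metrics</string>
    <key>ProgramArguments</key>
    <array>
        <string>/usr/bin/python3</string>
        <string>/path/to/extract_track_metrics.py</string> <!-- placeholder path -->
    </array>
    <key>StartInterval</key>
    <integer>900</integer> <!-- 900 s = 15 min passive floor -->
</dict>
</plist>
```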
Layer 2 · Watchdog enforcer — 3-min reactive check
freshness_watchdog.py runs every 3 min via com.skyrun.freshness-watchdog. For every source, it:
1. Reads pwa/source_health.json for probe state
2. Reads each metrics file for captured_at
3. Computes age. If older than per-source threshold:
- chrome_bridge source + probe says alive → run extractor (with WrongAccountError retry)
- chrome_bridge source + probe says expired/no_tab → skip + push ntfy on first detection
- local file → run build_pwa.py + deploy_pwa.sh
4. Pushes an ntfy alert after 3 consecutive failures
5. Writes heartbeat to health/freshness_watchdog_*.json
Per-source max_age_min thresholds (in freshness_watchdog.py):
- track: 15 min · keydata: 30 min · smartlead: 15 min
- deals/commitments/fleet: 60 min
- market: 90 min · leads: 120 min · lead_coords: 240 min · lint_sweep: 240 min
Effective ceiling on staleness = max_age_min + 3 min (the watchdog's polling interval).
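The per-source staleness check above can be sketched as follows — the function and dict names are illustrative, not copied from freshness_watchdog.py:

```python
# Hypothetical sketch of the Layer-2 freshness check.
from datetime import datetime, timezone

# Per-source thresholds from the doc above
MAX_AGE_MIN = {
    "track": 15, "keydata": 30, "smartlead": 15,
    "deals": 60, "commitments": 60, "fleet": 60,
    "market": 90, "leads": 120, "lead_coords": 240, "lint_sweep": 240,
}
WATCHDOG_INTERVAL_MIN = 3  # polling cadence; worst-case staleness adds this

def is_stale(source, captured_at_iso, now=None):
    """Return True when a source's captured_at exceeds its max_age_min."""
    now = now or datetime.now(timezone.utc)
    captured = datetime.fromisoformat(captured_at_iso)
    age_min = (now - captured).total_seconds() / 60
    return age_min > MAX_AGE_MIN[source]
```

A stale source then routes to either an extractor re-run (chrome_bridge + live probe) or a rebuild (local file), per the decision tree above.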
Layer 3 · On-demand triggers — instant refresh
extract_trigger_runner.py polls Cloudflare KV every 30s for "Trigger now" requests posted from the PWA Sources page (P5 freshness pill button). Joseph clicks "Trigger now" → the trigger lands in KV → the runner picks it up within 30s → the extractor fires → redeploy → the PWA shows fresh state within ~60s end-to-end.
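The poll-and-dispatch step can be sketched as below; the KV key layout, registry shape, and function names are assumptions, not the real extract_trigger_runner.py internals:

```python
# Hypothetical sketch of one 30 s polling tick of the trigger runner.
EXTRACTORS = {
    "track": "extract_track_metrics.py",  # illustrative registry entry
}

def poll_once(read_kv, run_extractor):
    """Drain pending 'Trigger now' requests from KV and fire extractors.

    read_kv: callable returning the list of triggered source names
    run_extractor: callable that executes the given extractor script
    """
    fired = []
    for source in read_kv():
        script = EXTRACTORS.get(source)
        if script:                 # unknown sources are ignored
            run_extractor(script)
            fired.append(source)
    return fired
# The real runner presumably wraps this in a loop with time.sleep(30).
```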
Failure modes + recovery
| Failure | Detection | Recovery |
|---|---|---|
| Browser session expired | probe flips to expired | ntfy push w/ click-to-Sources URL; watchdog skips trigger; user re-auths |
| Extractor exits 1 (WrongAccountError) | watchdog _run_extractor reads stderr | retries 2x with 30s pause; if still failing → ntfy on 3rd consecutive |
| Extractor exits 0 but writes auth_expired metrics | watchdog reads chrome_bridge_status from output file | counts as failure for freshness purposes; ntfy on first detection (deduped 60 min) |
| build_pwa fails | watchdog captures non-zero rc | adds to errors[]; partial heartbeat; ntfy via dq-realtime-monitor on next tick |
| Local file >threshold + no producer running | watchdog detects stale mtime | kicks build_pwa + deploy |
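The WrongAccountError retry policy in the table (2 retries with a 30 s pause) could be implemented roughly as follows — a hedged sketch, with an injected runner so the policy is testable, not the watchdog's actual _run_extractor:

```python
# Illustrative retry wrapper matching the "retries 2x with 30s pause" policy.
import time

def run_with_retries(run, retries=2, pause_s=30, sleep=time.sleep):
    """Run an extractor, retrying on non-zero exit; return the final returncode.

    run: callable returning the extractor's exit code
    sleep: injectable for testing (defaults to a real 30 s pause)
    """
    rc = run()
    for _ in range(retries):
        if rc == 0:
            return 0
        sleep(pause_s)
        rc = run()
    return rc
```

A still-failing run after the retries would then feed the consecutive-failure counter that gates the ntfy push.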
What's NOT in the fleet (and why)
- HubSpot — no extractor; data pulled live by skills as needed (engagement-reconciler, sot-reconciler)
- Gmail — same reason; live MCP pulls per skill
- BeenVerified — daily scheduled task only (12/day Premium budget); freshness is by design slow
How to add a new source
1. If chrome_bridge: write extract_{source}.py, add launchd plist with StartInterval=900, register in EXTRACTORS dict in extract_trigger_runner.py and in CHROME_SOURCES in freshness_watchdog.py, add probe handler in session_keep_alive.py
2. If local: write the producer skill, ensure it writes a captured_at timestamp, add entry to LOCAL_FILES in freshness_watchdog.py with appropriate max_age_min
3. Add the source to the PWA Sources page registry (pwa/preview-sources.html)
4. Update deploy_pwa.sh to copy the new file (cherry-pick gotcha — see ~/Documents/SkyRun_Remediation_2026-05-08/HANDOFF_TO_GC_SYSTEM.md Section 3)
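To make step 1 concrete, the registry entries for a new chrome_bridge source might look like this — the dict shapes and the "billing" source are hypothetical, shown only to illustrate where each registration lands:

```python
# Illustrative registrations for a hypothetical new source "billing".

# extract_trigger_runner.py — maps source name to extractor script
EXTRACTORS = {
    "billing": "extract_billing_metrics.py",
}

# freshness_watchdog.py — probe-gated chrome_bridge sources
CHROME_SOURCES = {"billing"}

# freshness_watchdog.py — local-file sources instead get a LOCAL_FILES entry
LOCAL_FILES = {
    "lint_sweep": {"path": "pwa/lint_sweep.json", "max_age_min": 240},
}
```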
Verification commands
```bash
# Watchdog last run
cat "$(ls -t ~/Library/Application\ Support/SkyRun/health/freshness_watchdog_*.json | head -1)"

# Live cron schedule
for p in com.skyrun.{track-metrics,keydata-metrics,smartlead-metrics,freshness-watchdog,session-keep-alive}; do
  echo "=== $p ==="
  grep -A1 "StartInterval" ~/Library/LaunchAgents/$p.plist
done

# Per-source freshness (live)
cd ~/Library/Application\ Support/SkyRun/pwa
for f in track_metrics.json keydata_metrics.json outbound.json; do
  echo "=== $f ==="
  python3 -c "import json; d=json.load(open('$f')); print('captured_at:', d.get('captured_at'), '| chrome_bridge_status:', d.get('chrome_bridge_status'))"
done
```
Why this is safe (token-wise)
Every layer is bash + Python via launchd (zero Claude tokens). The Cowork Claude scheduled tasks (sot-reconciler, engagement-reconciler, qb, etc.) were the token-burners and were slowed on 5/9 (see reference_token_optimization_2026-05-09.md). The freshness fleet runs on the deterministic plumbing layer, not the AI reasoning layer.