reference incident runbook

How to use this runbook

When an incident occurs, find the matching scenario below and follow the steps in order. Each playbook is self-contained — start at step 1, don't skip ahead.

For ANY incident: open ~/Desktop/SkyRun/audit/incidents/<YYYY-MM-DD>_<short-name>.md and timestamp every action you take. Time-of-detection, time-of-each-step, time-of-resolution. This file is the post-mortem source.
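The timestamping habit above can be sketched as a small shell helper. This is a minimal sketch, not an existing SkyRun script: the function name `incident_log` and the `INCIDENT_DIR` override are hypothetical, and only the directory and file-name pattern come from this runbook.

```bash
# Hypothetical helper: append a UTC-timestamped line to today's incident file.
# INCIDENT_DIR defaults to the runbook's path; override it when testing.
INCIDENT_DIR="${INCIDENT_DIR:-$HOME/Desktop/SkyRun/audit/incidents}"

incident_log() {
  local short_name="$1" file
  shift
  file="$INCIDENT_DIR/$(date +%Y-%m-%d)_${short_name}.md"
  mkdir -p "$INCIDENT_DIR" || return 1
  # ISO 8601 UTC timestamp, then the free-text description of the action.
  printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >> "$file"
  printf '%s\n' "$file"   # print the path so the caller sees where it landed
}
```

Usage: `incident_log cf_token_leak "revoked leaked token in CF dashboard"` after each step, so the post-mortem timeline writes itself.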

Scenario 1 — Cloudflare API token leaked

Detection signals: unexpected CF deploy notifications, unauthorized KV writes, CF dashboard alerts about new API token activity, any indicator the token is in someone else's hands.

Steps

1. Revoke the leaked token NOW. Cloudflare dashboard → My Profile → API Tokens → find the token → Roll/Revoke. Don't wait to investigate scope first; revoke first.
2. Generate a new token with the same scope (Account → Workers KV Storage Edit + Pages Edit; restrict to your account only).
3. Store new token in Keychain:

   ```bash
   python3 ~/Library/Application\ Support/SkyRun/secrets.py set cloudflare_api_token <new-token>
   ```

4. Verify .env shim picks up the new token:

   ```bash
   bash -c 'source ~/Library/Application\ Support/SkyRun/.env && echo "${CLOUDFLARE_API_TOKEN:0:10}..."'
   ```

5. Test PWA deploy still works:

   ```bash
   bash ~/Library/Application\ Support/SkyRun/deploy_pwa.sh
   ```

6. Audit CF activity logs for unauthorized actions during the window between leak + revoke. Cloudflare dashboard → Audit Log.
7. If unauthorized writes happened to KV: restore from backup (quarterly_backup zips contain pre-incident KV state; manual restore via CF dashboard or API).
8. Document: add entry to ~/Desktop/SkyRun/audit/incidents/<date>_cf_token_leak.md.
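Before running the full deploy in step 5, it can help to confirm that Cloudflare itself accepts the new token. The sketch below assumes `curl` is installed and that the .env shim exports `CLOUDFLARE_API_TOKEN`; the function names are hypothetical. The URL is Cloudflare's token verification endpoint.

```bash
# Print only a short prefix of a secret so the full value never hits the terminal.
mask_token() {
  printf '%.10s...\n' "$1"
}

# Ask Cloudflare whether the token is valid. A working token should come back
# with "success": true in the JSON response.
verify_cf_token() {
  if [ -z "${CLOUDFLARE_API_TOKEN:-}" ]; then
    echo "CLOUDFLARE_API_TOKEN is not set" >&2
    return 1
  fi
  curl -sS -H "Authorization: Bearer $CLOUDFLARE_API_TOKEN" \
    "https://api.cloudflare.com/client/v4/user/tokens/verify"
}
```

Running `verify_cf_token` with the old token after step 1 is also a quick way to confirm the revocation took effect.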

Scenario 2 — HubSpot account compromise

Detection signals: unexpected contact deletions, deal stages changing without operator action, login alerts from unfamiliar geos, breach notification from HS.

Steps

1. Lock the HS account: Settings → Account & Billing → Users → operator's user → Reset Password. Force re-auth.
2. Enable HS 2FA if not already (Settings → Security → 2-Step Verification).
3. Audit HS Activity Log for the last 7 days. Filter by IP, action type. Identify what was changed.
4. If sensitive deals were modified or contacts deleted: restore from HS itself if possible (contact properties have history); otherwise restore from quarterly backup.
5. Revoke any HS API tokens (Settings → Integrations → Private Apps → revoke + recreate).
6. Update CSRF cookie + session: log out, log back in, regenerate cookies. chrome_bridge will pick up new session automatically.
7. Notify HubSpot if you suspect their breach: trust@hubspot.com.
8. Document. Same audit pattern.

Scenario 3 — Mac stolen / lost

Detection signals: Mac is physically gone.

Steps

1. iCloud Find My Mac: mark as lost. Erase remotely if FileVault is OFF (it currently is OFF; see reference_security_posture.md, operator-action #1).
2. Revoke ALL Cloudflare tokens in CF dashboard.
3. Sign out of HubSpot (Settings → Sessions → Revoke all on this account → Sign out everywhere).
4. Revoke Google Workspace devices: myaccount.google.com → Security → Your devices → remove the lost Mac.
5. Rotate Anthropic API key (Anthropic dashboard).
6. Provision a replacement Mac: restore from quarterly backup zip.
- Restore order: install macOS, install Claude Code + dependencies, sign back into Workspace + iCloud, restore SkyRun zip to ~/Desktop/SkyRun and ~/Library/Application Support/SkyRun, restore memory dir to ~/.claude/projects/-Users-<user>-Desktop-SkyRun/memory/.
- Re-add secrets to Keychain (secrets.py set ...) using a NEW set of tokens (don't reuse old).
- Run bash ~/Library/Application\ Support/SkyRun/quarterly_backup.sh to confirm baseline OK.
- Reload launchd agents: launchctl load ~/Library/LaunchAgents/com.skyrun.system-hygiene.plist etc.
- Re-register scheduled tasks via mcp__scheduled-tasks__create_scheduled_task (use installer/components-manifest.json from package).
- Run bash ~/Library/Application\ Support/SkyRun/gate_proof_runner.sh — should pass 57/57.
7. Notify Joseph's contacts if needed (HS support, Anthropic, Cloudflare) about the lost device.
8. Document.
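The "reload launchd agents" sub-step can be sketched as a loop. This assumes every SkyRun agent plist shares the `com.skyrun.` prefix seen in `com.skyrun.system-hygiene.plist`; that prefix rule, the function name, and the `LAUNCHCTL` override are assumptions, not something this runbook confirms.

```bash
# Command used to load agents; overridable so the loop can be dry-run.
LAUNCHCTL="${LAUNCHCTL:-launchctl}"

# Load every com.skyrun.*.plist in the given directory (default: user LaunchAgents).
reload_skyrun_agents() {
  local dir="${1:-$HOME/Library/LaunchAgents}" plist
  for plist in "$dir"/com.skyrun.*.plist; do
    [ -e "$plist" ] || continue          # glob matched nothing
    "$LAUNCHCTL" load "$plist" && printf 'loaded %s\n' "$plist"
  done
}
```

Setting `LAUNCHCTL=echo` first prints the commands instead of running them, which is a cheap check on a freshly restored machine.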

Scenario 4 — Knowledge graph corruption

Detection signals: knowledge_graph.json won't parse, gate-proof fails on KG schema check, downstream skills error on KG read.

Steps

1. Check for the candidate file: cd ~/Desktop/SkyRun && ls knowledge_graph.json.candidate_* (steps 2 and 3 assume this working directory). nightly-consolidation writes a candidate before overwriting on parse failure.
2. Compare candidate to live: diff <(jq -S . knowledge_graph.json.candidate_<date>) <(jq -S . knowledge_graph.json).
3. If candidate is good: restore via mv knowledge_graph.json knowledge_graph.json.broken_<date> && mv knowledge_graph.json.candidate_<date> knowledge_graph.json.
4. If both are bad: restore from quarterly backup. Latest zip in ~/Library/Application Support/SkyRun_Backups/SkyRun_<quarter>_<timestamp>.zip. Extract desktop_SkyRun/knowledge_graph.json.
5. Validate the restored KG:

   ```bash
   python3 ~/Library/Application\ Support/SkyRun/schema_guards.py validate kg ~/Desktop/SkyRun/knowledge_graph.json
   ```

6. Recompute health summary:

   ```bash
   python3 ~/Library/Application\ Support/SkyRun/recompute_health_summary.py
   ```

7. Document the corruption + cause if identifiable.
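Steps 1 to 3 can be sketched as a guarded swap that refuses to promote a candidate which does not parse. The function names are hypothetical. The parse check uses `python3 -m json.tool` and is only a syntax check, so the schema validation in step 5 is still required.

```bash
# Succeeds only if the file parses as JSON.
json_ok() {
  python3 -m json.tool "$1" > /dev/null 2>&1
}

# Swap a parseable candidate into place, keeping the broken live file
# under a .broken_<date> name for the post-mortem.
promote_candidate() {
  local live="$1" candidate="$2" stamp
  stamp="$(date +%Y-%m-%d)"
  if ! json_ok "$candidate"; then
    echo "candidate does not parse: $candidate" >&2
    return 1
  fi
  if [ -e "$live" ]; then
    mv "$live" "$live.broken_$stamp" || return 1
  fi
  mv "$candidate" "$live"
}
```

If `promote_candidate` fails, nothing has been moved and step 4 (restore from the quarterly backup) is the next move.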

Scenario 5 — Accidentally pushed bad outreach

Detection signals: operator notices a draft was sent that shouldn't have been, recipient replies negatively, an internal stakeholder flags a Joseph-authored email as wrong.

Steps

1. Don't send any further follow-ups. Pause SmartLead campaigns: app.smartlead.ai → each campaign → Pause.
2. Send a corrective email to the affected recipient(s) acknowledging the error briefly (no over-explanation; "Apologies — that previous email was sent in error. Disregard.").
3. Add affected recipients to DNC with reason "sent-in-error-<date>".
4. Audit the source: which skill drafted the bad email? Find the entry in pending_drafts.jsonl and read its watchdog_origin field.
5. Tighten the gate: does the relevant skill's HARD RULES section need updating? E.g., if Tim Beegle was missed by prior-decision-check, that gate needed strengthening (which was done on 5/2). Same pattern: identify the gap, hardwire a new gate.
6. Document in feedback_*.md so future skill runs see the lesson.
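The source audit in step 4 can be sketched as a search over pending_drafts.jsonl. Only the `watchdog_origin` field name comes from this runbook; the function name is hypothetical, and the sketch assumes one flat JSON object per line with the recipient address appearing somewhere in it. The sed extraction is deliberately crude; `jq -r .watchdog_origin` is the sturdier choice where jq is available.

```bash
# List the distinct watchdog_origin values of drafts that mention a search string
# (for example the recipient's email address).
draft_origins() {
  local drafts="$1" needle="$2"
  grep -F -- "$needle" "$drafts" |
    sed -n 's/.*"watchdog_origin"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' |
    sort -u
}
```

Usage: `draft_origins /path/to/pending_drafts.jsonl recipient@example.com`, then open the HARD RULES section of whichever skill it names.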

Scenario 6 — SmartLead account locked / suspended

Detection signals: SmartLead login fails, campaigns paused mysteriously, deliverability tanks, abuse complaint received.

Steps

1. Check sender reputation: SmartLead → Email Accounts → reputation scores. Any below 80%?
2. Pause all campaigns until cause identified.
3. Audit what was sent in the last 7 days: SmartLead → Reports → Sent activity. Look for patterns (template, recipient class, geo).
4. Common causes:
- Hitting a non-deliverable list (DNC bypass) — fix DNC gate
- Email content triggering spam filters — review subject + body templates
- Sender reputation dropped (warm-up issue) — pause + warmup
- Recipient marked as spam — accept loss
5. Reach out to SmartLead support if account is suspended.
6. Restore campaigns only after fix is identified + applied.

Scenario 7 — System hygiene fix_queue runaway

Detection signals: fix_queue.jsonl growing rapidly (>50 entries/day), morning brief has many 🛠 Operator-pending items, hourly system_hygiene reports same items repeatedly.

Steps

1. Read the queue: jq -r '.task_id' ~/Library/Application\ Support/SkyRun/fix_queue.jsonl | sort | uniq -c | sort -rn | head.
2. Identify the dominant task_id: that's the runaway source.
3. Find which skill is queueing: grep system_hygiene + skills for the task_id pattern.
4. Triage: is the underlying condition real (operator action needed) or is it a false-positive (skill bug)?
5. If false-positive: patch the skill's queueing logic, drain stale queue items via python3 -c "..." script.
6. If real condition: address the root cause (e.g., chrome_js_toggle off → flip toggle).
7. Verify gate-proof passes after drain.
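The drain in step 5 can be sketched in shell as a backup followed by a filter. This is a stand-in under assumptions, not the elided script the step refers to: the function name is hypothetical, and it assumes each queue line stores the id as `"task_id":"<id>"` or `"task_id": "<id>"`.

```bash
# Remove every queue entry for one task_id, keeping a timestamped backup first.
drain_task() {
  local queue="$1" task_id="$2" backup
  backup="$queue.bak_$(date +%Y%m%dT%H%M%S)"
  cp "$queue" "$backup" || return 1
  # Fixed-string match on both common spacings of the key/value pair.
  # grep -v exits 1 when no lines survive the filter; that is not an error here.
  grep -v -F -e "\"task_id\":\"$task_id\"" -e "\"task_id\": \"$task_id\"" \
    "$backup" > "$queue" || true
  printf 'drained %s; backup at %s\n' "$task_id" "$backup"
}
```

Drain only after the skill's queueing logic is patched; otherwise the same entries reappear on the next hourly run.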

Scenario 8 — Compute resource exhaustion (Mac runs slow / out of disk)

Detection signals: Mac fan running constantly, disk warnings, scheduled tasks taking >5x longer than usual.

Steps

1. Disk space: df -h / — if <10GB free, problem.
2. Find biggest disk consumers: du -sh ~/Library/Application\ Support/SkyRun/ ~/Desktop/SkyRun/ → look for runaway folders.
3. Common culprits: health/ directory if heartbeat retention broke, Email Scans/ if not pruned, Call Transcripts/ over time.
4. Clean up: delete heartbeats >30d (system_hygiene should auto-prune; if not, manual: find ~/Library/Application\ Support/SkyRun/health -type f -mtime +30 -delete). The -type f test limits the delete to files, so directories are left in place.
5. Process load: ps aux | sort -k 3 -rn | head (top CPU). Kill runaway processes.
6. Memory pressure: open Activity Monitor. If memory pressure is high, restart the Claude Code app.
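Steps 1 and 4 can be sketched as two helpers: one reports free space in whole GiB, the other lists old files before anything is deleted. Both function names are hypothetical. The 10 GB threshold and 30-day retention come from the steps above.

```bash
# Free space on a mount point in whole GiB (POSIX df output, 1024-byte blocks).
free_gb() {
  df -Pk "${1:-/}" | awk 'NR == 2 { print int($4 / 1048576) }'
}

# List files older than N days; delete them only when the third argument is "delete".
prune_old() {
  local dir="$1" days="$2" mode="${3:-dry-run}"
  if [ "$mode" = "delete" ]; then
    find "$dir" -type f -mtime +"$days" -delete
  else
    find "$dir" -type f -mtime +"$days" -print
  fi
}
```

Usage: `[ "$(free_gb /)" -lt 10 ] && prune_old ~/Library/Application\ Support/SkyRun/health 30`, review the list, then rerun with `delete` as the third argument.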

Templates

Incident audit file template

~/Desktop/SkyRun/audit/incidents/<YYYY-MM-DD>_<short-name>.md:

markdown
Incident: <name>

Detected: <ISO timestamp>
Reported by: Joseph / system_hygiene / fleet dashboard / vendor notice
Severity: P0 / P1 / P2

Timeline
HH:MM — detection
HH:MM — initial response
HH:MM — root cause identified
HH:MM — mitigation applied
HH:MM — verified resolved

Root cause
<1-2 paragraphs>

Mitigation
<what was done>

Residual risk
<what's still exposed; what's monitored going forward>

Action items
[ ] Patch X in skill Y
[ ] Update vendor register entry
[ ] Add gate-proof check for similar regression

Scenario 9 — Gmail / Workspace account compromise (added 2026-05-04 audit)

Detection:

- Google security alert email ("New sign-in from <unknown>")
- Unexpected forwarding rules visible in Gmail Settings → Forwarding and POP/IMAP
- Sent items containing emails the operator didn't write
- live-ea or transcript-scan flags messages with anomalous headers

Severity: P0 (full PII access — email, contacts, Drive, Calendar)

Steps:
1. Revoke active sessions: myaccount.google.com → Security → Manage all devices → Sign out unknown
2. Reset password + force MFA reset
3. Audit Gmail Settings → Filters and Forwarding — delete unknown forwarding rules
4. Audit Drive sharing → Activity log → revoke unauthorized
5. Audit Calendar for unauthorized event invites
6. Re-auth Claude Code Gmail MCP
7. Check Workspace audit log: admin.google.com → Reports → Audit log
8. Verify Sent / Drafts / Outbox for outbound on operator's behalf

Rollback: If outbound was sent, send corrections + add affected to DNC for 30d. If Drive files accessed, alert anyone whose data was in those files.

---

Scenario 10 — BeenVerified session failure / account suspension (added 2026-05-04 audit)

Detection:

- daily-beenverified-enrichment heartbeat shows status=error or chrome_bridge_status=auth_expired
- BV redirects to login page despite cached session
- BV API rate-limit error in heartbeat metrics

Severity: P2 (loses daily enrichment cadence — pipeline survives)

Steps:
1. Open Chrome BV tab → manually log in
2. Verify subscription tier (Premium 400/mo cap)
3. Check enrichment_rotation_state.json → monthly_budget.consumed_this_month for quota
4. If suspended: contact BV support; queue lookups via alternative source until resolved
5. Re-run task: python3 ~/Library/Application\ Support/SkyRun/bv_driver.py --max-leads 1
6. Resume cron schedule

Rollback: Pause daily-beenverified-enrichment if BV is unavailable for more than 3 days; run it manually once access is restored.
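The quota check in step 3 can be sketched as a small calculation. The key path `monthly_budget.consumed_this_month` and the 400/mo cap come from the steps above; the function names are hypothetical, the state file's location is not given in this runbook (so the path is passed in), and the counter is assumed to be an integer.

```bash
# Read monthly_budget.consumed_this_month from the rotation state file.
bv_consumed() {
  python3 -c 'import json, sys
state = json.load(open(sys.argv[1]))
print(state["monthly_budget"]["consumed_this_month"])' "$1"
}

# Lookups left this month against a cap (default 400, the Premium tier cap).
bv_remaining() {
  local state="$1" cap="${2:-400}" used
  used="$(bv_consumed "$state")" || return 1
  echo "$((cap - used))"
}
```

A result at or below zero points to quota exhaustion rather than a suspension, which changes whether step 4 applies.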

---

Scenario 11 — Scheduled task silent de-registration (added 2026-05-04 audit)

Detection:

- Morning brief missing expected task output
- Heartbeat older than 2× the expected cadence
- Task absent from mcp__scheduled-tasks__list_scheduled_tasks
- Live precedent: Apr 30 cron-restore task hard-failed silently; 6 tasks were stuck on one-time fireAt for 4 days. Caught only via deep audit 2026-05-04.

Severity: P1 (reconcilers not firing; drift accumulates silently)

Steps:
1. Run mcp__scheduled-tasks__list_scheduled_tasks to confirm absence or wrong-mode (one-time vs cron)
2. Read ~/Library/Application Support/SkyRun/cron_state_pre_fire_*.json for last known cron expression
3. Re-register via mcp__scheduled-tasks__update_scheduled_task with original cronExpression + enabled:true
4. If SKILL.md missing: restore from quarterly backup
5. Verify next firing within expected cadence (heartbeat appears)

Lesson learned: scheduled-tasks MCP refuses to UPDATE tasks from within a scheduled-task session. Use an interactive Claude Code session for repairs.

Rollback: Reconcilers are idempotent — one manual fire catches up. SoT/HS re-sync within 1 cycle.
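The "heartbeat older than 2× cadence" signal can be sketched as a file-age check. The function name is hypothetical, and the sketch assumes each task writes a heartbeat file whose modification time marks its last run; where those files live is not specified here.

```bash
# Exit 0 (stale) if the heartbeat file is missing or was last modified
# more than twice the expected cadence ago; exit 1 otherwise.
heartbeat_stale() {
  local file="$1" cadence_min="$2"
  [ -e "$file" ] || return 0   # a missing heartbeat counts as stale
  # find prints the path only when mtime is older than the limit.
  [ -n "$(find "$file" -mmin +"$((cadence_min * 2))")" ]
}
```

Usage: `heartbeat_stale /path/to/heartbeat.json 60 && echo "check task registration"` for an hourly task.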

---

Scenario 12 — Chrome AppleScript automation disabled (added 2026-05-04 audit)

Detection:

- chrome_bridge errors: "Executing JavaScript through AppleScript is turned off"
- bv_driver, sot_reconciler, engagement_reconciler heartbeats: chrome_bridge_status=session_lost / timeout
- Mac or Chrome was upgraded, resetting security defaults
- Live precedent: Chrome auto-disabled this 5+ times during 2026-05-04 session

Severity: P0 (every Chrome-using reconciler stops)

Steps:
1. Chrome menu bar → View → Developer → Allow JavaScript from Apple Events → check ON
2. System Settings → Privacy & Security → Automation → Terminal/iTerm/Claude → Google Chrome ON
3. Verify: osascript -e 'tell application "Google Chrome" to return URL of active tab of window 1' (should print URL, not error)
4. Re-fire failed reconciler: python3 ~/Library/Application\ Support/SkyRun/sot_reconciler.py
5. If toggling keeps reverting: report Apple/Chrome bug; tighten chrome_bridge retry to absorb it

Rollback: Reconcilers are idempotent.
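Step 5 mentions tightening the retry. How chrome_bridge retries internally is not described in this runbook, so the following is only a generic retry-with-backoff sketch wrapped around the step 3 probe. Both function names are hypothetical.

```bash
# Run a command up to N times, doubling the delay between attempts.
retry() {
  local attempts="$1" delay="$2" try=1
  shift 2
  until "$@"; do
    if [ "$try" -ge "$attempts" ]; then
      return 1
    fi
    sleep "$delay"
    try=$((try + 1))
    delay=$((delay * 2))
  done
}

# The step 3 probe: succeeds only if Chrome answers the AppleScript call.
chrome_ok() {
  osascript -e 'tell application "Google Chrome" to return URL of active tab of window 1' \
    > /dev/null 2>&1
}
```

Usage: `retry 3 2 chrome_ok || echo "toggle is off again; redo steps 1 and 2"`. Retrying only absorbs brief hiccups; it cannot re-enable the Chrome setting.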

---

Scenario 13 — iCloud Advanced Data Protection verification (added 2026-05-04 audit)

This is a quarterly preventive check, not an incident.

Why: Without ADP, Apple holds keys to anything in iCloud Drive — including ~60GB of SkyRun_Backups historical PII. With ADP, only operator's devices can decrypt.

Steps:
1. Settings → [Apple ID at top] → iCloud → Advanced Data Protection
2. If Off: tap → follow prompts. Requires device passcode + recovery contact OR recovery key.
3. If On: confirm recovery contact/key still valid.

Trade-off: ADP makes iCloud data unrecoverable if recovery contact + key are lost. Acceptable for L3-PII.

---

Cross-references

- reference_data_classification.md: what data is at risk in each scenario
- reference_vendor_security_posture.md: vendor contacts + breach pathways
- reference_disaster_recovery.md: Mac-loss specific
- reference_security_posture.md: overall architecture