Operations Runbooks

Bootstrap

bash
cd ../CueCrux-Shared
pnpm install
pnpm build

cd ../VaultCrux
pnpm install
pnpm --filter @vaultcrux/api db:migrate
# Dev servers usually stay in the foreground; start each in its own terminal.
pnpm --filter @vaultcrux/api dev
pnpm --filter @vaultcrux/worker dev

Production-like local bring-up:

bash
cd ../VaultCrux
pnpm stack:up
pnpm e2e:smoke
pnpm e2e:frontdoor:session
pnpm e2e:idempotency
pnpm e2e:replay
pnpm e2e:qdrant
pnpm e2e:canary
pnpm e2e:shield:capability
pnpm e2e:shield:trust
pnpm e2e:shield:approval-taint
pnpm e2e:shield:sandbox
pnpm e2e:shield:sampling
pnpm e2e:shield:kill-switch
pnpm e2e:shield:strict
pnpm exec tsx scripts/e2e-incident-redteam.ts

Bring up the monitoring profile (Prometheus, Alertmanager, Grafana, and the paid-path canary):

bash
cd ../VaultCrux
docker compose --profile monitoring up -d prometheus alertmanager grafana paid-path-canary

Validate paid-path canary targets and recent samples:

bash
cd ../VaultCrux
curl -fsS http://127.0.0.1:16390/api/v1/targets | jq '.data.activeTargets[] | select(.labels.job=="vaultcrux-paid-path-canary") | {health, lastError, scrapeUrl}'
curl -fsS http://127.0.0.1:16390/api/v1/query --data-urlencode 'query=vaultcrux_paid_path_canary_runs_total{flow="frontdoor_session"}'

pnpm e2e:shield:strict must pass before production promotion; it executes the full Shield enforcement suite under enforce-mode flags. Run pnpm exec tsx scripts/e2e-incident-redteam.ts once per quarter to capture evidence for the incident drill and red-team checklist.

Health checks

  • GET /healthz: process liveness
  • GET /readyz: DB + migration readiness
  • GET /metrics: Prometheus metrics
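
The two probe endpoints above can be polled from a script, for example after a deploy or a restore. The sketch below is illustrative only: the helper name wait_ready is hypothetical, the API base URL is not stated in this runbook and must be supplied by the caller, and it assumes both endpoints answer with a non-2xx status when the check fails.

```bash
# Poll /healthz and /readyz until both succeed or the timeout elapses.
# Assumption: the endpoints return a non-2xx status while unhealthy.
# Usage: wait_ready <base-url> [timeout-seconds] [interval-seconds]
wait_ready() {
  local base_url="$1" timeout_s="${2:-60}" interval_s="${3:-2}"
  local deadline=$(( $(date +%s) + timeout_s ))
  while :; do
    # -f makes curl exit non-zero on HTTP 4xx/5xx responses.
    if curl -fsS -o /dev/null "$base_url/healthz" &&
       curl -fsS -o /dev/null "$base_url/readyz"; then
      echo "ready: $base_url"
      return 0
    fi
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "not ready after ${timeout_s}s: $base_url" >&2
      return 1
    fi
    sleep "$interval_s"
  done
}
```

Checking /healthz before /readyz keeps the failure mode readable: if liveness fails, the process is down, whereas a /readyz failure alone points at the database or pending migrations.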

Shield flag promotion (observe -> enforce)

  1. Observe window (minimum 48h):
    • SHIELD_MODE=observe
    • FEATURE_UI_APP_SANDBOX=false
    • FEATURE_KILL_SWITCHES=false
    • Monitor canary results, API error rates, and Shield decision tables.
  2. Pre-enforce checklist:
    • pnpm e2e:frontdoor:session
    • pnpm e2e:smoke
    • pnpm e2e:shield:strict
    • Verify there are no sustained 403 spikes on user-critical routes.
  3. Enforce promotion:
    • set SHIELD_MODE=enforce
    • set FEATURE_UI_APP_SANDBOX=true
    • set FEATURE_KILL_SWITCHES=true
  4. Roll back immediately if false positives block the paid path:
    • set SHIELD_MODE=observe
    • set FEATURE_UI_APP_SANDBOX=false
    • set FEATURE_KILL_SWITCHES=false
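
The promotion steps above can be sketched as a small gate that runs the scripted checklist items and prints the flag set to apply. The function names shield_flags and pre_enforce_gate are hypothetical, and how the printed values reach the deployment (env file, secret store, deploy tool) is not specified in this runbook. The 403 check stays manual.

```bash
# Print the three promotion flags for a Shield mode, one KEY=value per line.
shield_flags() {
  case "$1" in
    observe)
      printf '%s\n' SHIELD_MODE=observe FEATURE_UI_APP_SANDBOX=false FEATURE_KILL_SWITCHES=false ;;
    enforce)
      printf '%s\n' SHIELD_MODE=enforce FEATURE_UI_APP_SANDBOX=true FEATURE_KILL_SWITCHES=true ;;
    *)
      echo "usage: shield_flags observe|enforce" >&2
      return 2 ;;
  esac
}

# Run the scripted pre-enforce checks in order.
# Prints the enforce flags only if every check passes; otherwise prints
# the observe flags and returns non-zero.
pre_enforce_gate() {
  local check
  for check in e2e:frontdoor:session e2e:smoke e2e:shield:strict; do
    if ! pnpm "$check"; then
      echo "pre-enforce check failed: $check; staying in observe" >&2
      shield_flags observe
      return 1
    fi
  done
  shield_flags enforce
}
```

Because observe and enforce are emitted by the same function, the rollback in step 4 is the same call with the other argument: shield_flags observe.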

Common incidents

  • High queue lag:

    • check vaultcrux.citation_staging pending counts
    • scale worker replicas
  • Dead-letter growth:

    • inspect payload.dead_letter_reason
    • replay corrected jobs with a fresh idempotency key
  • Paid-path canary failures:

    • inspect canary logs: docker logs --tail=200 vaultcrux-paid-path-canary
    • query recent counters in Prometheus:
      • increase(vaultcrux_paid_path_canary_runs_total{flow="frontdoor_session",status="failure"}[15m])
      • increase(vaultcrux_paid_path_canary_runs_total{flow="frontdoor_session",status="success"}[15m])
    • validate live path manually: pnpm e2e:frontdoor:session
  • Replay verification:

    • run pnpm e2e:replay to force staging reprocessing and assert idempotent rows
    • run pnpm outbox:replay for bounded outbox redispatch checks
  • Vector latency spikes:

    • verify HNSW index health
    • run pnpm vectors:check for pgvector vs Qdrant consistency
    • temporarily reduce the query limit or disable Qdrant reads (FEATURE_QDRANT_READ=false)
  • Conversion apply backlog:

    • verify FEATURE_CREDIT_CONVERSION=true and FEATURE_PADDLE_DISCOUNT_APPLY=true
    • check vaultcrux.subscription_conversions for pending_apply growth
    • inspect worker logs for vaultcrux-worker-conversion
  • Tip sink discrepancies:

    • ensure @cuecrux is the only sink account
    • verify debit/credit pair reasons (platform_tip, platform_tip_sink)
    • inspect vaultcrux.platform_tips and vaultcrux.economic_receipts
  • Marketplace transfer anomalies:

    • inspect vaultcrux.economy_anomalies
    • confirm seller anonymity in API/MCP responses
    • verify bundle.purchase.cross_tenant outbox events
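
For the paid-path canary incident, the two counter queries can be issued from the shell rather than the Prometheus UI. This is a minimal sketch: the function names and the PROM_URL variable are hypothetical, the default address is the one used by the validation commands earlier in this runbook, and jq is assumed to be installed.

```bash
# Build the PromQL expression for a canary status ("failure" or "success")
# over a time window (default 15m).
canary_increase_query() {
  local status="$1" window="${2:-15m}"
  printf 'increase(vaultcrux_paid_path_canary_runs_total{flow="frontdoor_session",status="%s"}[%s])' \
    "$status" "$window"
}

# Run the query against the Prometheus HTTP API and print the sample values.
# PROM_URL is a hypothetical override for the Prometheus address.
canary_increase() {
  local prom_url="${PROM_URL:-http://127.0.0.1:16390}"
  curl -fsS "$prom_url/api/v1/query" \
    --data-urlencode "query=$(canary_increase_query "$1" "${2:-15m}")" |
    jq -r '.data.result[]?.value[1]'
}
```

Running canary_increase failure and canary_increase success side by side shows whether the canary is failing outright or simply not running; an empty result for both suggests the series is missing, which points at the scrape target rather than the paid path.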

Rollback

  • disable flags: CITATIONS_ASYNC_ENABLED=false, CREDITS_BATCH_ENABLED=false
  • vector rollback: set FEATURE_QDRANT_READ=false first, then FEATURE_VECTOR_DUAL_WRITE=false once the system is stable
  • economy rollback:
    • FEATURE_ECONOMY_MULTIPLIER=false
    • FEATURE_FREE_TIER_ESCROW=false
    • FEATURE_CREDIT_CONVERSION=false
    • FEATURE_PADDLE_DISCOUNT_APPLY=false
    • FEATURE_PLATFORM_TIPS=false
    • FEATURE_CROSS_TENANT_BUNDLES=false
  • MCP rollback:
    • FEATURE_MCP_ENABLED=false
  • Shield rollback (disable highest stage first):
    • FEATURE_KILL_SWITCHES=false
    • FEATURE_UI_APP_SANDBOX=false
    • FEATURE_SAMPLING_GUARDIAN=false
    • FEATURE_SHIELD_ENFORCE_APPROVALS=false
    • FEATURE_SHIELD_ENFORCE_TAINT=false
    • FEATURE_SANDBOX_RUNNER=false
    • FEATURE_TRUST_REGISTRY_ENFORCE_DIGEST=false
    • FEATURE_SHIELD_ENFORCE_CAPABILITY=false
    • FEATURE_SHIELD_ENABLED=false
  • keep the outbox immutable; switch only the read/write execution paths
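
Since the Shield flags are disabled highest stage first, a partial rollback is a prefix of that list. The sketch below prints the first N flags as FLAG=false lines in the order given above; the function name shield_rollback_env is hypothetical, and applying the output to a running deployment is left to whatever mechanism the environment uses.

```bash
# Print FLAG=false for the first N Shield stages, highest stage first.
# With no argument, every stage is printed.
shield_rollback_env() {
  local count="${1:-9}" n=0 flag
  for flag in \
    FEATURE_KILL_SWITCHES \
    FEATURE_UI_APP_SANDBOX \
    FEATURE_SAMPLING_GUARDIAN \
    FEATURE_SHIELD_ENFORCE_APPROVALS \
    FEATURE_SHIELD_ENFORCE_TAINT \
    FEATURE_SANDBOX_RUNNER \
    FEATURE_TRUST_REGISTRY_ENFORCE_DIGEST \
    FEATURE_SHIELD_ENFORCE_CAPABILITY \
    FEATURE_SHIELD_ENABLED
  do
    # Stop once the requested number of stages has been emitted.
    [ "$n" -ge "$count" ] && break
    printf '%s=false\n' "$flag"
    n=$((n + 1))
  done
}
```

For example, shield_rollback_env 2 emits only the kill-switch and UI sandbox flags, leaving the lower enforcement stages untouched.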

Backup + restore

  • Daily backup target: the vaultcrux Postgres database.
  • Retention default: keep 14 daily backups and at least one verified weekly restore point.
  • Backup command:
bash
cd ../VaultCrux
./scripts/backup-postgres.sh
  • Restore command (destructive to current DB contents):
bash
cd ../VaultCrux
./scripts/restore-postgres.sh ./ops/backups/<backup-file.dump>
  • Post-restore verification checklist:
    • pnpm e2e:smoke
    • verify /healthz and /readyz
    • spot-check the latest agent_credit_ledger and contribution_citations rows
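
To check the 14-daily retention default, a dry-run listing of backups that fall outside the newest 14 can help. This sketch assumes backups are *.dump files under ./ops/backups (inferred from the restore example path) with simple file names, and that modification time reflects backup age. The function name is hypothetical. It deletes nothing, and it does not know which file is the verified weekly restore point, so review the output before removing anything.

```bash
# List backup files beyond the newest KEEP, newest first. Dry run only.
# Usage: list_prunable_backups [backup-dir] [keep-count]
list_prunable_backups() {
  local dir="${1:-./ops/backups}" keep="${2:-14}"
  # ls -t sorts by modification time, newest first;
  # tail -n +K starts printing at line K, skipping the KEEP newest files.
  ls -1t "$dir"/*.dump 2>/dev/null | tail -n +"$((keep + 1))"
}
```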

Data-loss guardrails

  • pnpm stack:down is non-destructive and keeps named volumes.
  • pnpm stack:destroy is destructive (docker compose down -v).
  • Always take a backup before running stack:destroy outside disposable local testing.
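
The backup-before-destroy rule can be enforced with a wrapper so that stack:destroy never runs after a failed backup. This is a sketch under assumptions: guarded_destroy and the BACKUP_CMD override are hypothetical names, and it must be run from the VaultCrux directory so the default script path resolves.

```bash
# Take a backup, then destroy the stack only if the backup succeeded.
# BACKUP_CMD is a hypothetical override; the default is the runbook's script.
guarded_destroy() {
  local backup_cmd="${BACKUP_CMD:-./scripts/backup-postgres.sh}"
  if ! "$backup_cmd"; then
    echo "backup failed; not running stack:destroy" >&2
    return 1
  fi
  pnpm stack:destroy
}
```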

Policy-change governance

  1. Stage the policy change in vaultcrux.credit_policy (no destructive edits).
  2. Emit a credit.policy.updated outbox entry for downstream observability.
  3. Run the replay and idempotency tests before enabling policy-dependent features.
  4. Record the change rationale and rollback condition in the PlanCrux outcome report.

Copyright 2026 CueCrux