Operations Runbooks
Bootstrap
bash
cd ../CueCrux-Shared
pnpm install
pnpm build
cd ../VaultCrux
pnpm install
pnpm --filter @vaultcrux/api db:migrate
pnpm --filter @vaultcrux/api dev
pnpm --filter @vaultcrux/worker dev

Production-like local bring-up:
bash
cd ../VaultCrux
pnpm stack:up
pnpm e2e:smoke
pnpm e2e:frontdoor:session
pnpm e2e:idempotency
pnpm e2e:replay
pnpm e2e:qdrant
pnpm e2e:canary
pnpm e2e:shield:capability
pnpm e2e:shield:trust
pnpm e2e:shield:approval-taint
pnpm e2e:shield:sandbox
pnpm e2e:shield:sampling
pnpm e2e:shield:kill-switch
pnpm e2e:shield:strict
pnpm exec tsx scripts/e2e-incident-redteam.ts

Bring up monitoring profile (Prometheus + Alertmanager + Grafana):
bash
cd ../VaultCrux
docker compose --profile monitoring up -d prometheus alertmanager grafana paid-path-canary

Validate paid-path canary targets and recent samples:
bash
cd ../VaultCrux
curl -fsS http://127.0.0.1:16390/api/v1/targets | jq '.data.activeTargets[] | select(.labels.job=="vaultcrux-paid-path-canary") | {health, lastError, scrapeUrl}'
curl -fsS http://127.0.0.1:16390/api/v1/query --data-urlencode 'query=vaultcrux_paid_path_canary_runs_total{flow="frontdoor_session"}'

pnpm e2e:shield:strict must pass before production promotion. It executes the full Shield enforcement suite under enforce-mode flags.
Run pnpm exec tsx scripts/e2e-incident-redteam.ts once per quarter to capture incident-drill and red-team checklist evidence.
Health checks
- GET /healthz: process liveness
- GET /readyz: DB + migration readiness
- GET /metrics: Prometheus metrics
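The two probe endpoints lend themselves to a small wait loop during bring-up. The following is a minimal sketch, not part of the repository: BASE_URL is an assumption (the runbook does not state the API port), and probe, wait_for_endpoint and check_service are hypothetical helper names.

```bash
#!/usr/bin/env bash
# Sketch: poll the liveness and readiness endpoints until both respond.
# BASE_URL is an assumption; the runbook does not state the API port.
BASE_URL="${BASE_URL:-http://127.0.0.1:8080}"

# probe URL -> exit 0 when the endpoint answers with a 2xx status.
probe() {
  curl -fsS -o /dev/null --max-time 5 "$1"
}

# wait_for_endpoint PATH [ATTEMPTS] [DELAY_SECONDS]
wait_for_endpoint() {
  local path="$1" attempts="${2:-30}" delay="${3:-2}" i
  for ((i = 1; i <= attempts; i++)); do
    if probe "${BASE_URL}${path}"; then
      echo "ok ${path} (attempt ${i})"
      return 0
    fi
    sleep "$delay"
  done
  echo "timeout ${path} after ${attempts} attempts" >&2
  return 1
}

# Liveness first, then readiness (DB + migrations).
check_service() {
  wait_for_endpoint /healthz "$@" && wait_for_endpoint /readyz "$@"
}
```

Calling check_service 30 2 waits up to roughly a minute per endpoint. Checking /healthz before /readyz separates "process is not running" from "process is up but migrations are pending".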
Shield flag promotion (observe -> enforce)
- Observe window (minimum 48h):
  - SHIELD_MODE=observe
  - FEATURE_UI_APP_SANDBOX=false
  - FEATURE_KILL_SWITCHES=false
  - Monitor canary + API error rates + Shield decision tables.
- Pre-enforce checklist:
  - pnpm e2e:frontdoor:session
  - pnpm e2e:smoke
  - pnpm e2e:shield:strict
  - Verify no sustained 403 spikes for user-critical routes.
- Enforce promotion:
  - set SHIELD_MODE=enforce
  - set FEATURE_UI_APP_SANDBOX=true
  - set FEATURE_KILL_SWITCHES=true
- Immediate rollback if false positives block paid path:
  - set SHIELD_MODE=observe
  - set FEATURE_UI_APP_SANDBOX=false
  - set FEATURE_KILL_SWITCHES=false
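The promotion stages above can be captured in a small helper so that the observe and enforce values are never typed by hand. This is a sketch under assumptions: shield_flags and pre_enforce_checklist are hypothetical names, and because the runbook does not say how flags reach the running services, the helper only prints KEY=value lines.

```bash
#!/usr/bin/env bash
# Sketch: print the flag values for each promotion stage listed above.
# Delivery of the values (env file, secret store, CI variables) is out of scope.
shield_flags() {
  case "$1" in
    observe)
      printf '%s\n' \
        'SHIELD_MODE=observe' \
        'FEATURE_UI_APP_SANDBOX=false' \
        'FEATURE_KILL_SWITCHES=false'
      ;;
    enforce)
      printf '%s\n' \
        'SHIELD_MODE=enforce' \
        'FEATURE_UI_APP_SANDBOX=true' \
        'FEATURE_KILL_SWITCHES=true'
      ;;
    *)
      echo "usage: shield_flags observe|enforce" >&2
      return 2
      ;;
  esac
}

# Run the three pre-enforce suites in order and stop at the first failure.
pre_enforce_checklist() {
  local suite
  for suite in e2e:frontdoor:session e2e:smoke e2e:shield:strict; do
    echo "running pnpm ${suite}"
    pnpm "$suite" || { echo "failed: ${suite}" >&2; return 1; }
  done
}
```

A typical sequence would be pre_enforce_checklist && shield_flags enforce. The immediate rollback is then shield_flags observe. The 403 check on user-critical routes remains a manual step.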
Common incidents
High queue lag:
- check vaultcrux.citation_staging pending counts
- scale worker replicas
Dead-letter growth:
- inspect payload.dead_letter_reason
- replay corrected jobs with a fresh idempotency key
Paid-path canary failures:
- inspect canary logs:
  - docker logs --tail=200 vaultcrux-paid-path-canary
- query recent counters in Prometheus:
  - increase(vaultcrux_paid_path_canary_runs_total{flow="frontdoor_session",status="failure"}[15m])
  - increase(vaultcrux_paid_path_canary_runs_total{flow="frontdoor_session",status="success"}[15m])
- validate live path manually:
  - pnpm e2e:frontdoor:session
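The two counter queries are easier to read as a single failure percentage. The sketch below assumes the Prometheus address used in the validation commands earlier in this runbook. failure_pct and prom_value are hypothetical helper names, and no alert threshold is implied.

```bash
#!/usr/bin/env bash
# Sketch: turn the two 15-minute counters above into a failure percentage.

# failure_pct FAILURES SUCCESSES -> percentage with one decimal place.
# Prints "n/a" when the canary recorded no runs in the window.
failure_pct() {
  awk -v f="$1" -v s="$2" 'BEGIN {
    total = f + s
    if (total == 0) { print "n/a"; exit }
    printf "%.1f\n", 100 * f / total
  }'
}

# prom_value QUERY -> first sample value of an instant query, or 0 if empty.
prom_value() {
  curl -fsS http://127.0.0.1:16390/api/v1/query --data-urlencode "query=$1" \
    | jq -r '.data.result[0].value[1] // "0"'
}

# Usage (requires the monitoring profile to be up):
#   base='vaultcrux_paid_path_canary_runs_total{flow="frontdoor_session"'
#   fail="$(prom_value "increase(${base},status=\"failure\"}[15m])")"
#   ok="$(prom_value "increase(${base},status=\"success\"}[15m])")"
#   failure_pct "$fail" "$ok"
```

A result of n/a means the canary produced no samples in the window, which usually points at the canary container itself rather than the paid path.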
Replay verification:
- run pnpm e2e:replay to force staging reprocessing and assert idempotent rows
- run pnpm outbox:replay for bounded outbox redispatch checks
Vector latency spikes:
- verify hnsw index health
- run pnpm vectors:check for pgvector vs qdrant consistency
- reduce query limit or disable qdrant read (FEATURE_QDRANT_READ=false) temporarily
Conversion apply backlog:
- verify FEATURE_CREDIT_CONVERSION=true and FEATURE_PADDLE_DISCOUNT_APPLY=true
- check vaultcrux.subscription_conversions for pending_apply growth
- inspect worker logs for vaultcrux-worker-conversion
Tip sink discrepancies:
- ensure sink account is @cuecrux only
- verify debit/credit pair reasons (platform_tip, platform_tip_sink)
- inspect vaultcrux.platform_tips and vaultcrux.economic_receipts
Marketplace transfer anomalies:
- inspect vaultcrux.economy_anomalies
- confirm seller anonymity in API/MCP responses
- verify bundle.purchase.cross_tenant outbox events
Rollback
- disable flags: CITATIONS_ASYNC_ENABLED=false, CREDITS_BATCH_ENABLED=false
- vector rollback: FEATURE_QDRANT_READ=false, then FEATURE_VECTOR_DUAL_WRITE=false after stability
- economy rollback:
  - FEATURE_ECONOMY_MULTIPLIER=false
  - FEATURE_FREE_TIER_ESCROW=false
  - FEATURE_CREDIT_CONVERSION=false
  - FEATURE_PADDLE_DISCOUNT_APPLY=false
  - FEATURE_PLATFORM_TIPS=false
  - FEATURE_CROSS_TENANT_BUNDLES=false
- mcp rollback:
  - FEATURE_MCP_ENABLED=false
- shield rollback (disable highest stage first):
  - FEATURE_KILL_SWITCHES=false
  - FEATURE_UI_APP_SANDBOX=false
  - FEATURE_SAMPLING_GUARDIAN=false
  - FEATURE_SHIELD_ENFORCE_APPROVALS=false
  - FEATURE_SHIELD_ENFORCE_TAINT=false
  - FEATURE_SANDBOX_RUNNER=false
  - FEATURE_TRUST_REGISTRY_ENFORCE_DIGEST=false
  - FEATURE_SHIELD_ENFORCE_CAPABILITY=false
  - FEATURE_SHIELD_ENABLED=false
- keep outbox immutable; only switch read/write execution paths
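The Shield rollback order matters, so it can help to encode it once instead of retyping it during an incident. The sketch below assumes flags are kept in a KEY=value env file, which the runbook does not confirm. The file path, set_flag and shield_rollback are hypothetical, and reloading services between stages is left out.

```bash
#!/usr/bin/env bash
# Sketch: apply the Shield rollback order to an env file.
# Assumption: flags live in a KEY=value env file passed as the first argument.

# Highest stage first, in the order listed above.
SHIELD_ROLLBACK_ORDER=(
  FEATURE_KILL_SWITCHES
  FEATURE_UI_APP_SANDBOX
  FEATURE_SAMPLING_GUARDIAN
  FEATURE_SHIELD_ENFORCE_APPROVALS
  FEATURE_SHIELD_ENFORCE_TAINT
  FEATURE_SANDBOX_RUNNER
  FEATURE_TRUST_REGISTRY_ENFORCE_DIGEST
  FEATURE_SHIELD_ENFORCE_CAPABILITY
  FEATURE_SHIELD_ENABLED
)

# set_flag FILE KEY VALUE: drop any existing KEY= line, then append KEY=VALUE.
set_flag() {
  local file="$1" key="$2" value="$3" tmp
  touch "$file"
  tmp="$(mktemp)"
  # grep -v exits non-zero when nothing is left to print; that is not an error here.
  grep -v "^${key}=" "$file" > "$tmp" || true
  printf '%s=%s\n' "$key" "$value" >> "$tmp"
  # Write back in place so the original file keeps its permissions.
  cat "$tmp" > "$file"
  rm -f "$tmp"
}

# shield_rollback FILE: disable every Shield flag, highest stage first.
shield_rollback() {
  local file="$1" flag
  for flag in "${SHIELD_ROLLBACK_ORDER[@]}"; do
    set_flag "$file" "$flag" false
    echo "disabled ${flag}"
  done
}
```

The same set_flag helper would cover the vector, economy and MCP rollbacks. Only the flag list changes.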
Backup + restore
- Daily backup target: vaultcrux postgres database.
- Retention default: keep 14 daily backups and at least one verified weekly restore point.
- Backup command:
bash
cd ../VaultCrux
./scripts/backup-postgres.sh

- Restore command (destructive to current DB contents):
bash
cd ../VaultCrux
./scripts/restore-postgres.sh ./ops/backups/<backup-file.dump>

- Post-restore verification checklist:
  - pnpm e2e:smoke
  - verify /healthz and /readyz
  - spot-check latest agent_credit_ledger and contribution_citations rows
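The 14-backup retention default can be checked with a short listing helper. This is a sketch under assumptions: dumps sit in a single directory such as ./ops/backups, end in .dump, have no whitespace in their names, and their modification time reflects backup age. prune_candidates is a hypothetical name, and the helper only lists files and never deletes them.

```bash
#!/usr/bin/env bash
# Sketch: list daily backups that fall outside the retention window.
# The verified weekly restore point is not handled here; exclude it by hand.

# prune_candidates DIR [KEEP] -> prints dumps older than the newest KEEP files.
prune_candidates() {
  local dir="$1" keep="${2:-14}"
  # ls -1t sorts newest first; tail skips the KEEP newest entries.
  ls -1t "$dir"/*.dump 2>/dev/null | tail -n +"$((keep + 1))"
}

# Usage:
#   prune_candidates ./ops/backups        # keep the default 14
#   prune_candidates ./ops/backups 30     # keep 30 instead
```

Review the printed list before deleting anything, and confirm the weekly restore point is not among the candidates.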
Data-loss guardrails
- pnpm stack:down is non-destructive and keeps named volumes.
- pnpm stack:destroy is destructive (docker compose down -v).
- Always take a backup before running stack:destroy outside disposable local testing.
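The backup-before-destroy rule can be enforced with a thin wrapper rather than left to memory. This is a minimal sketch: guarded_destroy and the BACKUP_CMD override are hypothetical, and the default path assumes the wrapper runs from the VaultCrux checkout.

```bash
#!/usr/bin/env bash
# Sketch: refuse the destructive teardown unless a backup succeeds first.
# BACKUP_CMD is a hypothetical override; the default is the runbook's backup script.
BACKUP_CMD="${BACKUP_CMD:-./scripts/backup-postgres.sh}"

guarded_destroy() {
  echo "taking backup before stack:destroy"
  if ! "$BACKUP_CMD"; then
    echo "backup failed; stack:destroy skipped" >&2
    return 1
  fi
  # Destructive: equivalent to docker compose down -v.
  pnpm stack:destroy
}
```

For disposable local testing, pnpm stack:destroy can still be run directly. The wrapper is meant for any environment whose data matters.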
Policy-change governance
- Stage policy change in vaultcrux.credit_policy (no destructive edits).
- Emit credit.policy.updated outbox entry for downstream observability.
- Run replay/idempotency tests before enabling policy-dependent features.
- Record change rationale and rollback condition in PlanCrux outcome report.

