Launch Cutover Runbook - Stella Ops

Document owner: DevOps Guild (2025-10-26)
Scope: Full-platform launch from staging to production for release 2025.09.2.

1. Roles and Communication

Role | Primary | Backup | Contact
Cutover lead | DevOps Guild (on-call engineer) | Platform Ops lead | #launch-bridge (Mattermost)
Authority stack | Authority Core guild rep | Security guild rep | #authority
Scanner / Queue | Scanner WebService guild rep | Runtime guild rep | #scanner
Storage | Mongo/MinIO operators | Backup DB admin | Pager escalation
Observability | Telemetry guild rep | SRE on-call | #telemetry
Approvals | Product owner + CTO | DevOps lead | Approval recorded in change ticket

Open a bridge call 30 minutes before T0 and post a status update to #launch-bridge every 10 minutes.

2. Timeline Overview (UTC)

Time | Activity | Owner
T-24h | Change ticket approved, prod secrets verified, offline kit build status checked (DEVOPS-OFFLINE-18-005). | DevOps lead
T-12h | Run deploy/tools/validate-profiles.sh; capture logs in ticket. | DevOps engineer
T-6h | Freeze non-launch deployments; notify guild leads. | Product owner
T-2h | Execute rehearsal in staging (Section 3) using values-stage.yaml to verify scripts. | DevOps + module reps
T-30m | Final go/no-go with guild leads; confirm monitoring dashboards green. | Cutover lead
T0 | Execute production cutover steps (Section 4). | Cutover team
T+45m | Smoke tests complete (Section 5); announce success or trigger rollback. | Cutover lead
T+4h | Post-cutover metrics review, notify stakeholders, close ticket. | DevOps + product owner
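
The offsets in the timeline can be turned into absolute UTC times once T0 is fixed. The following is a minimal sketch, assuming GNU date (the -d option); offset_time and print_timeline are hypothetical helper names, and the T0 value in the example is illustrative only.

```shell
#!/usr/bin/env bash
# Sketch: print the runbook timeline as absolute UTC times for a given T0.
# Assumes GNU date. Helper names are hypothetical, not part of the Stella Ops tooling.

# offset_time <T0 as ISO 8601 UTC> <offset in minutes, may be negative>
offset_time() {
  local t0_epoch
  # Convert to epoch seconds first so the offset is plain integer arithmetic.
  t0_epoch=$(date -u -d "$1" +%s) || return 1
  date -u -d "@$(( t0_epoch + $2 * 60 ))" +%Y-%m-%dT%H:%MZ
}

# print_timeline <T0 as ISO 8601 UTC>
print_timeline() {
  local t0=$1 label minutes
  while read -r label minutes; do
    printf '%-6s %s\n' "$label" "$(offset_time "$t0" "$minutes")"
  done <<'EOF'
T-24h -1440
T-12h -720
T-6h -360
T-2h -120
T-30m -30
T0 0
T+45m 45
T+4h 240
EOF
}

# Example (illustrative T0):
#   print_timeline 2025-10-26T12:00:00Z
```

Pasting the printed schedule into the change ticket gives every owner the same wall-clock reference.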

3. Rehearsal (Staging) Checklist

  1. Create the front-door network if it is not already present on the staging jump host: docker network create stellaops_frontdoor || true.
  2. Run deploy/tools/validate-profiles.sh and archive output.
  3. Apply staging secrets (kubectl apply -f secrets/stage/ or helm secrets upgrade), ensuring the stellaops-stage credentials align with values-stage.yaml.
  4. Run helm upgrade stellaops deploy/helm/stellaops -f deploy/helm/stellaops/values-stage.yaml against the staging cluster.
  5. Verify health endpoints: curl https://authority.stage.../healthz, curl https://scanner.stage.../healthz.
  6. Execute smoke CLI: stellaops-cli scan submit --profile staging --sbom samples/sbom/demo.json and confirm report status in UI.
  7. Document total wall time and any deviations in the rehearsal log.

The rehearsal must complete without manual intervention before proceeding to production.
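
Step 7 asks for total wall time and deviations. One way to capture both is to wrap each rehearsal command in a timing helper. This is a sketch only: timed_step and REHEARSAL_LOG are hypothetical names, not part of the Stella Ops tooling.

```shell
#!/usr/bin/env bash
# Sketch: time each rehearsal step and append one line per step to a log file.
# REHEARSAL_LOG and timed_step are hypothetical names chosen for illustration.
REHEARSAL_LOG=${REHEARSAL_LOG:-rehearsal.log}

# timed_step <label> <command> [args...]
# Runs the command, logs a UTC timestamp, the label, exit code, and elapsed
# seconds, then returns the command's own exit code.
timed_step() {
  local label=$1; shift
  local start end status
  start=$(date +%s)
  "$@"
  status=$?
  end=$(date +%s)
  printf '%s step="%s" exit=%d seconds=%d\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$label" "$status" "$(( end - start ))" \
    >> "$REHEARSAL_LOG"
  return "$status"
}

# Example:
#   timed_step validate-profiles deploy/tools/validate-profiles.sh
```

A non-zero exit code in the log marks a deviation that should be explained in the rehearsal log entry.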

4. Production Cutover Steps

4.1 Pre-flight

  • Confirm production secrets in the appropriate secret store (stellaops-prod-core, stellaops-prod-mongo, stellaops-prod-minio, stellaops-prod-notify) contain the keys referenced in values-prod.yaml.
  • Ensure the external reverse proxy network exists: docker network create stellaops_frontdoor || true on each compose host.
  • Back up current configuration and data:
    • Mongo snapshot: mongodump --uri "$MONGO_BACKUP_URI" --out /backups/launch-$(date -Iseconds).
    • MinIO bucket mirror: mc mirror --overwrite minio/stellaops minio-backup/stellaops-$(date +%Y%m%d%H%M).
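
The rollback steps in Section 6 need the exact backup timestamp, so it helps to generate the stamp once and record both targets. The sketch below assumes a single shared stamp format for both backups, which differs from the two inline date formats above; backup_targets and the manifest file name are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: derive both backup targets from one UTC stamp so rollback can find them.
# backup_targets and the manifest file are hypothetical helpers; the shared
# stamp format is an assumption, not the runbook's documented naming.

# backup_targets <stamp>: prints the Mongo dump directory and MinIO mirror target.
backup_targets() {
  printf 'mongo=/backups/launch-%s\n' "$1"
  printf 'minio=minio-backup/stellaops-%s\n' "$1"
}

# Example (not executed here):
#   stamp=$(date -u +%Y%m%dT%H%M%SZ)
#   backup_targets "$stamp" | tee "backup-manifest-$stamp.txt"
#   mongodump --uri "$MONGO_BACKUP_URI" --out "/backups/launch-$stamp"
#   mc mirror --overwrite minio/stellaops "minio-backup/stellaops-$stamp"
```

Attaching the manifest to the change ticket removes any guesswork about which snapshot to restore.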

4.2 Apply Updates (Compose)

  1. On each compose node, pull updated images for release 2025.09.2:
    docker compose --env-file prod.env -f deploy/compose/docker-compose.prod.yaml pull
    
  2. Deploy changes:
    docker compose --env-file prod.env -f deploy/compose/docker-compose.prod.yaml up -d
    
  3. Confirm containers healthy via docker compose ps and docker logs <service> --tail 50.
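
Step 3 can be scripted by polling the container health status instead of reading docker compose ps by eye. This is a sketch under assumptions: the containers define a HEALTHCHECK, and wait_healthy and container_health are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Sketch: poll a container until Docker reports it healthy.
# Assumes the image or compose file defines a healthcheck; containers without
# one report "none" and will never pass this check.

# container_health <container>: prints healthy, unhealthy, starting, or none.
container_health() {
  docker inspect \
    --format '{{if .State.Health}}{{.State.Health.Status}}{{else}}none{{end}}' "$1"
}

# wait_healthy <container> [attempts] [delay_seconds]
# Returns 0 when healthy, 1 on timeout, 2 if docker inspect itself fails.
wait_healthy() {
  local name=$1 attempts=${2:-30} delay=${3:-5} i status
  for (( i = 1; i <= attempts; i++ )); do
    status=$(container_health "$name") || return 2
    [ "$status" = healthy ] && return 0
    sleep "$delay"
  done
  echo "$name not healthy after $attempts checks (last status: $status)" >&2
  return 1
}

# Example:
#   wait_healthy stellaops-scanner-web 30 5 || docker logs stellaops-scanner-web --tail 50
```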

4.3 Apply Updates (Helm/Kubernetes)

If deploying on Kubernetes, run:

helm upgrade stellaops deploy/helm/stellaops -f deploy/helm/stellaops/values-prod.yaml --atomic --timeout 15m

Monitor the rollout with kubectl get pods -n stellaops --watch and kubectl rollout status deployment/<service> -n stellaops.
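
Recording the current Helm revision before the upgrade makes the rollback target in Section 6 unambiguous. The sketch below assumes the default table output of helm history (header line, then one row per revision in ascending order); current_revision is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch: capture the newest Helm revision number before upgrading.
# Assumes the default table output of `helm history`, where the first column
# of each row is the revision and rows are listed oldest to newest.

# current_revision <release> <namespace>
# Prints the last revision number; fails if no revision rows are present.
current_revision() {
  helm history "$1" --namespace "$2" |
    awk 'NR > 1 { rev = $1 } END { if (rev == "") exit 1; print rev }'
}

# Example (run before the helm upgrade):
#   prev=$(current_revision stellaops stellaops) && echo "rollback target: $prev"
```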

4.4 Configuration Validation

  • Verify Authority issuer metadata: curl https://authority.prod.../.well-known/openid-configuration.
  • Validate Signer DSSE endpoint: stellaops-cli signer verify --base-url https://signer.prod... --bundle samples/dsse/demo.json.
  • Check Scanner queue connectivity: docker exec stellaops-scanner-web dotnet StellaOps.Scanner.WebService.dll health queue (returns success).
  • Ensure Notify (legacy) is still accessible while the Notifier migration is pending.
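
For the Authority check, fetching the discovery document is only half the validation; the issuer field should also match the expected public URL. A minimal sketch follows, assuming a compact JSON response; issuer_from_metadata, issuer_matches, and AUTHORITY_URL are hypothetical names. A JSON parser such as jq would be more robust than the grep used here.

```shell
#!/usr/bin/env bash
# Sketch: compare the "issuer" in an OpenID discovery document with an expected value.
# Uses grep/sed for portability; this is not a full JSON parser.

# issuer_from_metadata: reads the discovery document on stdin, prints its issuer.
issuer_from_metadata() {
  grep -o '"issuer"[[:space:]]*:[[:space:]]*"[^"]*"' |
    head -n 1 |
    sed 's/.*"\([^"]*\)"$/\1/'
}

# issuer_matches <expected issuer>: discovery document on stdin.
issuer_matches() {
  local actual
  actual=$(issuer_from_metadata)
  [ -n "$actual" ] && [ "$actual" = "$1" ]
}

# Example (AUTHORITY_URL is a placeholder for the production Authority base URL):
#   curl -fsS "$AUTHORITY_URL/.well-known/openid-configuration" |
#     issuer_matches "$AUTHORITY_URL" && echo "issuer OK"
```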

5. Smoke Tests

Test | Command / Action | Expected Result
API health | curl https://scanner.prod.../healthz | HTTP 200 with "status":"Healthy"
Scan submit | stellaops-cli scan submit --profile prod --sbom samples/sbom/demo.json | Scan completes < 5 minutes; report accessible with signed DSSE
Runtime event ingest | Post sample event from Zastava observer fixture | /runtime/events responds 202 Accepted; record visible in Mongo runtime_events
Signing | stellaops-cli signer sign --bundle demo.json | Returns DSSE with matching SHA256 and signer metadata
Attestor verify | stellaops-cli attestor verify --uuid <uuid> | Verification result ok=true
Web UI | Manual login, verify dashboards render and latency within budget | UI loads under 2 seconds; policy views consistent

Log results in the change ticket with timestamps and screenshots where applicable.
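
The API health row checks two things at once: the status code and the body. The sketch below keeps that decision in one small function so the result logged in the ticket is unambiguous; healthz_ok and SCANNER_URL are hypothetical names.

```shell
#!/usr/bin/env bash
# Sketch: evaluate a /healthz response against the expected result in the smoke table.

# healthz_ok <http_code> <body>
# Succeeds only for HTTP 200 with "status":"Healthy" somewhere in the body.
healthz_ok() {
  [ "$1" = 200 ] || return 1
  printf '%s' "$2" | grep -q '"status"[[:space:]]*:[[:space:]]*"Healthy"'
}

# Example (SCANNER_URL is a placeholder for the production Scanner base URL):
#   body=$(mktemp)
#   code=$(curl -sS -o "$body" -w '%{http_code}' "$SCANNER_URL/healthz")
#   if healthz_ok "$code" "$(cat "$body")"; then
#     echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) API health: pass"
#   else
#     echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) API health: FAIL (HTTP $code)"
#   fi
```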

6. Rollback Procedure

  1. Assess failure scope; if systemic, initiate rollback immediately while preserving logs/artifacts.
  2. For Compose:
    docker compose --env-file prod.env -f deploy/compose/docker-compose.prod.yaml down
    docker compose --env-file stage.env -f deploy/compose/docker-compose.stage.yaml up -d
    
  3. For Helm:
    helm rollback stellaops <previous-release-number> --namespace stellaops
    
  4. Restore the Mongo snapshot if data inconsistency is detected: mongorestore --uri "$MONGO_BACKUP_URI" --drop /backups/launch-<timestamp>.
  5. Restore MinIO mirror if required: mc mirror minio-backup/stellaops-<timestamp> minio/stellaops.
  6. Notify stakeholders of rollback and capture root cause notes in incident ticket.
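
Because mongorestore --drop removes existing collections before restoring, it is worth confirming that the dump directory actually contains data first. This is a sketch, assuming the dump was produced by mongodump (which writes .bson files, or .bson.gz with --gzip); dump_is_usable is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch: refuse to run a destructive restore against a missing or empty dump.

# dump_is_usable <dump_dir>
# Succeeds if the directory exists and holds at least one .bson or .bson.gz file.
dump_is_usable() {
  [ -d "$1" ] || return 1
  [ -n "$(find "$1" -type f -name '*.bson*' | head -n 1)" ]
}

# Example (replace <timestamp> with the value recorded at backup time):
#   dump_is_usable "/backups/launch-<timestamp>" &&
#     mongorestore --uri "$MONGO_BACKUP_URI" --drop "/backups/launch-<timestamp>"
```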

7. Post-cutover Actions

  • Maintain heightened monitoring for 4 hours post-cutover; track latency, error rates, and queue depth.
  • Confirm audit trails: Authority tokens issued, Scanner events recorded, Attestor submissions stored.
  • Update docs/modules/devops/runbooks/launch-readiness.md if any new gaps or follow-ups are discovered.
  • Schedule retrospective within 48 hours; include DevOps, module guilds, and product owner.

8. Approval Matrix

Step | Required Approvers | Record Location
Production deployment plan | CTO + DevOps lead | Change ticket comment
Cutover start (T0) | DevOps lead + module reps | #launch-bridge summary
Post-smoke success | DevOps lead + product owner | Change ticket closure
Rollback (if invoked) | DevOps lead + CTO | Incident ticket

Retain all approvals and logs for audit. Update this runbook after each execution to record actual timings and lessons learned.

9. Rehearsal Log

Date (UTC) | What We Exercised | Outcome | Follow-up
2025-10-26 | Dry-run of compose/Helm validation via deploy/tools/validate-profiles.sh (dev/stage/prod/airgap/mirror). Network creation simulated (docker network create stellaops_frontdoor planned) and stage CLI submission reviewed. | Validation script succeeded; all profiles templated cleanly. Stage deployment apply deferred because no staging cluster is accessible from the current environment. | Schedule full stage rehearsal once staging cluster credentials are available; reuse this log section to capture timings.