Concelier Authority Audit Runbook

Last updated: 2025-10-22

This runbook helps operators verify and monitor the StellaOps Concelier ⇆ Authority integration. It focuses on the /jobs* surface, which now requires StellaOps Authority tokens, and the corresponding audit/metric signals that expose authentication and bypass activity.

1. Prerequisites

  • Authority integration is enabled in concelier.yaml (or via CONCELIER_AUTHORITY__* environment variables) with a valid clientId, secret, audience, and required scopes.
  • OTLP metrics/log exporters are configured (concelier.telemetry.*) or container stdout is shipped to your SIEM.
  • Operators have access to the Concelier job trigger endpoints via CLI or REST for smoke tests.
  • The rollout table in docs/10_CONCELIER_CLI_QUICKSTART.md has been reviewed so stakeholders align on the staged → enforced toggle timeline.

Configuration snippet

concelier:
  authority:
    enabled: true
    allowAnonymousFallback: false          # keep true only during initial rollout
    issuer: "https://authority.internal"
    audiences:
      - "api://concelier"
    requiredScopes:
      - "concelier.jobs.trigger"
      - "advisory:read"
      - "advisory:ingest"
    requiredTenants:
      - "tenant-default"
    bypassNetworks:
      - "127.0.0.1/32"
      - "::1/128"
    clientId: "concelier-jobs"
    clientSecretFile: "/run/secrets/concelier_authority_client"
    tokenClockSkewSeconds: 60
    resilience:
      enableRetries: true
      retryDelays:
        - "00:00:01"
        - "00:00:02"
        - "00:00:05"
      allowOfflineCacheFallback: true
      offlineCacheTolerance: "00:10:00"

Store secrets outside source control. Concelier reads clientSecretFile on startup; rotate by updating the mounted file and restarting the service.
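When reviewing bypassNetworks, it can help to check which source addresses a CIDR list actually covers. The following is a minimal sketch using Python's standard ipaddress module. The function name matches_bypass is hypothetical, and this is not Concelier's own matching code, only an illustration of CIDR membership using the sample values above.

```python
import ipaddress

# CIDR list copied from the bypassNetworks sample in the configuration snippet.
BYPASS_NETWORKS = ["127.0.0.1/32", "::1/128"]


def matches_bypass(remote_ip, networks=BYPASS_NETWORKS):
    """Return True when remote_ip falls inside any of the given CIDRs.

    Hypothetical helper for reviewing a bypass list; Concelier's real
    matching logic may differ (for example around forwarded headers).
    """
    address = ipaddress.ip_address(remote_ip)
    for cidr in networks:
        network = ipaddress.ip_network(cidr, strict=False)
        # Only compare addresses and networks of the same IP version.
        if address.version == network.version and address in network:
            return True
    return False


if __name__ == "__main__":
    for candidate in ("127.0.0.1", "::1", "10.1.4.7"):
        print(candidate, matches_bypass(candidate))
```

Running the check against every address that can reach /jobs* (load balancers, sidecars, maintenance hosts) shows whether the list is wider than intended.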

Resilience tuning

  • Connected sites: keep the default 1 s / 2 s / 5 s retry ladder so Concelier retries transient Authority hiccups but still surfaces outages quickly. Leave allowOfflineCacheFallback=true so cached discovery/JWKS data can bridge short Pathfinder restarts.
  • Air-gapped/Offline Kit installs: extend offlineCacheTolerance (15–30 minutes) to keep the cached metadata valid between manual synchronisations. You can also disable retries (enableRetries=false) if infrastructure teams prefer to handle exponential backoff at the network layer; Concelier will fail fast but keep deterministic logs.
  • Concelier resolves these knobs through IOptionsMonitor<StellaOpsAuthClientOptions>. Edits to concelier.yaml are applied on configuration reload; restart the container if you change environment variables or do not have file-watch reloads enabled.
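
To make the effect of enableRetries and retryDelays concrete, the sketch below models a fixed retry ladder in Python. It is an illustration under assumptions, not the StellaOps auth client implementation: call_with_retries and parse_delay are hypothetical names, and ConnectionError stands in for whatever transient failure the real client retries on.

```python
import time
from datetime import timedelta


def parse_delay(value):
    """Parse an "hh:mm:ss" string such as "00:00:05" into a timedelta."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def call_with_retries(operation, retry_delays, enable_retries=True, sleep=time.sleep):
    """Call operation(), waiting through the delay ladder after each failure.

    With enable_retries=False the first failure is raised immediately
    (fail fast). Once the ladder is exhausted the last error propagates.
    """
    delays = [parse_delay(d) for d in retry_delays] if enable_retries else []
    attempt = 0
    while True:
        try:
            return operation()
        except ConnectionError:  # stand-in for a transient Authority error
            if attempt >= len(delays):
                raise
            sleep(delays[attempt].total_seconds())
            attempt += 1
```

With the default ladder, an outage is reported after at most four attempts and roughly eight seconds of waiting (1 s + 2 s + 5 s), which is why the defaults still surface outages quickly.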

2. Key Signals

2.1 Audit log channel

Concelier emits structured audit entries via the Concelier.Authorization.Audit logger for every /jobs* request once Authority enforcement is active.

Concelier authorization audit route=/jobs/definitions status=200 subject=ops@example.com clientId=concelier-cli scopes=concelier.jobs.trigger advisory:ingest bypass=False remote=10.1.4.7

| Field | Sample value | Meaning |
| --- | --- | --- |
| route | /jobs/definitions | Endpoint that processed the request. |
| status | 200 / 401 / 409 | Final HTTP status code returned to the caller. |
| subject | ops@example.com | User or service principal subject (falls back to (anonymous) when unauthenticated). |
| clientId | concelier-cli | OAuth client ID provided by Authority ((none) if the token lacked the claim). |
| scopes | concelier.jobs.trigger advisory:ingest advisory:read | Normalised scope list extracted from token claims; (none) if the token carried none. |
| tenant | tenant-default | Tenant claim extracted from the Authority token ((none) when the token lacked it). |
| bypass | True / False | Indicates whether the request succeeded because its source IP matched a bypass CIDR. |
| remote | 10.1.4.7 | Remote IP recorded from the connection / forwarded header test hooks. |
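
If your log backend does not parse key=value pairs natively, the entry can be split into fields with a small parser. The sketch below assumes the message layout matches the sample entry above, where the scopes value contains spaces and so each value runs until the next known field name. parse_audit_line is a hypothetical helper, not part of Concelier.

```python
import re

# Field names taken from the audit field table.
FIELDS = ("route", "status", "subject", "clientId", "scopes", "tenant", "bypass", "remote")
_KEY = re.compile(r"(?:^|\s)(" + "|".join(FIELDS) + r")=")


def parse_audit_line(line):
    """Split an audit message into a dict of field name -> raw string value.

    Each value runs from its "name=" marker to the next known marker,
    so multi-word values such as the scope list stay intact.
    """
    matches = list(_KEY.finditer(line))
    entry = {}
    for index, match in enumerate(matches):
        end = matches[index + 1].start() if index + 1 < len(matches) else len(line)
        entry[match.group(1)] = line[match.end():end].strip()
    return entry


if __name__ == "__main__":
    sample = (
        "Concelier authorization audit route=/jobs/definitions status=200 "
        "subject=ops@example.com clientId=concelier-cli "
        "scopes=concelier.jobs.trigger advisory:ingest bypass=False remote=10.1.4.7"
    )
    print(parse_audit_line(sample))
```

Values are left as strings; convert status or bypass to typed values only after confirming the format your deployment actually emits.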

Use your logging backend (e.g., Loki) to index the logger name and filter for suspicious combinations:

  • status=401 AND bypass=True – bypass network accepted an unauthenticated call (should be temporary during rollout).
  • status=202 AND scopes="(none)" – a token without scopes triggered a job; tighten client configuration.
  • status=202 AND NOT contains(scopes,"advisory:ingest") – ingestion attempted without the new AOC scopes; confirm the Authority client registration matches the sample above.
  • tenant!=(tenant-default) – indicates a cross-tenant token was accepted. Ensure Concelier requiredTenants is aligned with Authority client registration.
  • Spike in clientId="(none)" – indicates upstream Authority is not issuing client_id claims or the CLI is outdated.
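
The combinations above can also be expressed as a small rule function, for example in a log-processing job. This is a sketch under assumptions: flag_suspicious is a hypothetical name, the input is a dict of raw string values keyed by the audit field names, and spike detection for clientId is left to the logging backend (the function only marks individual entries).

```python
def flag_suspicious(entry, expected_tenant="tenant-default"):
    """Return a list of findings for one audit entry (dict of string values)."""
    findings = []
    status = entry.get("status", "")
    scopes = entry.get("scopes", "(none)")
    scope_set = set() if scopes == "(none)" else set(scopes.split())

    if status == "401" and entry.get("bypass") == "True":
        findings.append("unauthenticated call from bypass network")
    if status == "202" and not scope_set:
        findings.append("job triggered without scopes")
    elif status == "202" and "advisory:ingest" not in scope_set:
        # Reported separately so the no-scope case is not counted twice.
        findings.append("job triggered without advisory:ingest")

    tenant = entry.get("tenant")
    if tenant is not None and tenant != expected_tenant:
        findings.append("unexpected tenant")
    if entry.get("clientId") == "(none)":
        findings.append("missing client_id claim")
    return findings
```

An empty list means none of the listed combinations matched; it does not mean the request was otherwise legitimate.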

2.2 Metrics

Concelier publishes counters under the OTEL meter StellaOps.Concelier.WebService.Jobs. Tags: job.kind, job.trigger, job.outcome.

| Metric name | Description | PromQL example |
| --- | --- | --- |
| web.jobs.triggered | Accepted job trigger requests. | sum by (job_kind) (rate(web_jobs_triggered_total[5m])) |
| web.jobs.trigger.conflict | Rejected triggers (already running, disabled…). | sum(rate(web_jobs_trigger_conflict_total[5m])) |
| web.jobs.trigger.failed | Server-side job failures. | sum(rate(web_jobs_trigger_failed_total[5m])) |

Prometheus/OTEL collectors typically surface counters with a _total suffix and replace dots with underscores. Adjust queries to match the metric names your pipeline actually generates.
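
As a quick aid when writing queries, the usual name translation can be sketched as follows. This assumes the common convention only (invalid characters become underscores, counters gain _total); prometheus_counter_name is a hypothetical helper, and your collector may add namespaces or unit suffixes that it does not model.

```python
import re


def prometheus_counter_name(otel_name):
    """Approximate the Prometheus name for an OTEL counter instrument."""
    # Prometheus metric names allow letters, digits, underscores and colons.
    name = re.sub(r"[^a-zA-Z0-9_:]", "_", otel_name)
    return name if name.endswith("_total") else name + "_total"


if __name__ == "__main__":
    for metric in ("web.jobs.triggered", "web.jobs.trigger.conflict", "web.jobs.trigger.failed"):
        print(metric, "->", prometheus_counter_name(metric))
```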

Correlate audit logs with the following global meter exported via Concelier.SourceDiagnostics:

  • concelier.source.http.requests_total{concelier_source="jobs-run"} – confirms that REST/manual triggers route through Authority.
  • If Grafana dashboards are deployed, extend the “Concelier Jobs” board with the above counters plus a table of recent audit log entries.

3. Alerting Guidance

  1. Unauthorized bypass attempt

    • Query: sum(rate(log_messages_total{logger="Concelier.Authorization.Audit", status="401", bypass="True"}[5m])) > 0
    • Action: verify bypassNetworks list; confirm expected maintenance windows; rotate credentials if suspicious.
  2. Missing scopes

    • Query: sum(rate(log_messages_total{logger="Concelier.Authorization.Audit", scopes="(none)", status="200"}[5m])) > 0
    • Action: audit Authority client registration; ensure requiredScopes includes concelier.jobs.trigger, advisory:ingest, and advisory:read.
  3. Trigger failure surge

    • Query: sum(rate(web_jobs_trigger_failed_total[10m])) > 0; raise at warning severity if the condition is sustained for 10 minutes.
    • Action: inspect correlated audit entries and Concelier.Telemetry traces for job execution errors.
  4. Conflict spike

    • Query: sum(rate(web_jobs_trigger_conflict_total[10m])) > 5 (tune threshold).
    • Action: downstream scheduling may be firing repetitive triggers; ensure precedence is configured properly.
  5. Authority offline

    • Watch Concelier.Authorization.Audit logs for status=503 or status=500 along with clientId="(none)". Investigate Authority availability before re-enabling anonymous fallback.

4. Rollout & Verification Procedure

  1. Pre-checks

    • Align with the rollout phases documented in docs/10_CONCELIER_CLI_QUICKSTART.md (validation → rehearsal → enforced) and record the target dates in your change request.
    • Confirm allowAnonymousFallback is false in production; keep true only during staged validation.
    • Validate Authority issuer metadata is reachable from Concelier (curl https://authority.internal/.well-known/openid-configuration from the host).
  2. Smoke test with valid token

    • Obtain a token via CLI: stella auth login --scope "concelier.jobs.trigger advisory:ingest" --scope advisory:read.
    • Trigger a read-only endpoint: curl -H "Authorization: Bearer $TOKEN" https://concelier.internal/jobs/definitions.
    • Expect HTTP 200/202 and an audit log with bypass=False, scopes=concelier.jobs.trigger advisory:ingest advisory:read, and tenant=tenant-default.
  3. Negative test without token

    • Call the same endpoint without a token. Expect HTTP 401 and an audit entry with bypass=False.
    • If the request succeeds, double-check bypassNetworks and ensure fallback is disabled.
  4. Bypass check (if applicable)

    • From an allowed maintenance IP, call /jobs/definitions without a token. Confirm the audit log shows bypass=True. Review business justification and expiry date for such entries.
  5. Metrics validation

    • Ensure web.jobs.triggered counter increments during accepted runs.
    • Exporters should show corresponding spans (concelier.job.trigger) if tracing is enabled.
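
The positive and negative smoke tests above can be scripted so they are repeatable across rollout phases. The sketch below uses only the Python standard library. fetch_status and run_smoke_checks are hypothetical names, the URL is the one used in the curl example, and the script checks HTTP status codes only; the audit log fields (bypass, scopes, tenant) still need to be verified in your logging backend.

```python
import urllib.error
import urllib.request
from typing import Optional


def fetch_status(url: str, token: Optional[str] = None, timeout: float = 10.0) -> int:
    """Return the HTTP status code for a GET, with an optional bearer token."""
    request = urllib.request.Request(url)
    if token:
        request.add_header("Authorization", "Bearer " + token)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 4xx/5xx responses arrive as exceptions; the code is what we want.
        return error.code


def run_smoke_checks(url: str, token: str, fetch=fetch_status) -> dict:
    """Run the valid-token and no-token checks and report pass/fail per check."""
    return {
        "valid_token_accepted": fetch(url, token) in (200, 202),
        "missing_token_rejected": fetch(url, None) == 401,
    }


if __name__ == "__main__":
    import os

    results = run_smoke_checks(
        "https://concelier.internal/jobs/definitions",
        os.environ.get("TOKEN", ""),
    )
    for name, passed in results.items():
        print(name, "PASS" if passed else "FAIL")
```

Run it from a host outside bypassNetworks; from a bypass address the no-token check is expected to fail, which corresponds to step 4 rather than step 3.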

5. Troubleshooting

| Symptom | Probable cause | Remediation |
| --- | --- | --- |
| Audit log shows clientId=(none) for all requests | Authority not issuing client_id claim or CLI outdated | Update StellaOps Authority configuration (StellaOpsAuthorityOptions.Token.Claims.ClientId), or upgrade the CLI token acquisition flow. |
| Requests succeed with bypass=True unexpectedly | Local network added to bypassNetworks or fallback still enabled | Remove/adjust the CIDR list, disable anonymous fallback, restart Concelier. |
| HTTP 401 with valid token | requiredScopes missing from client registration or token audience mismatch | Verify Authority client scopes (concelier.jobs.trigger) and ensure the token audience matches audiences config. |
| Metrics missing from Prometheus | Telemetry exporters disabled or filter missing OTEL meter | Set concelier.telemetry.enableMetrics=true, ensure collector includes StellaOps.Concelier.WebService.Jobs meter. |
| Sudden spike in web.jobs.trigger.failed | Downstream job failure or Authority timeout mid-request | Inspect Concelier job logs, re-run with tracing enabled, validate Authority latency. |
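
For the "HTTP 401 with valid token" case, inspecting the token's claims often shows the mismatch directly. The sketch below assumes the Authority token is a JWT that carries the audience in an aud claim and the scopes in a space-separated scope claim. Those claim names are assumptions to verify against your Authority configuration, and decode_jwt_claims and check_claims are hypothetical helpers. The payload is decoded without signature verification, so use this for diagnostics only.

```python
import base64
import json


def decode_jwt_claims(token):
    """Decode the payload of a JWT WITHOUT verifying it (diagnostics only)."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected a three-part JWT")
    payload = parts[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload))


def check_claims(
    claims,
    audience="api://concelier",
    required_scopes=("concelier.jobs.trigger", "advisory:read", "advisory:ingest"),
):
    """Compare claims with the sample audiences/requiredScopes configuration."""
    problems = []
    aud = claims.get("aud", [])
    audiences = [aud] if isinstance(aud, str) else list(aud)
    if audience not in audiences:
        problems.append("audience mismatch: " + ", ".join(audiences))
    scope = claims.get("scope", "")
    granted = set(scope.split()) if isinstance(scope, str) else set(scope)
    missing = [s for s in required_scopes if s not in granted]
    if missing:
        problems.append("missing scopes: " + ", ".join(missing))
    return problems
```

An empty result means the audience and scopes match the sample configuration; a remaining 401 then points at other checks such as issuer, tenant, or clock skew.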

6. References

  • docs/21_INSTALL_GUIDE.md – Authority configuration quick start.
  • docs/17_SECURITY_HARDENING_GUIDE.md – Security guardrails and enforcement deadlines.
  • docs/modules/authority/operations/monitoring.md – Authority-side monitoring and alerting playbook.
  • StellaOps.Concelier.WebService/Filters/JobAuthorizationAuditFilter.cs – source of audit log fields.