Console Observability

Audience: Observability Guild, Console Guild, SRE/operators.
Scope: Metrics, logs, traces, dashboards, alerting, feature flags, and offline workflows for the StellaOps Console (Sprint 23).
Prerequisites: Console deployed with metrics enabled (CONSOLE_METRICS_ENABLED=true) and OTLP exporters configured (OTEL_EXPORTER_OTLP_*).


1 · Instrumentation Overview

  • Telemetry stack: OpenTelemetry Web SDK (browser) + Console telemetry bridge → OTLP collector (Tempo/Prometheus/Loki). Server-side endpoints expose /metrics (Prometheus) and /health/*.
  • Sampling: Front-end spans sample at 5 % by default (OTEL_TRACES_SAMPLER=parentbased_traceidratio). Metrics are un-sampled; log sampling is handled per category (§3).
  • Correlation IDs: Every API call carries x-stellaops-correlation-id; structured UI events mirror that value so operators can follow a request across gateway, backend, and UI.
  • Scope gating: Operators need the ui.telemetry scope to view live charts in the Admin workspace; the scope also controls access to /console/telemetry SSE streams.

2 · Metrics

2.1 Experience & Navigation

MetricTypeLabelsNotes
ui_route_render_secondsHistogramroute, tenant, device (desktop,tablet)Time between route activation and first interactive paint. Target P95 ≤ 1.5 s (cached).
ui_request_duration_secondsHistogramservice, method, status, tenantGateway proxy timing for backend calls performed by the console. Alerts when backend latency degrades.
ui_filter_apply_totalCounterroute, filter, tenantIncrements when a global filter or context chip is applied. Used to track adoption of saved views.
ui_tenant_switch_totalCounterfromTenant, toTenant, trigger (picker, shortcut, link)Emitted after a successful tenant switch; correlates with Authority ui.tenant.switch logs.
ui_offline_banner_secondsHistogramreason (authority, manifest, gateway), tenantDuration of offline banner visibility; integrate with air-gap SLAs.

2.2 Security & Session

MetricTypeLabelsNotes
ui_dpop_failure_totalCounterendpoint, reason (nonce, jkt, clockSkew)Raised when DPoP validation fails; pair with Authority audit trail.
ui_fresh_auth_prompt_totalCounteraction (token.revoke, policy.activate, client.create), tenantCounts fresh-auth modals; backlog above baseline indicates workflow friction.
ui_fresh_auth_failure_totalCounteraction, reason (timeout,cancelled,auth_error)Optional metric (set CONSOLE_FRESH_AUTH_METRICS=true when feature flag lands).

2.3 Downloads & Offline Kit

MetricTypeLabelsNotes
ui_download_manifest_refresh_secondsHistogramtenant, channel (edge,stable,airgap)Time to fetch and verify downloads manifest. Target < 3 s.
ui_download_export_queue_depthGaugetenant, artifactType (sbom,policy,attestation,console)Mirrors /console/downloads queue depth; triggers when offline bundles lag.
ui_download_command_copied_totalCountertenant, artifactTypeIncrements when users copy CLI commands from the UI. Useful to observe CLI parity adoption.

2.4 Telemetry Emission & Errors

MetricTypeLabelsNotes
ui_telemetry_batch_failures_totalCountertransport (otlp-http,otlp-grpc), reasonEmitted by OTLP bridge when batches fail. Enable via CONSOLE_METRICS_VERBOSE=true.
ui_telemetry_queue_depthGaugepriority (normal,high), tenantBrowser-side buffer depth; monitor for spikes under degraded collectors.

Scraping tips:

  • Enable /metrics via CONSOLE_METRICS_ENABLED=true.
  • Set OTEL_EXPORTER_OTLP_ENDPOINT=https://otel.collector:4318 and relevant headers (OTEL_EXPORTER_OTLP_HEADERS=authorization=Bearer <token>).
  • For air-gapped sites, point the exporter to the Offline Kit collector (localhost:4318) and forward the metrics snapshot using stella offline bundle metrics.

3 · Logs

  • Format: JSON via Console log bridge; emitted to stdout and optional OTLP log exporter. Core fields: timestamp, level, action, route, tenant, subject, correlationId, dpop.jkt, device, offlineMode.
  • Categories:
    • ui.action – general user interactions (route changes, command palette, filter updates). Sampled 50 % by default; override with feature flag telemetry.logVerbose.
    • ui.tenant.switch – always logged; includes fromTenant, toTenant, tokenId, and Authority audit correlation.
    • ui.download.commandCopied – download commands copied; includes artifactId, digest, manifestVersion.
    • ui.security.anomaly – DPoP mismatches, tenant header errors, CSP violations (level = Warning).
    • ui.telemetry.failure – OTLP export errors; include httpStatus, batchSize, retryCount.
  • PII handling: Full emails are scrubbed; only hashed values (user:<sha256>) appear unless ui.admin + fresh-auth were granted for the action (still redacted in logs).
  • Retention: Recommended 14 days for connected sites, 30 days for sealed/air-gap audits. Ship logs to Loki/Elastic with ingest label service="stellaops-web-ui".

4 · Traces

  • Span names & attributes:
    • ui.route.transition – wraps route navigation; attributes: route, tenant, renderMillis, prefetchHit.
    • ui.api.fetch – HTTP fetch to backend; attributes: service, endpoint, status, networkTime.
    • ui.sse.stream – Server-sent event subscriptions (status ticker, runs); attributes: channel, connectedMillis, reconnects.
    • ui.telemetry.batch – Browser OTLP flush; attributes: batchSize, success, retryCount.
    • ui.policy.action – Policy workspace actions (simulate, approve, activate) per docs/ui/policy-editor.md.
  • Propagation: Spans use W3C traceparent; gateway echoes header to backend APIs so traces stitch across UI → gateway → service.
  • Sampling controls: OTEL_TRACES_SAMPLER_ARG (ratio) and feature flag telemetry.forceSampling (sets to 100 % for incident debugging).
  • Viewing traces: Grafana Tempo or Jaeger via collector. Filter by service.name = stellaops-console. For cross-service debugging, filter on correlationId and tenant.

5 · Dashboards

5.1 Experience Overview

Panels:

  • Route render histogram (P50/P90/P99) by route.
  • Backend call latency stacked by service (ui_request_duration_seconds).
  • Offline banner duration trend (ui_offline_banner_seconds).
  • Tenant switch volume vs failure rate (overlay ui_dpop_failure_total).
  • Command palette usage (ui_filter_apply_total + ui.action log counts).

5.2 Downloads & Offline Kit

  • Manifest refresh time chart (per channel).
  • Export queue depth gauge with alert thresholds.
  • CLI command adoption (bar chart per artifact type, using ui_download_command_copied_total).
  • Offline parity banner occurrences (downloads.offlineParity flag from API → derived metric).
  • Last Offline Kit import timestamp (join with Downloads API metadata).

5.3 Security & Session

  • Fresh-auth prompt counts vs success/fail ratios.
  • DPoP failure stacked by reason.
  • Tenant mismatch warnings (from ui.security.anomaly logs).
  • Scope usage heatmap (derived from Authority audit events + UI logs).
  • CSP violation counts (browser securitypolicyviolation listener forwarded to logs).

Capture screenshots for Grafana once dashboards stabilise (docs/assets/ui/observability/*.png). Replace placeholders before releasing the doc.


6 · Alerting

AlertConditionSuggested Action
ConsoleLatencyHighui_route_render_seconds_bucket{le="1.5"} drops below 0.95 for 3 intervalsInspect route splits, check backend latencies, review CDN cache.
BackendLatencyHighui_request_duration_seconds_sum / ui_request_duration_seconds_count > 1 s for any serviceCorrelate with gateway/service dashboards; escalate to owning guild.
TenantSwitchFailuresIncrease in ui_dpop_failure_total or ui.security.anomaly (tenant mismatch) > 3/minValidate Authority issuer, check clock skew, confirm tenant config.
FreshAuthLoopui_fresh_auth_prompt_total spikes with matching ui_fresh_auth_failure_totalReview Authority /fresh-auth endpoint, session timeout config, UX regressions.
OfflineBannerLongui_offline_banner_seconds P95 > 120 sInvestigate Authority/gateway availability; verify Offline Kit freshness.
DownloadsBacklogui_download_export_queue_depth > 5 for 10 min OR queue age > alert thresholdPing Downloads service, ensure manifest pipeline (DOWNLOADS-CONSOLE-23-001) is healthy.
TelemetryExportErrorsui_telemetry_batch_failures_total > 0 for ≥5 minCheck collector health, credentials, or TLS trust.

Integrate alerts with Notifier (ui.alerts) or existing Ops channels. Tag incidents with component=console for correlation.


7 · Feature Flags & Configuration

Flag / Env VarPurposeDefault
CONSOLE_FEATURE_FLAGSEnables UI modules (runs, downloads, policies, telemetry). Telemetry panel requires telemetry.runs,downloads,policies
CONSOLE_METRICS_ENABLEDExposes /metrics for Prometheus scrape.true
CONSOLE_METRICS_VERBOSEEmits additional batching metrics (ui_telemetry_*).false
CONSOLE_LOG_LEVELMinimum log level (Information, Debug). Use Debug for incident sampling.Information
CONSOLE_METRICS_SAMPLING (planned)Controls front-end span sampling ratio. Document once released.0.05
OTEL_EXPORTER_OTLP_ENDPOINTCollector URL; supports HTTPS.unset
OTEL_EXPORTER_OTLP_HEADERSComma-separated headers (auth).unset
OTEL_EXPORTER_OTLP_INSECUREAllow HTTP (dev only).false
OTEL_SERVICE_NAMEService tag for traces/logs. Set to stellaops-console.auto
CONSOLE_TELEMETRY_SSE_ENABLEDEnables /console/telemetry SSE feed for dashboards.true

Feature flag changes should be tracked in release notes and mirrored in /docs/ui/navigation.md (shortcuts may change when modules toggle).


8 · Offline / Air-Gapped Workflow

  • Mirror the console image and telemetry collector as part of the Offline Kit (see /docs/install/docker.md §4).
  • Scrape metrics locally via curl -k https://console.local/metrics > metrics.prom; archive alongside logs for audits.
  • Use stella offline kit import to keep the downloads manifest in sync; dashboards display staleness using ui_download_manifest_refresh_seconds.
  • When collectors are unavailable, console queues OTLP batches (up to 5 min) and exposes backlog through ui_telemetry_queue_depth; export queue metrics to prove no data loss.
  • After reconnecting, run stella console status --telemetry (CLI parity pending; see DOCS-CONSOLE-23-014) or verify ui_telemetry_batch_failures_total resets to zero.
  • Retain telemetry bundles for 30 days per compliance guidelines; include Grafana JSON exports in audit packages.

9 · Compliance Checklist

  • [ ] /metrics scraped in staging & production; dashboards display ui_route_render_seconds, ui_request_duration_seconds, and downloads metrics.
  • [ ] OTLP traces/logs confirmed end-to-end (collector, Tempo/Loki).
  • [ ] Alert rules from §6 implemented in monitoring stack with runbooks linked.
  • [ ] Feature flags documented and change-controlled; telemetry disabled only with approval.
  • [ ] DPoP/fresh-auth anomalies correlated with Authority audit logs during drill.
  • [ ] Offline capture workflow exercised; evidence stored in audit vault.
  • [ ] Screenshots of Grafana dashboards committed once they stabilise (update references).
  • [ ] Cross-links verified (docs/deploy/console.md, docs/security/console-security.md, docs/ui/downloads.md, docs/ui/console-overview.md).

10 · References

  • /docs/deploy/console.md – Metrics endpoint, OTLP config, health checks.
  • /docs/security/console-security.md – Security metrics & alert hints.
  • /docs/ui/console-overview.md – Telemetry primitives and performance budgets.
  • /docs/ui/downloads.md – Downloads metrics and parity workflow.
  • /docs/observability/observability.md – Platform-wide practices.
  • /ops/telemetry-collector.md & /ops/telemetry-storage.md – Collector deployment.
  • /docs/install/docker.md – Compose/Helm environment variables.

Last updated: 2025-10-28 (Sprint 23).