Source & Job Orchestrator architecture
Based on Epic 9 – Source & Job Orchestrator Dashboard; this section outlines components, job lifecycle, rate-limit governance, and observability.
1) Topology
- Orchestrator API (
StellaOps.Orchestrator). Minimal API providing job state, throttling controls, replay endpoints, and dashboard data. Authenticated via Authority scopes (orchestrator:*). - Job ledger (Mongo). Collections
jobs,job_history,sources,quotas,throttles,incidents. Append-only history ensures auditability. - Queue abstraction. Supports Mongo queue, Redis Streams, or NATS JetStream (pluggable). Each job carries lease metadata and retry policy.
- Dashboard feeds. SSE/GraphQL endpoints supply Console UI with job timelines, throughput, error distributions, and rate-limit status.
2) Job lifecycle
- Enqueue. Producer services (Concelier, Excititor, Scheduler, Export Center, Policy Engine) submit
JobRequestrecords containingjobType,tenant,priority,payloadDigest,dependencies. - Scheduling. Orchestrator applies quotas and rate limits per
{tenant, jobType}. Jobs exceeding limits are staged in pending queue with next eligible timestamp. - Leasing (Task Runner bridge). Workers poll
LeaseJobendpoint; Orchestrator returns job withleaseId,leaseUntil,idempotencyKey, and instrumentation tokens. Lease renewal required for long-running tasks; leases carry retry hints and provenance (tenant,project,correlationId,taskRunnerId). - Completion. Worker reports status (
succeeded,failed,canceled,timed_out). On success the job is archived; on failure Orchestrator applies retry policy (exponential backoff, max attempts). Incidents escalate to Ops if thresholds exceeded. - Replay. Operators trigger
POST /jobs/{id}/replaywhich clones job payload, setsreplayOfpointer, and requeues with high priority while preserving determinism metadata.
Pack-run lifecycle (phase III)
- Register
pack-runjob type with task runner hints (artifacts, log channel, heartbeat cadence). - Logs/Artifacts: SSE/WS stream keyed by
packRunId+tenant/project; artifacts published with content digests and URI metadata. - Events: notifier payloads include envelope provenance (tenant, project, correlationId, idempotencyKey) pending ORCH-SVC-37-101 final spec.
3) Rate-limit & quota governance
- Quotas defined per tenant/profile (
maxActive,maxPerHour,burst). Stored inquotasand enforced before leasing. - Dynamic throttles allow ops to pause specific sources (
pauseSource,resumeSource) or reduce concurrency. - Circuit breakers automatically pause job types when failure rate > configured threshold; incidents generated via Notify and Observability stack.
- Control plane quota updates require Authority scope
orch:quota(issued viaOrch.Adminrole). Historical rebuilds/backfills additionally requireorch:backfilland must supplybackfill_reasonandbackfill_ticketalongside the operator metadata. Authority persists all four fields (quota_reason,quota_ticket,backfill_reason,backfill_ticket) for audit replay.
4) APIs
GET /api/jobs?status=— list jobs with filters (tenant, jobType, status, time window).GET /api/jobs/{id}— job detail (payload digest, attempts, worker, lease history, metrics).POST /api/jobs/{id}/cancel— cancel running/pending job with audit reason.POST /api/jobs/{id}/replay— schedule replay.POST /api/limits/throttle— apply throttle (requires elevated scope).GET /api/dashboard/metrics— aggregated metrics for Console dashboards.- Event envelope draft (
docs/modules/orchestrator/event-envelope.md) defines notifier/webhook/SSE payloads with idempotency keys, provenance, and task runner metadata for job/pack-run events.
All responses include deterministic timestamps, job digests, and DSSE signature fields for offline reconciliation.
5) Observability
- Metrics:
job_queue_depth{jobType,tenant},job_latency_seconds,job_failures_total,job_retry_total,lease_extensions_total. - Task Runner bridge adds
pack_run_logs_stream_lag_seconds,pack_run_heartbeats_total,pack_run_artifacts_total. - Logs: structured with
jobId,jobType,tenant,workerId,leaseId,status. Incident logs flagged for Ops. - Traces: spans covering
enqueue,schedule,lease,worker_execute,complete. Trace IDs propagate to worker spans for end-to-end correlation.
6) Offline support
- Orchestrator exports audit bundles:
jobs.jsonl,history.jsonl,throttles.jsonl,manifest.json,signatures/. Used for offline investigations and compliance. - Replay manifests contain job digests and success/failure notes for deterministic proof.
7) Operational considerations
- HA deployment with multiple API instances; queue storage determines redundancy strategy.
- Support for
maintenancemode halting leases while allowing status inspection. - Runbook includes procedures for expanding quotas, blacklisting misbehaving tenants, and recovering stuck jobs (clearing leases, applying pause/resume).