Architecture
┌─────────────────────────────────────────────────────────────┐
│  Portals (Compose Multiplatform + Next.js companion)        │
│  /compute/composeApp + /compute/web                         │
└────────────────┬────────────────────────────────────────────┘
                 │ uses
                 ▼
┌─────────────────────────────────────────────────────────────┐
│  :compute SDK (KMP) - /sdk/compute                          │
│  feature/{job,node,execution,template,credit,schedule}      │
│  model / event / network / repository / handler / viewModel │
└────────────────┬────────────────────────────────────────────┘
                 │ HTTPS (REST + SSE)
                 ▼
┌─────────────────────────────────────────────────────────────┐
│  Control plane - Ktor server (/server/feature/compute)      │
│  services: Job, Node, NodeAdmin, Scheduler, Log (SSE),      │
│            Template, Schedule, Credit, Retry                │
│  repositories: MongoDB (compute_jobs, compute_nodes,        │
│                compute_executions, compute_templates,       │
│                compute_schedules, compute_credit_*)         │
│  scheduled tasks (runs inside Ktor JobScheduler):           │
│    compute-dispatch       5s   queue → ASSIGNED             │
│    compute-timeout        30s  RUNNING past deadline        │
│    compute-stale-nodes    30s  heartbeat gap → OFFLINE      │
│    compute-schedules      60s  cron due → submit job        │
│    compute-retry          30s  FAILED → QUEUED (auto)       │
│    compute-template-seed  24h  seed builtin templates       │
└────────────────┬────────────────────────────────────────────┘
                 │ agent protocol (REST, X-API-Key auth)
                 ▼
┌─────────────────────────────────────────────────────────────┐
│  Agent binary - /compute-agent → tanvrit-agent              │
│  register → heartbeat → poll → execute → submitResult       │
│  executors: Shell, Docker, GithubAction                     │
│  detectors: GPU (CUDA / ROCm / Metal), Docker, Android SDK  │
└─────────────────────────────────────────────────────────────┘
Data model
MongoDB collections
| Collection | Primary key | Purpose |
|--- |--- |--- |
| `compute_jobs` | `job_id` | Every submission; lifecycle QUEUED→ASSIGNED→RUNNING→SUCCESS/FAILED/TIMEOUT/CANCELLED |
| `compute_nodes` | `node_id` | Registered agent fleet. Label-matched. BCrypt-hashed API keys. |
| `compute_executions` | `id` | One row per attempt of a job. Captures stdout/stderr/CPU/MEM per try. |
| `compute_templates` | `template_id` | Fork-and-run job templates (builtin + user-owned) |
| `compute_schedules` | `schedule_id` | Cron-triggered recurring jobs |
| `compute_credit_balances` | `business_id` + `user_id` | Free-tier + paid balance. $30/month auto-granted. |
| `compute_credit_ledger` | `entry_id` | Append-only credit ledger |
| `compute_job_logs` | `_id` | (optional) persisted log lines for offline replay |

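
The table describes `compute_nodes` as label-matched but does not spell out the matching rule. The sketch below shows one plausible reading, in which a node is eligible when it is online and carries every label the job asks for. The types `Node` and `QueuedJob`, the `requiredLabels` field, and the first-match policy are all assumptions made for illustration, not the server's actual model.

```kotlin
// Hypothetical, simplified shapes; the real documents in compute_nodes and
// compute_jobs are not shown in this document.
data class Node(val nodeId: String, val labels: Set<String>, val online: Boolean)
data class QueuedJob(val jobId: String, val requiredLabels: Set<String>)

// Assumed rule: the first online node whose labels are a superset of the
// job's required labels wins. Returns null when no node qualifies, in which
// case the job would simply stay QUEUED until a later dispatch pass.
fun pickNode(job: QueuedJob, nodes: List<Node>): Node? =
    nodes.firstOrNull { it.online && it.labels.containsAll(job.requiredLabels) }

fun main() {
    val fleet = listOf(
        Node("node-a", setOf("linux"), online = true),
        Node("node-b", setOf("linux", "gpu"), online = true),
    )
    println(pickNode(QueuedJob("job-1", setOf("gpu")), fleet)?.nodeId)
    println(pickNode(QueuedJob("job-2", setOf("android")), fleet)?.nodeId)
}
```

A real dispatcher would likely also weigh node load or capacity; that is left out here because the document does not describe it.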
Job lifecycle
submit ┌────────┐      dispatch      ┌──────────┐      poll      ┌─────────┐
─────► │ QUEUED │ ─────────────────► │ ASSIGNED │ ─────────────► │ RUNNING │
       └────────┘                    └──────────┘                └────┬────┘
                                                                      │ submitResult
                                                                      ▼
                                                          ┌─────────┐   ┌─────────┐
                                                          │ SUCCESS │   │ FAILED  │
                                                          └─────────┘   └────┬────┘
                                                                             │ ComputeRetryService
                                                                             │ (if retryCount < maxRetries)
                                                                             ▼
                                                                         ┌────────┐
                                                                         │ QUEUED │
                                                                         └────────┘
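
The lifecycle above can be expressed as a small transition table. This is a minimal sketch, not the server's implementation: the `Job` shape, `canTransition`, `shouldRetry`, and `requeue` are hypothetical names, and incrementing `retryCount` at requeue time is an assumption. The RUNNING → TIMEOUT edge is inferred from the `compute-timeout` task; CANCELLED is omitted because the document does not say which states it can be reached from.

```kotlin
enum class JobStatus { QUEUED, ASSIGNED, RUNNING, SUCCESS, FAILED, TIMEOUT, CANCELLED }

// Edges taken from the diagram, plus RUNNING -> TIMEOUT (inferred from the
// compute-timeout task). CANCELLED edges are intentionally left out.
val allowedTransitions: Map<JobStatus, Set<JobStatus>> = mapOf(
    JobStatus.QUEUED to setOf(JobStatus.ASSIGNED),
    JobStatus.ASSIGNED to setOf(JobStatus.RUNNING),
    JobStatus.RUNNING to setOf(JobStatus.SUCCESS, JobStatus.FAILED, JobStatus.TIMEOUT),
    JobStatus.FAILED to setOf(JobStatus.QUEUED),
)

fun canTransition(from: JobStatus, to: JobStatus): Boolean =
    to in allowedTransitions[from].orEmpty()

// Hypothetical job shape holding only the fields the retry rule needs.
data class Job(val id: String, val status: JobStatus, val retryCount: Int, val maxRetries: Int)

// The retry condition shown in the diagram: FAILED and retryCount < maxRetries.
fun shouldRetry(job: Job): Boolean =
    job.status == JobStatus.FAILED && job.retryCount < job.maxRetries

// Assumption: the retry counter is bumped when the job is put back in the queue.
fun requeue(job: Job): Job {
    require(shouldRetry(job)) { "job ${job.id} is not eligible for retry" }
    return job.copy(status = JobStatus.QUEUED, retryCount = job.retryCount + 1)
}

fun main() {
    val failed = Job("job-1", JobStatus.FAILED, retryCount = 0, maxRetries = 2)
    println(requeue(failed))
    println(canTransition(JobStatus.SUCCESS, JobStatus.QUEUED))
}
```

Keeping the edges in one map makes it easy to reject an illegal status update before it is written to `compute_jobs`.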
Auth
- User endpoints (`/api/compute/jobs/*`, `/api/compute/templates/*`, etc.): JWT via the `auth-jwt` Ktor plugin. The principal exposes `userId` and `businessId` claims.
- Agent endpoints (`/api/compute/nodes/{register,heartbeat,poll,result,logs}`): `X-API-Key` header. Keys are BCrypt-hashed in the node record and validated on every call.
- Rotating keys: `POST /api/compute/nodes/rotate-key` issues a new raw key and updates the hash. Old keys are invalidated immediately.
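
Old keys stop working after rotation because only one hash is stored per node and rotation overwrites it. The sketch below illustrates that behaviour under stated assumptions: `KeyHasher`, `Sha256DemoHasher`, and `NodeKeyStore` are hypothetical names, the store is an in-memory map rather than the `compute_nodes` collection, and SHA-256 stands in for BCrypt only so the example runs on the JDK alone. A real deployment would plug a BCrypt implementation in behind the same interface.

```kotlin
import java.security.MessageDigest
import java.security.SecureRandom
import java.util.Base64

// Hypothetical abstraction over the hashing scheme.
interface KeyHasher {
    fun hash(rawKey: String): String
    fun verify(rawKey: String, storedHash: String): Boolean
}

// Stand-in for BCrypt so the sketch has no external dependencies.
// Unsalted SHA-256 is NOT what the document describes; do not use it for real keys.
object Sha256DemoHasher : KeyHasher {
    override fun hash(rawKey: String): String =
        Base64.getEncoder().encodeToString(
            MessageDigest.getInstance("SHA-256").digest(rawKey.toByteArray())
        )

    override fun verify(rawKey: String, storedHash: String): Boolean =
        MessageDigest.isEqual(hash(rawKey).toByteArray(), storedHash.toByteArray())
}

class NodeKeyStore(private val hasher: KeyHasher) {
    private val hashes = mutableMapOf<String, String>() // nodeId -> hash of current key
    private val random = SecureRandom()

    // Issues a new raw key and overwrites the stored hash, so the previous
    // key no longer validates. The raw key is returned once and never stored.
    fun rotateKey(nodeId: String): String {
        val bytes = ByteArray(32).also { random.nextBytes(it) }
        val rawKey = Base64.getUrlEncoder().withoutPadding().encodeToString(bytes)
        hashes[nodeId] = hasher.hash(rawKey)
        return rawKey
    }

    // Called for every agent request carrying an X-API-Key header.
    fun validate(nodeId: String, presentedKey: String): Boolean =
        hashes[nodeId]?.let { hasher.verify(presentedKey, it) } ?: false
}

fun main() {
    val store = NodeKeyStore(Sha256DemoHasher)
    val first = store.rotateKey("node-1")
    val second = store.rotateKey("node-1")
    println(store.validate("node-1", first))
    println(store.validate("node-1", second))
}
```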
Log streaming
The log path is decoupled from persistence so real-time tailing stays fast:
- Agent calls `POST /api/compute/nodes/logs` with a batch of lines.
- `ComputeNodeServiceImpl.pushLogs` forwards the batch into `ComputeLogServiceImpl.publish(jobId, lines)`.
- `ComputeLogServiceImpl` holds a `MutableSharedFlow<String>` per job (plus a 500-line tail buffer for late subscribers).
- The SSE endpoint `GET /api/compute/jobs/{jobId}/logs/stream` writes recent tail + new lines as they arrive. Uses DROP_OLDEST on overflow so slow clients never back-pressure the publish path.
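
The publish path described above can be approximated with `kotlinx.coroutines` alone. This is a sketch under assumptions, not the actual `ComputeLogServiceImpl`: the class name `InMemoryLogHub` and its methods are hypothetical, and it uses the flow's `replay` cache as the 500-line tail, whereas the real service may keep a separate buffer. With `BufferOverflow.DROP_OLDEST`, `tryEmit` never suspends, which is what keeps a slow subscriber from stalling the agent's log upload.

```kotlin
import kotlinx.coroutines.channels.BufferOverflow
import kotlinx.coroutines.flow.MutableSharedFlow
import kotlinx.coroutines.flow.SharedFlow
import java.util.concurrent.ConcurrentHashMap

// Hypothetical in-memory hub: one shared flow per job id.
class InMemoryLogHub(private val tailSize: Int = 500) {
    private val flows = ConcurrentHashMap<String, MutableSharedFlow<String>>()

    private fun flowFor(jobId: String): MutableSharedFlow<String> =
        flows.computeIfAbsent(jobId) {
            // replay doubles as the tail for late subscribers (an assumption);
            // DROP_OLDEST discards the oldest line instead of suspending the emitter.
            MutableSharedFlow<String>(
                replay = tailSize,
                onBufferOverflow = BufferOverflow.DROP_OLDEST,
            )
        }

    // Non-suspending publish: tryEmit always succeeds with DROP_OLDEST.
    fun publish(jobId: String, lines: List<String>) {
        val flow = flowFor(jobId)
        lines.forEach { flow.tryEmit(it) }
    }

    // A new collector first receives the retained tail, then live lines.
    fun stream(jobId: String): SharedFlow<String> = flowFor(jobId)

    // Snapshot of the retained tail, useful for inspection and tests.
    fun tail(jobId: String): List<String> = flowFor(jobId).replayCache
}

fun main() {
    val hub = InMemoryLogHub()
    hub.publish("job-1", (1..600).map { "line $it" })
    println(hub.tail("job-1").size)
}
```

An SSE handler would collect `stream(jobId)` and write each line as an event. This sketch never removes a job's flow; a real service would need to clean up once the job reaches a terminal state.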
For a multi-instance HA deployment, the in-memory flow is swapped for Redis pub/sub by binding a drop-in replacement of `ComputeLogService` in `ServerModule.kt`.