Tanvrit Compute

Architecture

┌─────────────────────────────────────────────────────────────┐
│  Portals (Compose Multiplatform + Next.js companion)        │
│  /compute/composeApp + /compute/web                          │
└────────────────┬────────────────────────────────────────────┘
                 │ uses
                 ▼
┌─────────────────────────────────────────────────────────────┐
│  :compute SDK (KMP) — /sdk/compute                           │
│  feature/{job,node,execution,template,credit,schedule}       │
│  model / event / network / repository / handler / viewModel  │
└────────────────┬────────────────────────────────────────────┘
                 │ HTTPS (REST + SSE)
                 ▼
┌─────────────────────────────────────────────────────────────┐
│  Control plane — Ktor server (/server/feature/compute)       │
│  services: Job, Node, NodeAdmin, Scheduler, Log (SSE),       │
│            Template, Schedule, Credit, Retry                 │
│  repositories: MongoDB (compute_jobs, compute_nodes,         │
│                compute_executions, compute_templates,        │
│                compute_schedules, compute_credit_*)          │
│  scheduled tasks (run inside Ktor JobScheduler):             │
│    compute-dispatch         5s    queue → ASSIGNED           │
│    compute-timeout          30s   RUNNING past deadline      │
│    compute-stale-nodes      30s   heartbeat gap → OFFLINE    │
│    compute-schedules        60s   cron due → submit job      │
│    compute-retry            30s   FAILED → QUEUED (auto)     │
│    compute-template-seed    24h   seed builtin templates     │
└────────────────┬────────────────────────────────────────────┘
                 │ agent protocol (REST, X-API-Key auth)
                 ▼
┌─────────────────────────────────────────────────────────────┐
│  Agent binary — /compute-agent → tanvrit-agent               │
│  register → heartbeat → poll → execute → submitResult        │
│  executors: Shell, Docker, GithubAction                      │
│  detectors: GPU (CUDA / ROCm / Metal), Docker, Android SDK   │
└─────────────────────────────────────────────────────────────┘
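
The agent's register → heartbeat → poll → execute → submitResult cycle can be sketched roughly as below. This is a minimal illustration, not the tanvrit-agent source: `ControlPlane`, `Executor`, `AgentJob`, `AgentResult`, and `FakeControlPlane` are hypothetical names, the REST calls are replaced by an in-memory fake, and sending one heartbeat per poll is an assumption.

```kotlin
// Hypothetical payloads; the real agent protocol bodies are not shown in this document.
data class AgentJob(val jobId: String, val command: String)
data class AgentResult(val jobId: String, val exitCode: Int, val output: String)

// Stand-in for the REST calls the agent makes with its X-API-Key header.
interface ControlPlane {
    fun register(hostname: String): String            // returns a node id
    fun heartbeat(nodeId: String)
    fun poll(nodeId: String): AgentJob?                // null when nothing is assigned
    fun submitResult(nodeId: String, result: AgentResult)
}

// Stand-in for the Shell / Docker / GithubAction executors.
fun interface Executor {
    fun execute(job: AgentJob): AgentResult
}

// One pass of the loop: heartbeat, poll, and run at most one job.
// Returns true if a job was executed, false if the poll came back empty.
fun agentTick(api: ControlPlane, nodeId: String, executor: Executor): Boolean {
    api.heartbeat(nodeId)
    val job = api.poll(nodeId) ?: return false
    api.submitResult(nodeId, executor.execute(job))
    return true
}

// In-memory fake so the sketch runs without a server.
class FakeControlPlane(jobs: List<AgentJob>) : ControlPlane {
    private val queue = ArrayDeque(jobs)
    val results = mutableListOf<AgentResult>()
    var heartbeats = 0

    override fun register(hostname: String): String = "node-$hostname"
    override fun heartbeat(nodeId: String) { heartbeats++ }
    override fun poll(nodeId: String): AgentJob? = queue.removeFirstOrNull()
    override fun submitResult(nodeId: String, result: AgentResult) { results += result }
}

fun main() {
    val api = FakeControlPlane(listOf(AgentJob("j1", "echo hi")))
    val nodeId = api.register("dev-box")
    val echo = Executor { job -> AgentResult(job.jobId, 0, "ran: ${job.command}") }
    // Keep ticking until a poll returns nothing.
    while (agentTick(api, nodeId, echo)) { }
    println(api.results)
}
```

A real agent would sleep between ticks and keep heartbeating while a long job runs, so that the compute-stale-nodes task does not mark it OFFLINE.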

Data model

MongoDB collections

| Collection | Primary key | Purpose |
|--- |--- |--- |
| compute_jobs | job_id | Every submission; lifecycle QUEUED→ASSIGNED→RUNNING→SUCCESS/FAILED/TIMEOUT/CANCELLED |
| compute_nodes | node_id | Registered agent fleet. Label-matched. BCrypt-hashed API keys. |
| compute_executions | id | One row per attempt of a job. Captures stdout/stderr/CPU/MEM per try. |
| compute_templates | template_id | Fork-and-run job templates (builtin + user-owned) |
| compute_schedules | schedule_id | Cron-triggered recurring jobs |
| compute_credit_balances | business_id+user_id | Free-tier + paid balance. $30/month auto-granted. |
| compute_credit_ledger | entry_id | Append-only credit ledger |
| compute_job_logs | _id | (optional) persisted log lines for offline replay |
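
To illustrate how an append-only ledger relates to a balance row, here is a small in-memory sketch. It assumes amounts are stored as signed cents and that a balance can be rebuilt by summing ledger entries; neither detail is stated above. `LedgerEntry`, `CreditLedger`, `amountCents`, and `reason` are hypothetical names.

```kotlin
// Hypothetical row shape for compute_credit_ledger. Only entry_id, business_id and
// user_id come from the table above; the other fields are assumptions.
data class LedgerEntry(
    val entryId: String,
    val businessId: String,
    val userId: String,
    val amountCents: Long,   // positive = grant or top-up, negative = job charge
    val reason: String,
)

class CreditLedger {
    private val entries = mutableListOf<LedgerEntry>()

    // Append-only: there is deliberately no update or delete operation.
    fun append(entry: LedgerEntry) {
        entries += entry
    }

    // Rebuilds the balance for one (business_id, user_id) pair by summing its entries.
    fun balanceCents(businessId: String, userId: String): Long =
        entries
            .filter { it.businessId == businessId && it.userId == userId }
            .sumOf { it.amountCents }
}

fun main() {
    val ledger = CreditLedger()
    ledger.append(LedgerEntry("e1", "biz-1", "user-1", 3_000, "monthly free-tier grant"))
    ledger.append(LedgerEntry("e2", "biz-1", "user-1", -125, "job charge"))
    println(ledger.balanceCents("biz-1", "user-1")) // 2875
}
```

Keeping the ledger immutable means a stored balance can be audited or recomputed at any time, at the cost of one extra write per credit movement.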

Job lifecycle

submit        ┌────────┐  dispatch          ┌──────────┐  poll          ┌─────────┐
  ─────►      │ QUEUED │ ─────────────────► │ ASSIGNED │ ─────────────► │ RUNNING │
              └────────┘                    └──────────┘                └────┬────┘
                                                                             │ submitResult
                                                                             ▼
                                                        ┌─────────┐   ┌─────────┐
                                                        │ SUCCESS │   │ FAILED  │
                                                        └─────────┘   └────┬────┘
                                                                           │ ComputeRetryService
                                                                           │ (if retryCount < maxRetries)
                                                                           ▼
                                                                      ┌────────┐
                                                                      │ QUEUED │
                                                                      └────────┘
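
The transitions in the diagram can be encoded as a small state machine. The sketch below is an assumption-laden illustration rather than the server's code: `JobState`, `transition`, and `retryIfEligible` are hypothetical names, RUNNING → TIMEOUT is inferred from the compute-timeout task, and CANCELLED is left without incoming edges because its source states are not specified here.

```kotlin
enum class JobStatus { QUEUED, ASSIGNED, RUNNING, SUCCESS, FAILED, TIMEOUT, CANCELLED }

// Edges drawn in the lifecycle diagram, plus RUNNING -> TIMEOUT (assumed from compute-timeout).
val allowedTransitions: Map<JobStatus, Set<JobStatus>> = mapOf(
    JobStatus.QUEUED to setOf(JobStatus.ASSIGNED),
    JobStatus.ASSIGNED to setOf(JobStatus.RUNNING),
    JobStatus.RUNNING to setOf(JobStatus.SUCCESS, JobStatus.FAILED, JobStatus.TIMEOUT),
    JobStatus.FAILED to setOf(JobStatus.QUEUED),
)

// Hypothetical in-memory view of a compute_jobs row.
data class JobState(
    val jobId: String,
    val status: JobStatus,
    val retryCount: Int = 0,
    val maxRetries: Int = 0,
)

// Moves a job to the next status, rejecting any edge not listed above.
fun transition(job: JobState, next: JobStatus): JobState {
    require(next in allowedTransitions[job.status].orEmpty()) {
        "illegal transition ${job.status} -> $next"
    }
    return job.copy(status = next)
}

// Mirrors the retry rule in the diagram: FAILED -> QUEUED only while retryCount < maxRetries.
fun retryIfEligible(job: JobState): JobState =
    if (job.status == JobStatus.FAILED && job.retryCount < job.maxRetries) {
        transition(job, JobStatus.QUEUED).copy(retryCount = job.retryCount + 1)
    } else {
        job
    }

fun main() {
    val failed = JobState("j1", JobStatus.FAILED, retryCount = 0, maxRetries = 1)
    val requeued = retryIfEligible(failed)
    println(requeued) // back in QUEUED with retryCount = 1
    println(retryIfEligible(requeued.copy(status = JobStatus.FAILED))) // unchanged: retries exhausted
}
```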

Auth

  • User endpoints (/api/compute/jobs/*, /api/compute/templates/*, etc.): JWT via the auth-jwt Ktor plugin. The principal exposes userId and businessId claims.

  • Agent endpoints (/api/compute/nodes/{register,heartbeat,poll,result,logs}): X-API-Key header. Keys are BCrypt-hashed in the node record and validated on every call.

  • Rotating keys: POST /api/compute/nodes/rotate-key issues a new raw key and updates the hash. Old keys are invalidated immediately.
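
A rough sketch of why rotation invalidates the old key at once: only one hash is stored per node, so overwriting it leaves nothing for the old key to match. The names `KeyHasher`, `Sha256StandIn`, and `NodeKeyStore` are hypothetical, and looking up the hash by node id is an assumption. The server uses BCrypt; the SHA-256 hasher here is only a stdlib stand-in to keep the example dependency-free and is not suitable for real key storage.

```kotlin
import java.security.MessageDigest
import java.security.SecureRandom
import java.util.Base64

// Hashing is injected so a BCrypt implementation can be plugged in.
interface KeyHasher {
    fun hash(rawKey: String): String
    fun verify(rawKey: String, storedHash: String): Boolean
}

// Stand-in only: the real service is described as using BCrypt, not SHA-256.
object Sha256StandIn : KeyHasher {
    override fun hash(rawKey: String): String =
        MessageDigest.getInstance("SHA-256")
            .digest(rawKey.toByteArray())
            .joinToString("") { "%02x".format(it) }

    override fun verify(rawKey: String, storedHash: String): Boolean =
        MessageDigest.isEqual(hash(rawKey).toByteArray(), storedHash.toByteArray())
}

class NodeKeyStore(private val hasher: KeyHasher) {
    private val hashByNode = mutableMapOf<String, String>()   // node_id -> key hash
    private val random = SecureRandom()

    // Issues a new raw key and overwrites the stored hash,
    // so the previous key stops validating immediately.
    fun rotateKey(nodeId: String): String {
        val raw = ByteArray(32)
        random.nextBytes(raw)
        val key = Base64.getUrlEncoder().withoutPadding().encodeToString(raw)
        hashByNode[nodeId] = hasher.hash(key)
        return key   // the raw key is returned once and never stored
    }

    // Would run on every agent call with the value of the X-API-Key header.
    fun validate(nodeId: String, presentedKey: String): Boolean =
        hashByNode[nodeId]?.let { hasher.verify(presentedKey, it) } ?: false
}

fun main() {
    val store = NodeKeyStore(Sha256StandIn)
    val oldKey = store.rotateKey("node-1")
    val newKey = store.rotateKey("node-1")
    println(store.validate("node-1", oldKey)) // false
    println(store.validate("node-1", newKey)) // true
}
```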

Log streaming

The log path is decoupled from persistence so real-time tailing stays fast:

  1. Agent calls POST /api/compute/nodes/logs with a batch of lines.
  2. ComputeNodeServiceImpl.pushLogs forwards the batch into ComputeLogServiceImpl.publish(jobId, lines).
  3. ComputeLogServiceImpl holds a MutableSharedFlow<String> per job (plus a 500-line tail buffer for late subscribers).
  4. The SSE endpoint GET /api/compute/jobs/{jobId}/logs/stream writes the recent tail, then new lines as they arrive. It uses DROP_OLDEST on overflow so slow clients never back-pressure the publish path.

For a multi-instance HA deployment, this in-process fan-out is swapped for Redis pub/sub via a drop-in replacement of ComputeLogService in ServerModule.kt.
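
The in-process fan-out described above might look roughly like the following. It is a sketch under assumptions, not ComputeLogServiceImpl itself: `InMemoryLogHub` is a hypothetical name, the tail is modelled with the flow's own replay cache, and the extra buffer capacity of 64 is an arbitrary choice. It requires the kotlinx-coroutines-core dependency.

```kotlin
import kotlinx.coroutines.channels.BufferOverflow
import kotlinx.coroutines.flow.MutableSharedFlow
import kotlinx.coroutines.flow.SharedFlow
import kotlinx.coroutines.flow.asSharedFlow
import java.util.concurrent.ConcurrentHashMap

class InMemoryLogHub(private val tailSize: Int = 500) {
    // One hot flow per job id, created lazily on first publish or subscribe.
    private val flows = ConcurrentHashMap<String, MutableSharedFlow<String>>()

    private fun flowFor(jobId: String): MutableSharedFlow<String> =
        flows.computeIfAbsent(jobId) {
            MutableSharedFlow<String>(
                replay = tailSize,                          // tail replayed to late subscribers
                extraBufferCapacity = 64,                   // assumed headroom for slow collectors
                onBufferOverflow = BufferOverflow.DROP_OLDEST,
            )
        }

    // Never suspends: with DROP_OLDEST, tryEmit discards the oldest buffered line
    // instead of waiting for slow subscribers.
    fun publish(jobId: String, lines: List<String>) {
        val flow = flowFor(jobId)
        lines.forEach { flow.tryEmit(it) }
    }

    // An SSE handler would collect this: it receives the replayed tail first, then live lines.
    fun stream(jobId: String): SharedFlow<String> = flowFor(jobId).asSharedFlow()

    // Snapshot of the current tail, useful for inspection.
    fun tail(jobId: String): List<String> = flowFor(jobId).replayCache
}

fun main() {
    val hub = InMemoryLogHub(tailSize = 3)
    hub.publish("job-1", listOf("a", "b", "c", "d", "e"))
    println(hub.tail("job-1")) // [c, d, e]
}
```

Because every subscriber shares the same bounded buffer, a client that falls far behind loses the oldest lines rather than slowing the agent's log upload.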