# Raft Migration Plan — Multi-Region HA Control Plane
> Status: planned (post-MVP). The current control plane is a single Ktor
> replica with a stub ComputeClusterService that always reports itself as the
> LEADER. This document describes the migration path to a real Raft-based
> multi-region high-availability deployment.
## 1. Current State
- Topology: single Ktor server replica, single MongoDB instance.
- Cluster service: ComputeClusterServiceImpl is a stub that:
  - Always returns ClusterRole.LEADER from currentRole().
  - Routes submitWrite(...) directly to the supplied applier (no replication).
  - Reports a one-node topology from environment variables (TANVRIT_REPLICA_ID, TANVRIT_REPLICA_REGION, TANVRIT_CLUSTER_ID).
- Failure modes: any process / region outage takes the whole control plane offline. Agents already have local retry/backoff so jobs in flight survive, but new submissions fail until the single replica recovers.
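For reference, the stub's contract can be captured in a few lines. This is an illustrative sketch, not the actual source: the enum values and method names mirror the descriptions above, but the real interface in the codebase may differ in signatures.

```kotlin
// Illustrative sketch of the cluster-service contract described above.
// Names mirror this document; exact signatures in the codebase may differ.
enum class ClusterRole { LEADER, FOLLOWER, CANDIDATE }

interface ComputeClusterService {
    fun currentRole(): ClusterRole
    // Runs `applier` for the given (operation, payload) pair; a real
    // implementation replicates the entry before applying it.
    fun <T> submitWrite(operation: String, payload: Any?, applier: () -> T): T
}

// The stub: always LEADER, no replication, the applier runs immediately.
class StubComputeClusterService : ComputeClusterService {
    override fun currentRole() = ClusterRole.LEADER
    override fun <T> submitWrite(operation: String, payload: Any?, applier: () -> T): T =
        applier()
}
```

Because the stub forwards straight to the applier, call sites behave exactly as if they had invoked the repository directly.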
## 2. Target Architecture
A 3-replica region-aware Raft cluster fronting MongoDB:
```
                     ┌────────────────────────┐
                     │   Global LB (anycast)  │
                     └──────────┬─────────────┘
                                │
      ┌─────────────────────────┼─────────────────────────┐
      │                         │                         │
┌─────▼─────┐             ┌─────▼─────┐             ┌─────▼─────┐
│ replica-A │  Raft RPC   │ replica-B │  Raft RPC   │ replica-C │
│ us-east-1 │◄───────────►│ us-west-2 │◄───────────►│ eu-west-1 │
└─────┬─────┘             └─────┬─────┘             └─────┬─────┘
      │                         │                         │
      └──────────────► MongoDB replica set ◄──────────────┘
                       (region-replicated)
```
- Quorum: 3 replicas → tolerates 1 failure. Five replicas (post-stable) → tolerates 2 failures and gives a tighter election window.
- Leader election: standard Raft. Election timeout target ≤ 5 s; heartbeat interval 500 ms.
- Library: Apache Ratis is the proposed Raft implementation. Alternatives evaluated: Atomix (less actively maintained), Project Galois (too low-level for our timeline). Ratis is JVM-native, used by HDFS / Ozone, and exposes a clean RaftClient / StateMachine API that maps directly to our ComputeClusterService interface.
- State machine: the Raft log entries are (operation, payload) pairs matching ComputeClusterService.submitWrite(...). The state machine apply function calls the existing service-layer applier — i.e. the existing Mongo write — once a quorum has acked the entry.
- Reads: all replicas serve reads directly from the local Mongo node. We accept eventual consistency on reads in exchange for read scalability; strict-consistency reads can opt in via submitWrite(operation = "read", ...), which forces a leader round-trip.
- MongoDB: must be externalized into a region-replicated MongoDB cluster (Atlas or self-hosted replica set). Each Tanvrit replica connects to the nearest Mongo secondary for reads and to the primary for writes the Raft log has committed.
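The state-machine bullet above can be sketched as a small dispatcher: once Raft has committed an entry, the state machine runs the applier registered for that operation and advances its applied index. This is a simplified, hypothetical picture — real code would implement Ratis's StateMachine API, which is omitted here; the class and names below are invented for illustration.

```kotlin
// Hypothetical sketch of the apply path. A real implementation would sit
// behind Ratis's StateMachine API; this only shows the (operation, payload)
// dispatch described in the architecture bullets.
data class LogEntry(val operation: String, val payload: String)

class ComputeStateMachine {
    private val appliers = mutableMapOf<String, (String) -> Unit>()
    var appliedIndex = 0L
        private set

    // Each feature registers its existing service-layer applier (the Mongo write).
    fun register(operation: String, applier: (String) -> Unit) {
        appliers[operation] = applier
    }

    // Called once per committed entry, in log order, after quorum ack.
    fun apply(index: Long, entry: LogEntry) {
        appliers[entry.operation]?.invoke(entry.payload)
        appliedIndex = index
    }
}
```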
## 3. Migration Steps
The migration is feature-flagged so we can deploy the new code paths without flipping any user-visible behavior. The flag is compute.cluster.raft and defaults to false (use the stub).
- Introduce Apache Ratis as a dependency. Add to server/build.gradle.kts under a constraint so the dependency is only resolved when the flag is on. Build a RatisComputeClusterService alongside the existing ComputeClusterServiceImpl. Wire selection in ServerModule.kt:

  ```kotlin
  single<ComputeClusterService> {
      if (System.getenv("TANVRIT_CLUSTER_RAFT") == "true") RatisComputeClusterService(...)
      else ComputeClusterServiceImpl()
  }
  ```

- Externalize MongoDB into a region-replicated cluster. Provision a 3-region replica set (or migrate to MongoDB Atlas with a multi-region cluster). Verify reads from each region behave correctly under the existing stub before adding Raft on top.
- Feature-flag compute.cluster.raft=true, initially in a staging environment only. Bring up 3 replicas in 3 regions and point them at the externalized Mongo. Verify the topology endpoint (POST /api/compute/cluster/topology) reports all three.
- Deploy 3 replicas to production behind the anycast LB. Replicas serve reads immediately. Writes still go to a single MongoDB primary; the Raft log is only journaling them at this stage.
- Verify failover. Kill the LEADER replica and confirm:
  - A new leader is elected within the election timeout (< 5 s).
  - In-flight writes either complete on the new leader or surface a clear 503 Service Unavailable to clients. The agent's existing retry loop handles 503s with exponential backoff.
  - Agents reconnect through anycast and resume polling with no manual action.
- Flip the flag in production. From this point on, every write goes through the Raft log before touching MongoDB. The stub ComputeClusterServiceImpl is retired but kept in the tree as the reference single-node implementation for local development and tests.
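The failover step leans on the agents' existing retry loop. A minimal version of that exponential-backoff pattern looks like the sketch below — the attempt count, base delay, and cap are illustrative values, not the agents' actual configuration.

```kotlin
// Illustrative exponential backoff: retry a request that returns 503 until it
// succeeds or attempts run out. Delays double from baseDelayMs up to maxDelayMs.
// The `sleep` parameter is injectable so tests can skip real waiting.
fun <T> retryOn503(
    maxAttempts: Int = 5,
    baseDelayMs: Long = 200,
    maxDelayMs: Long = 5_000,
    sleep: (Long) -> Unit = Thread::sleep,
    request: () -> Pair<Int, T?>,   // returns (HTTP status, body)
): T {
    var delay = baseDelayMs
    repeat(maxAttempts) { attempt ->
        val (status, body) = request()
        if (status != 503) return body ?: error("unexpected empty body")
        if (attempt < maxAttempts - 1) {
            sleep(delay)
            delay = (delay * 2).coerceAtMost(maxDelayMs)
        }
    }
    error("still 503 after $maxAttempts attempts")
}
```

During a leader election every write briefly returns 503, so a loop like this is what lets agents ride out the sub-5-second election window without manual intervention.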
## 4. Read vs Write Routing
The interface deliberately distinguishes:
- Reads (currentRole, currentLeaderId, clusterTopology, plus all read-only repository calls in the service layer) — work from any replica. The HTTP layer does not need to know whether it's hitting a leader or a follower.
- Writes (everything that mutates state) — must go through ComputeClusterService.submitWrite(...). The applier closure is the existing per-feature service code; Raft replication is transparent to the call site.
For example, job creation evolves from:

```kotlin
// Before:
jobRepository.insertJob(job)

// After:
clusterService.submitWrite("compute.job.create", job) {
    jobRepository.insertJob(job)
}
```
The call site is identical regardless of which ComputeClusterService implementation is bound — that's the entire point of the interface.
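To make that transparency concrete, here is one hypothetical shape a Raft-backed implementation could take at the same call site: replicate the entry, then run the applier once a quorum has acknowledged it, and refuse the write on a follower so the HTTP layer can surface a 503. The class, exception, and injected functions are invented for illustration; the real RatisComputeClusterService would wire these through Ratis.

```kotlin
// Hypothetical Raft-backed variant. On a follower the write is rejected so the
// HTTP layer can return 503 and let the agent's retry loop find the new leader.
class NotLeaderException(leaderId: String?) :
    RuntimeException("not the leader; current leader: $leaderId")

class RaftBackedClusterService(
    private val isLeader: () -> Boolean,
    private val replicateAndWaitForQuorum: (operation: String, payload: Any?) -> Unit,
) {
    fun <T> submitWrite(operation: String, payload: Any?, applier: () -> T): T {
        if (!isLeader()) throw NotLeaderException(null)
        replicateAndWaitForQuorum(operation, payload)  // blocks until committed
        return applier()                               // the existing Mongo write
    }
}
```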
## 5. Read Paths That Work From Any Replica
These endpoints are explicitly safe to serve from a follower without any leader round-trip:
- POST /api/compute/jobs/list — job listings
- POST /api/compute/jobs/:id/logs — log streaming (logs are append-only and read from the same Mongo replica set)
- POST /api/compute/nodes/list — node listings
- POST /api/compute/cluster/topology — cluster introspection (this file's endpoint)
- POST /api/compute/metrics — aggregated metrics (eventually consistent)
Write paths that must go through the leader once Raft is enabled:
- Job submission, cancellation, retry
- Node registration, draining, deletion
- API key creation/revocation
- Credit ledger entries
- Schedule create/update/delete
## 6. Open Questions / Future Work

- Read-your-writes consistency: clients that submit a write and immediately list jobs may hit a follower that hasn't applied the entry yet. Mitigation: return the leader's commit index in the write response, and the client passes it in the next read so a follower can wait until its applied index ≥ that value. This is a v2 enhancement; v1 accepts eventual consistency.
- Multi-region writes: all writes are still serialized through a single leader, so cross-region write latency dominates. v2 should explore multi-Raft (one Raft group per shard, e.g. one per business) so US writes hit a US leader and EU writes hit an EU leader.
- Snapshotting: Ratis ships with snapshot support; we'll snapshot every 10 000 entries and ship snapshots over the same channel as AppendEntries.
- Observability: export Raft metrics (term, commit index, applied index, leader id) to Prometheus alongside the existing service metrics.
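The read-your-writes mitigation amounts to a wait on the follower's applied index. A sketch of that v2 helper, under stated assumptions (the function name, polling approach, timeout, and poll interval are all illustrative; a real implementation would block on the state machine's apply notifications rather than poll):

```kotlin
// Hypothetical v2 helper: before serving a read, a follower waits until its
// applied index has caught up to the commit index the client received in its
// write response. Returns false if the follower can't catch up in time, in
// which case the read could be proxied to the leader instead.
fun awaitAppliedIndex(
    appliedIndex: () -> Long,   // the follower's current applied index
    minIndex: Long,             // commit index echoed back by the client
    timeoutMs: Long = 2_000,
    pollMs: Long = 10,
    sleep: (Long) -> Unit = Thread::sleep,
): Boolean {
    val deadline = System.currentTimeMillis() + timeoutMs
    while (appliedIndex() < minIndex) {
        if (System.currentTimeMillis() >= deadline) return false
        sleep(pollMs)
    }
    return true
}
```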