# Raft Migration Plan — Multi-Region HA Control Plane
> Status: planned (post-MVP). The current control plane is a single Ktor
> replica with a stub ComputeClusterService that always reports itself as the
> LEADER. This document describes the migration path to a real Raft-based
> multi-region high-availability deployment.
## 1. Current State
- Topology: single Ktor server replica, single MongoDB instance.
- Cluster service: ComputeClusterServiceImpl is a stub that:
  - Always returns ClusterRole.LEADER from currentRole().
  - Routes submitWrite(...) directly to the supplied applier (no replication).
  - Reports a one-node topology from environment variables (TANVRIT_REPLICA_ID, TANVRIT_REPLICA_REGION, TANVRIT_CLUSTER_ID).
- Failure modes: any process / region outage takes the whole control plane offline. Agents already have local retry/backoff so jobs in flight survive, but new submissions fail until the single replica recovers.
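For reference, the stub's contract can be captured in a few lines. This is an illustrative sketch, not the actual source: the enum values and method names mirror the descriptions above, but the real interface in the codebase may differ in signatures.

```kotlin
// Illustrative sketch of the cluster-service contract described above.
// Names mirror this document; exact signatures in the codebase may differ.
enum class ClusterRole { LEADER, FOLLOWER, CANDIDATE }

interface ComputeClusterService {
    fun currentRole(): ClusterRole
    // Runs `applier` for the given (operation, payload) pair; a real
    // implementation replicates the entry before applying it.
    fun <T> submitWrite(operation: String, payload: Any?, applier: () -> T): T
}

// The stub: always LEADER, no replication, the applier runs immediately.
class StubComputeClusterService : ComputeClusterService {
    override fun currentRole() = ClusterRole.LEADER
    override fun <T> submitWrite(operation: String, payload: Any?, applier: () -> T): T =
        applier()
}
```

Because the stub forwards straight to the applier, call sites behave exactly as if they had invoked the repository directly.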
## 2. Target Architecture
A 3-replica region-aware Raft cluster fronting MongoDB:
```
                     ┌────────────────────────┐
                     │   Global LB (anycast)  │
                     └──────────┬─────────────┘
                                │
      ┌─────────────────────────┼─────────────────────────┐
      │                         │                         │
┌─────▼─────┐             ┌─────▼─────┐             ┌─────▼─────┐
│ replica-A │  Raft RPC   │ replica-B │  Raft RPC   │ replica-C │
│ us-east-1 │◄───────────►│ us-west-2 │◄───────────►│ eu-west-1 │
└─────┬─────┘             └─────┬─────┘             └─────┬─────┘
      │                         │                         │
      └──────────────► MongoDB replica set ◄──────────────┘
                       (region-replicated)
```
- Quorum: 3 replicas → tolerates 1 failure. Five replicas (post-stable) → tolerates 2 failures and gives a tighter election window.
- Leader election: standard Raft. Election timeout target ≤ 5 s; heartbeat interval 500 ms.
- Library: Apache Ratis is the proposed Raft implementation. Alternatives evaluated: Atomix (less actively maintained), Project Galois (too low-level for our timeline). Ratis is JVM-native, used by HDFS / Ozone, and exposes a clean RaftClient / StateMachine API that maps directly to our ComputeClusterService interface.
- State machine: the Raft log entries are (operation, payload) pairs matching ComputeClusterService.submitWrite(...). The state machine apply function calls the existing service-layer applier — i.e. the existing Mongo write — once a quorum has acked the entry.
- Reads: all replicas serve reads directly from the local Mongo node. We accept eventual consistency on reads in exchange for read scalability; strict-consistency reads can opt in via submitWrite(operation = "read", ...), which forces a leader round-trip.
- MongoDB: must be externalized into a region-replicated MongoDB cluster (Atlas or self-hosted replica set). Each Tanvrit replica connects to the nearest Mongo secondary for reads and to the primary for writes the Raft log has committed.
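The state-machine bullet above can be sketched as a small dispatcher: once Raft has committed an entry, the state machine runs the applier registered for that operation and advances its applied index. This is a simplified, hypothetical picture — real code would implement Ratis's StateMachine API, which is omitted here; the class and names below are invented for illustration.

```kotlin
// Hypothetical sketch of the apply path. A real implementation would sit
// behind Ratis's StateMachine API; this only shows the (operation, payload)
// dispatch described in the architecture bullets.
data class LogEntry(val operation: String, val payload: String)

class ComputeStateMachine {
    private val appliers = mutableMapOf<String, (String) -> Unit>()
    var appliedIndex = 0L
        private set

    // Each feature registers its existing service-layer applier (the Mongo write).
    fun register(operation: String, applier: (String) -> Unit) {
        appliers[operation] = applier
    }

    // Called once per committed entry, in log order, after quorum ack.
    fun apply(index: Long, entry: LogEntry) {
        appliers[entry.operation]?.invoke(entry.payload)
        appliedIndex = index
    }
}
```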
## 3. Migration Steps
The migration is feature-flagged so we can deploy the new code paths without flipping any user-visible behavior. The flag is compute.cluster.raft and defaults to false (use the stub).
- Introduce Apache Ratis as a dependency. Add to server/build.gradle.kts under a constraint so the dependency is only resolved when the flag is on. Build a RatisComputeClusterService alongside the existing ComputeClusterServiceImpl. Wire selection in ServerModule.kt:

  ```kotlin
  single<ComputeClusterService> {
      if (System.getenv("TANVRIT_CLUSTER_RAFT") == "true") RatisComputeClusterService(...)
      else ComputeClusterServiceImpl()
  }
  ```

- Externalize MongoDB into a region-replicated cluster. Provision a 3-region replica set (or migrate to MongoDB Atlas with a multi-region cluster). Verify reads from each region behave correctly under the existing stub before adding Raft on top.
- Feature-flag compute.cluster.raft=true, initially in a staging environment only. Bring up 3 replicas in 3 regions and point them at the externalized Mongo. Verify the topology endpoint (POST /api/compute/cluster/topology) reports all three.
- Deploy 3 replicas to production behind the anycast LB. Replicas serve reads immediately. Writes still go to a single MongoDB primary; the Raft log is only journaling them at this stage.
- Verify failover. Kill the LEADER replica and confirm:
  - A new leader is elected within the election timeout (< 5 s).
  - In-flight writes either complete on the new leader or surface a clear 503 Service Unavailable to clients. The agent's existing retry loop handles 503s with exponential backoff.
  - Agents reconnect through anycast and resume polling with no manual action.
- Flip the flag in production. From this point on, every write goes through the Raft log before touching MongoDB. The stub ComputeClusterServiceImpl is retired but kept in the tree as the reference single-node implementation for local development and tests.
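The failover step leans on the agents' existing retry loop. A minimal version of that exponential-backoff pattern looks like the sketch below — the attempt count, base delay, and cap are illustrative values, not the agents' actual configuration.

```kotlin
// Illustrative exponential backoff: retry a request that returns 503 until it
// succeeds or attempts run out. Delays double from baseDelayMs up to maxDelayMs.
// The `sleep` parameter is injectable so tests can skip real waiting.
fun <T> retryOn503(
    maxAttempts: Int = 5,
    baseDelayMs: Long = 200,
    maxDelayMs: Long = 5_000,
    sleep: (Long) -> Unit = Thread::sleep,
    request: () -> Pair<Int, T?>,   // returns (HTTP status, body)
): T {
    var delay = baseDelayMs
    repeat(maxAttempts) { attempt ->
        val (status, body) = request()
        if (status != 503) return body ?: error("unexpected empty body")
        if (attempt < maxAttempts - 1) {
            sleep(delay)
            delay = (delay * 2).coerceAtMost(maxDelayMs)
        }
    }
    error("still 503 after $maxAttempts attempts")
}
```

During a leader election every write briefly returns 503, so a loop like this is what lets agents ride out the sub-5-second election window without manual intervention.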
## 4. Read vs Write Routing
The interface deliberately distinguishes:
- Reads (currentRole, currentLeaderId, clusterTopology, plus all read-only repository calls in the service layer) — work from any replica. The HTTP layer does not need to know whether it's hitting a leader or a follower.
- Writes (everything that mutates state) — must go through ComputeClusterService.submitWrite(...). The applier closure is the existing per-feature service code; Raft replication is transparent to the call site.
For example, job creation evolves from:

```kotlin
// Before:
jobRepository.insertJob(job)

// After:
clusterService.submitWrite("compute.job.create", job) {
    jobRepository.insertJob(job)
}
```
The call site is identical regardless of which ComputeClusterService implementation is bound — that's the entire point of the interface.
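To make that transparency concrete, here is one hypothetical shape a Raft-backed implementation could take at the same call site: replicate the entry, then run the applier once a quorum has acknowledged it, and refuse the write on a follower so the HTTP layer can surface a 503. The class, exception, and injected functions are invented for illustration; the real RatisComputeClusterService would wire these through Ratis.

```kotlin
// Hypothetical Raft-backed variant. On a follower the write is rejected so the
// HTTP layer can return 503 and let the agent's retry loop find the new leader.
class NotLeaderException(leaderId: String?) :
    RuntimeException("not the leader; current leader: $leaderId")

class RaftBackedClusterService(
    private val isLeader: () -> Boolean,
    private val replicateAndWaitForQuorum: (operation: String, payload: Any?) -> Unit,
) {
    fun <T> submitWrite(operation: String, payload: Any?, applier: () -> T): T {
        if (!isLeader()) throw NotLeaderException(null)
        replicateAndWaitForQuorum(operation, payload)  // blocks until committed
        return applier()                               // the existing Mongo write
    }
}
```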
## 5. Read Paths That Work From Any Replica
These endpoints are explicitly safe to serve from a follower without any leader round-trip:
- POST /api/compute/jobs/list — job listings
- POST /api/compute/jobs/:id/logs — log streaming (logs are append-only and read from the same Mongo replica set)
- POST /api/compute/nodes/list — node listings
- POST /api/compute/cluster/topology — cluster introspection (this file's endpoint)
- POST /api/compute/metrics — aggregated metrics (eventually consistent)
Write paths that must go through the leader once Raft is enabled:
- Job submission, cancellation, retry
- Node registration, draining, deletion
- API key creation/revocation
- Credit ledger entries
- Schedule create/update/delete
## 6. Open Questions / Future Work

- Read-your-writes consistency: clients that submit a write and immediately list jobs may hit a follower that hasn't applied the entry yet. Mitigation: return the leader's commit index in the write response, and the client passes it in the next read so a follower can wait until its applied index ≥ that value. This is a v2 enhancement; v1 accepts eventual consistency.
- Multi-region writes: all writes are still serialized through a single leader, so cross-region write latency dominates. v2 should explore multi-Raft (one Raft group per shard, e.g. one per business) so US writes hit a US leader and EU writes hit an EU leader.
- Snapshotting: Ratis ships with snapshot support; we'll snapshot every 10 000 entries and ship snapshots over the same channel as AppendEntries.
- Observability: export Raft metrics (term, commit index, applied index, leader id) to Prometheus alongside the existing service metrics.
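The read-your-writes mitigation amounts to a wait on the follower's applied index. A sketch of that v2 helper, under stated assumptions (the function name, polling approach, timeout, and poll interval are all illustrative; a real implementation would block on the state machine's apply notifications rather than poll):

```kotlin
// Hypothetical v2 helper: before serving a read, a follower waits until its
// applied index has caught up to the commit index the client received in its
// write response. Returns false if the follower can't catch up in time, in
// which case the read could be proxied to the leader instead.
fun awaitAppliedIndex(
    appliedIndex: () -> Long,   // the follower's current applied index
    minIndex: Long,             // commit index echoed back by the client
    timeoutMs: Long = 2_000,
    pollMs: Long = 10,
    sleep: (Long) -> Unit = Thread::sleep,
): Boolean {
    val deadline = System.currentTimeMillis() + timeoutMs
    while (appliedIndex() < minIndex) {
        if (System.currentTimeMillis() >= deadline) return false
        sleep(pollMs)
    }
    return true
}
```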