multi-master explorer support for high availability #44

Open
opened 2026-03-24 13:33:51 +00:00 by mahmoud · 0 comments
Owner

Description

Currently hero_compute supports only a single master (explorer) node. If the master goes down, the explorer is unavailable and workers cannot register or have their VMs managed remotely. We need multi-master support for high availability.

Current Architecture

  • 1 master runs the explorer with a local OSIS database
  • N workers send heartbeats to the single master
  • All VM proxy operations route through that single explorer
  • If the master dies, the explorer is unavailable

Proposed Architecture

What Already Works

  • EXPLORER_ADDRESSES env var accepts comma-separated addresses
  • Heartbeat sender already iterates over multiple explorer addresses (sends to all)
  • Proxy architecture is stateless — any explorer with the node registry could proxy
  • start.sh worker mode already supports multiple master addresses conceptually
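The existing fan-out behavior can be sketched as follows. This is a minimal illustration, not the real hero_compute code: `parse_explorer_addresses`, `send_heartbeats`, and the injected `post` transport are hypothetical names, but the logic mirrors what the bullets above describe (comma-separated `EXPLORER_ADDRESSES`, heartbeat sent to every address, one dead explorer not blocking the rest).

```python
def parse_explorer_addresses(env: dict) -> list[str]:
    """Split the comma-separated EXPLORER_ADDRESSES value into a clean list."""
    raw = env.get("EXPLORER_ADDRESSES", "")
    return [addr.strip() for addr in raw.split(",") if addr.strip()]


def send_heartbeats(addresses: list[str], payload: dict, post) -> list[str]:
    """Send the heartbeat payload to every explorer; return those that accepted."""
    reached = []
    for addr in addresses:
        try:
            post(addr, payload)  # injected transport, e.g. an HTTP POST
            reached.append(addr)
        except OSError:
            continue  # an unreachable explorer must not block the others
    return reached
```

Because the sender already tolerates partial failure, multi-master heartbeats mostly need configuration plumbing rather than new transport logic.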

What Needs To Be Built

Phase 1: State replication between explorers

  • Explorer-to-explorer heartbeat/sync protocol
  • Replicate ExplorerNode records between master instances
  • Conflict resolution strategy (last-write-wins by last_seen timestamp, or CRDT)
  • New mode: make start MODE=master PEER_MASTERS=<ip1>,<ip2>
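The last-write-wins option above could look roughly like this. The `NodeRecord` shape is an assumption (the real ExplorerNode fields may differ); the point is that merging two registries keyed by node id and resolved by `last_seen` is a few lines of code:

```python
from dataclasses import dataclass


@dataclass
class NodeRecord:
    node_id: str
    last_seen: float  # unix timestamp of the worker's latest heartbeat
    stats: dict


def merge_registries(local: dict, remote: dict) -> dict:
    """Merge two {node_id: NodeRecord} maps, keeping the freshest record."""
    merged = dict(local)
    for node_id, record in remote.items():
        current = merged.get(node_id)
        if current is None or record.last_seen > current.last_seen:
            merged[node_id] = record
    return merged
```

A CRDT (e.g. a per-field LWW register) would make merges commutative regardless of sync order, but since workers heartbeat to all masters anyway, timestamp-based LWW is likely sufficient.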

Phase 2: Worker multi-master awareness

  • Workers send heartbeats to all configured masters (already partially done)
  • Workers fail over to the next master if the primary is unreachable
  • UI proxy failover: if /explorer/rpc fails on one master, try the next
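Both the worker failover and the UI proxy failover reduce to the same pattern: try each configured master in order and return the first answer. A minimal sketch, with `call` as an injected RPC (names are illustrative, not the real API):

```python
def call_with_failover(masters: list[str], request: dict, call):
    """Return (master, response) from the first master that answers."""
    last_error = None
    for master in masters:
        try:
            return master, call(master, request)
        except OSError as exc:
            last_error = exc  # remember why this master failed, then try the next
    raise ConnectionError(f"all masters unreachable: {last_error}")
```

The same helper could back /explorer/rpc proxying: on failure, retry the request against the next master in the list before surfacing an error to the UI.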

Phase 3: Consistency and health

  • Split-brain detection (two masters with conflicting node state)
  • Master health monitoring (masters monitor each other)
  • Stale master cleanup (if a master goes offline, any state only it held is adopted by the surviving masters)
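Split-brain detection can be framed as: records that differ only in freshness are resolvable by last-write-wins, while records with the same `last_seen` but different state indicate a genuine conflict. A sketch under that assumption (registry shape is hypothetical):

```python
def find_conflicts(a: dict, b: dict) -> list[str]:
    """Return node ids present in both registries with conflicting state.

    Each registry maps node_id -> (last_seen, state). Records with equal
    last_seen but different state are conflicts that last-write-wins
    alone cannot resolve and should be flagged for health monitoring.
    """
    conflicts = []
    for node_id in a.keys() & b.keys():
        seen_a, state_a = a[node_id]
        seen_b, state_b = b[node_id]
        if seen_a == seen_b and state_a != state_b:
            conflicts.append(node_id)
    return sorted(conflicts)
```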

Design Considerations

  • State model: Explorer state is small (node metadata + stats). Full replication is feasible.
  • No shared database needed: Peer-to-peer replication between explorers is simpler than adding an external DB.
  • Consistency level: Eventual consistency is acceptable — node stats are already approximate (heartbeat-based).
  • VM operations: VM proxy calls are stateless (forwarded to the compute node). Any master with the node's socket_path can
    proxy.
  • Backwards compatible: Single-master mode must continue to work without configuration changes.
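The statelessness of VM operations is the key simplification: a proxy call only needs the node's socket_path from the replicated registry, so no per-request state has to be shared between masters. Sketched with hypothetical names:

```python
def proxy_vm_call(registry: dict, node_id: str, request: dict, forward):
    """Look up the node's socket_path and forward the request unchanged."""
    socket_path = registry[node_id]["socket_path"]
    return forward(socket_path, request)  # injected transport to the compute node
```

As long as registry replication (Phase 1) keeps socket_path current on every master, any master can serve VM operations without coordination.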

Acceptance Criteria

  • Two master nodes can run simultaneously, each aware of all worker nodes
  • Workers register with both masters via heartbeats
  • UI on either master shows all nodes and VMs
  • If one master goes down, the other continues serving
  • make start MODE=master PEER_MASTERS=<ip> configures peering
  • Single-master mode (no PEER_MASTERS) works exactly as today
mahmoud added this to the later milestone 2026-03-31 14:05:49 +00:00
Reference
lhumina_code/hero_compute#44