multi-master explorer support for high availability #44

Open
opened 2026-03-24 13:33:51 +00:00 by mahmoud · 0 comments
Owner

Description

Currently hero_compute supports only a single master (explorer) node. If the master goes down, the explorer is unavailable and workers cannot register or have their VMs managed remotely. We need multi-master support for high availability.

Current Architecture

  • 1 master runs the explorer with a local OSIS database
  • N workers send heartbeats to the single master
  • All VM proxy operations route through that single explorer
  • If the master dies, the explorer is unavailable

Proposed Architecture

What Already Works

  • EXPLORER_ADDRESSES env var accepts comma-separated addresses
  • Heartbeat sender already iterates over multiple explorer addresses (sends to all)
  • Proxy architecture is stateless — any explorer with the node registry could proxy
  • start.sh worker mode already supports multiple master addresses conceptually
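The existing fan-out behavior can be sketched as follows. This is a minimal illustration, not the real hero_compute code: `parse_explorer_addresses`, `send_heartbeats`, and the injected `post` transport are hypothetical names, but the logic mirrors what the bullets above describe (comma-separated `EXPLORER_ADDRESSES`, heartbeat sent to every address, one dead explorer not blocking the rest).

```python
def parse_explorer_addresses(env: dict) -> list[str]:
    """Split the comma-separated EXPLORER_ADDRESSES value into a clean list."""
    raw = env.get("EXPLORER_ADDRESSES", "")
    return [addr.strip() for addr in raw.split(",") if addr.strip()]


def send_heartbeats(addresses: list[str], payload: dict, post) -> list[str]:
    """Send the heartbeat payload to every explorer; return those that accepted."""
    reached = []
    for addr in addresses:
        try:
            post(addr, payload)  # injected transport, e.g. an HTTP POST
            reached.append(addr)
        except OSError:
            continue  # an unreachable explorer must not block the others
    return reached
```

Because the sender already tolerates partial failure, multi-master heartbeats mostly need configuration plumbing rather than new transport logic.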

What Needs To Be Built

Phase 1: State replication between explorers

  • Explorer-to-explorer heartbeat/sync protocol
  • Replicate ExplorerNode records between master instances
  • Conflict resolution strategy (last-write-wins by last_seen timestamp, or CRDT)
  • New mode: make start MODE=master PEER_MASTERS=<ip1>,<ip2>
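The last-write-wins option above could look roughly like this. The `NodeRecord` shape is an assumption (the real ExplorerNode fields may differ); the point is that merging two registries keyed by node id and resolved by `last_seen` is a few lines of code:

```python
from dataclasses import dataclass


@dataclass
class NodeRecord:
    node_id: str
    last_seen: float  # unix timestamp of the worker's latest heartbeat
    stats: dict


def merge_registries(local: dict, remote: dict) -> dict:
    """Merge two {node_id: NodeRecord} maps, keeping the freshest record."""
    merged = dict(local)
    for node_id, record in remote.items():
        current = merged.get(node_id)
        if current is None or record.last_seen > current.last_seen:
            merged[node_id] = record
    return merged
```

A CRDT (e.g. a per-field LWW register) would make merges commutative regardless of sync order, but since workers heartbeat to all masters anyway, timestamp-based LWW is likely sufficient.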

Phase 2: Worker multi-master awareness

  • Workers send heartbeats to all configured masters (already partially done)
  • Workers fail over to the next master if the primary is unreachable
  • UI proxy failover: if /explorer/rpc fails on one master, try the next
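Both the worker failover and the UI proxy failover reduce to the same pattern: try each configured master in order and return the first answer. A minimal sketch, with `call` as an injected RPC (names are illustrative, not the real API):

```python
def call_with_failover(masters: list[str], request: dict, call):
    """Return (master, response) from the first master that answers."""
    last_error = None
    for master in masters:
        try:
            return master, call(master, request)
        except OSError as exc:
            last_error = exc  # remember why this master failed, then try the next
    raise ConnectionError(f"all masters unreachable: {last_error}")
```

The same helper could back /explorer/rpc proxying: on failure, retry the request against the next master in the list before surfacing an error to the UI.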

Phase 3: Consistency and health

  • Split-brain detection (two masters with conflicting node state)
  • Master health monitoring (masters monitor each other)
  • Stale master cleanup (if a master goes offline, any state only it held is adopted by the surviving masters)
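Split-brain detection can be framed as: records that differ only in freshness are resolvable by last-write-wins, while records with the same `last_seen` but different state indicate a genuine conflict. A sketch under that assumption (registry shape is hypothetical):

```python
def find_conflicts(a: dict, b: dict) -> list[str]:
    """Return node ids present in both registries with conflicting state.

    Each registry maps node_id -> (last_seen, state). Records with equal
    last_seen but different state are conflicts that last-write-wins
    alone cannot resolve and should be flagged for health monitoring.
    """
    conflicts = []
    for node_id in a.keys() & b.keys():
        seen_a, state_a = a[node_id]
        seen_b, state_b = b[node_id]
        if seen_a == seen_b and state_a != state_b:
            conflicts.append(node_id)
    return sorted(conflicts)
```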

Design Considerations

  • State model: Explorer state is small (node metadata + stats). Full replication is feasible.
  • No shared database needed: Peer-to-peer replication between explorers is simpler than adding an external DB.
  • Consistency level: Eventual consistency is acceptable — node stats are already approximate (heartbeat-based).
  • VM operations: VM proxy calls are stateless (forwarded to the compute node). Any master with the node's socket_path can
    proxy.
  • Backwards compatible: Single-master mode must continue to work without configuration changes.
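The statelessness of VM operations is the key simplification: a proxy call only needs the node's socket_path from the replicated registry, so no per-request state has to be shared between masters. Sketched with hypothetical names:

```python
def proxy_vm_call(registry: dict, node_id: str, request: dict, forward):
    """Look up the node's socket_path and forward the request unchanged."""
    socket_path = registry[node_id]["socket_path"]
    return forward(socket_path, request)  # injected transport to the compute node
```

As long as registry replication (Phase 1) keeps socket_path current on every master, any master can serve VM operations without coordination.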

Acceptance Criteria

  • Two master nodes can run simultaneously, each aware of all worker nodes
  • Workers register with both masters via heartbeats
  • UI on either master shows all nodes and VMs
  • If one master goes down, the other continues serving
  • make start MODE=master PEER_MASTERS=<ip> configures peering
  • Single-master mode (no PEER_MASTERS) works exactly as today
mahmoud added this to the later milestone 2026-03-31 14:05:49 +00:00
Reference
lhumina_code/hero_compute#44