This commit is contained in:
Maxime Van Hees
2025-08-14 14:14:34 +02:00
parent 04a1af2423
commit 0ebda7c1aa
59 changed files with 6950 additions and 354 deletions


@@ -0,0 +1,231 @@
# Redis Queue Naming Proposal (Multi-Actor, Multi-Type, Scalable)
## Goal
- Define a consistent, future-proof Redis naming scheme that:
- Supports multiple actor types (OSIS, SAL, V, Python)
- Supports multiple pools/groups and instances per type
- Enables fair load-balancing and targeted dispatch
- Works with both “hash-output” actors and “reply-queue” actors
- Keeps migration straightforward from the current keys
## Motivation
- Today, multiple non-unified patterns exist:
- Per-actor keys like "hero:job:{actor_id}" consumed by in-crate Rhai actor
- Per-type keys like "hero:job:actor_queue:{suffix}" used by other components
- Protocol docs that reference "hero:work_queue:{actor_id}" and "hero:reply:{job_id}"
- This fragmentation causes stuck “Dispatched” jobs when the LPUSH target doesn't match the BLPOP listener. We need one canonical scheme with well-defined fallbacks.
## 1) Canonical Key Names
Prefix conventions
- Namespace prefix: hero:
- All queues collected under hero:q:* to separate from job hashes hero:job:*
- All metadata under hero:meta:* for discoverability
Job and result keys
- Job hash (unchanged): hero:job:{job_id}
- Reply queue: hero:q:reply:{job_id}
Work queues (new canonical)
- Type queue (shared): hero:q:work:type:{script_type}
- Examples:
- hero:q:work:type:osis
- hero:q:work:type:sal
- hero:q:work:type:v
- hero:q:work:type:python
- Group queue (optional, shared within a group): hero:q:work:type:{script_type}:group:{group}
- Examples:
- hero:q:work:type:osis:group:default
- hero:q:work:type:sal:group:io
- Instance queue (most specific, used for targeted dispatch): hero:q:work:type:{script_type}:group:{group}:inst:{instance}
- Examples:
- hero:q:work:type:osis:group:default:inst:1
- hero:q:work:type:sal:group:io:inst:3
Control queues (optional, future)
- Stop/control per-type: hero:q:ctl:type:{script_type}
- Stop/control per-instance: hero:q:ctl:type:{script_type}:group:{group}:inst:{instance}
Actor presence and metadata
- Instance presence (ephemeral, with TTL refresh): hero:meta:actor:inst:{script_type}:{group}:{instance}
- Value: JSON { pid, hostname, started_at, version, capabilities, last_heartbeat }
- Used by the supervisor to discover live consumers and to select targeted queueing
## 2) Dispatch Strategy
- Default: Push to the Type queue hero:q:work:type:{script_type}
- Allows N instances to BLPOP the same shared queue (standard competing-consumer load balancing: each job goes to exactly one instance).
- Targeted: If user or scheduler specifies a group and/or instance, push to the most specific queue
- Instance queue (highest specificity):
- hero:q:work:type:{script_type}:group:{group}:inst:{instance}
- Else Group queue:
- hero:q:work:type:{script_type}:group:{group}
- Else Type queue (fallback):
- hero:q:work:type:{script_type}
- Priority queues (optional extension):
- Append :prio:{level} to any of the above
- Actors BLPOP a list of queues in priority order
Example routing
- No group/instance specified:
- LPUSH hero:q:work:type:osis {job_id}
- Group specified ("default"), no instance:
- LPUSH hero:q:work:type:osis:group:default {job_id}
- Specific instance:
- LPUSH hero:q:work:type:osis:group:default:inst:2 {job_id}
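A minimal sketch of this selection logic (function and parameter names are hypothetical; the real Job fields may differ):

```rust
/// Pick the most specific work queue for a job, per the routing rules above.
fn select_work_queue(script_type: &str, group: Option<&str>, instance: Option<u32>) -> String {
    match (group, instance) {
        // Instance queue: highest specificity. An instance implies a group;
        // assume "default" when none is given.
        (g, Some(inst)) => format!(
            "hero:q:work:type:{}:group:{}:inst:{}",
            script_type,
            g.unwrap_or("default"),
            inst
        ),
        // Group queue: shared within the group.
        (Some(g), None) => format!("hero:q:work:type:{}:group:{}", script_type, g),
        // Type queue: the shared fallback.
        (None, None) => format!("hero:q:work:type:{}", script_type),
    }
}

// select_work_queue("osis", Some("default"), Some(2))
//   == "hero:q:work:type:osis:group:default:inst:2"
```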
## 3) Actor Consumption Strategy
- Actor identifies itself with:
- script_type (osis/sal/v/python)
- group (defaults to "default")
- instance number (unique within group)
- Actor registers presence:
- SET hero:meta:actor:inst:{script_type}:{group}:{instance} {...} EX 15
- Periodically refresh to act as heartbeat
- Actor BLPOP order:
1) Instance queue (most specific)
2) Group queue
3) Type queue
- This ensures targeted jobs are taken first when present; otherwise the actor falls back to the group or shared type queue (a sketch follows this list).
- Actors that implement reply-queue semantics will also LPUSH to hero:q:reply:{job_id} on completion. Others just update hero:job:{job_id} with status+output.
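A sketch of that consumption loop, assuming the redis crate (0.25+, where BLPOP takes a list of keys and an f64 timeout); job execution and error recovery are elided:

```rust
use redis::Commands;

fn actor_loop(script_type: &str, group: &str, instance: u32) -> redis::RedisResult<()> {
    let client = redis::Client::open("redis://127.0.0.1/")?;
    let mut con = client.get_connection()?;

    let presence_key = format!("hero:meta:actor:inst:{script_type}:{group}:{instance}");
    // BLPOP scans its keys left to right, so list them most-specific first.
    let queues = vec![
        format!("hero:q:work:type:{script_type}:group:{group}:inst:{instance}"),
        format!("hero:q:work:type:{script_type}:group:{group}"),
        format!("hero:q:work:type:{script_type}"),
    ];

    loop {
        // Presence doubles as a heartbeat: re-SET with a 15 s TTL each tick.
        // The JSON payload here is a placeholder for the fields listed above.
        let _: () = con.set_ex(&presence_key, r#"{"pid":1234}"#, 15)?;

        // Block up to 10 s across all three queues in priority order.
        let popped: Option<(String, String)> = con.blpop(&queues, 10.0)?;
        if let Some((_queue, job_id)) = popped {
            // ... load hero:job:{job_id}, execute, then record the result ...
        }
    }
}
```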
## 4) Backward Compatibility And Migration
- During transition, the Supervisor can LPUSH to both (a dual-push sketch follows at the end of this section):
- New canonical queues (hero:q:work:type:...)
- Selected legacy queues (hero:job:actor_queue:{suffix}, hero:job:{actor_id}, hero:work_queue:...)
- Actors:
- Update actors to BLPOP the canonical queues first, then fall back to the legacy queues
- Phased plan:
1) Introduce canonical queues alongside legacy; Supervisor pushes to both (compat mode)
2) Switch actors to consume canonical first
3) Deprecate legacy queues and remove dual-push
- No change to job hashes hero:job:{job_id}
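A dual-push sketch for the migration window (the legacy queue name and the config flag are illustrative):

```rust
use redis::Commands;

/// Push to the canonical queue; in compat mode, also feed one legacy queue.
fn push_job(
    con: &mut redis::Connection,
    canonical_queue: &str,
    legacy_queue: Option<&str>, // e.g. Some("hero:job:actor_queue:osis") in compat mode
    job_id: &str,
) -> redis::RedisResult<()> {
    let _: () = con.lpush(canonical_queue, job_id)?;
    if let Some(q) = legacy_queue {
        // Removed in Phase 3, once all consumers read the canonical queues.
        let _: () = con.lpush(q, job_id)?;
    }
    Ok(())
}
```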
## 5) Required Code Changes (by file)
Supervisor (routing and reply queue)
- Replace queue computation with canonical builder:
- [rust.Supervisor::get_actor_queue_key()](core/supervisor/src/lib.rs:410)
- Change to build canonical keys given script_type (+ optional group/instance from Job or policy)
- Update start logic to LPUSH to canonical queue(s):
- [rust.Supervisor::start_job_using_connection()](core/supervisor/src/lib.rs:599)
- Use only canonical queue(s). In migration phase, also LPUSH legacy queues.
- Standardize reply queue name:
- [rust.Supervisor::run_job_and_await_result()](core/supervisor/src/lib.rs:689)
- Use hero:q:reply:{job_id}
- Keep the “poll job hash” fallback for actors that don't use reply queues
- Stop queue naming:
- [rust.Supervisor::stop_job()](core/supervisor/src/lib.rs:789)
- Use hero:q:ctl:type:{script_type} in canonical mode
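A sketch of the run-and-wait path with the hash-polling fallback, assuming the redis crate (the "output" field name and the single-shot read are assumptions about the job hash layout; a real implementation would poll until the status is terminal):

```rust
use redis::Commands;

fn await_result(
    con: &mut redis::Connection,
    job_id: &str,
    timeout_secs: f64,
) -> redis::RedisResult<Option<String>> {
    // Reply-queue actors LPUSH the result here on completion.
    let reply_key = format!("hero:q:reply:{job_id}");
    let popped: Option<(String, String)> = con.blpop(&reply_key, timeout_secs)?;
    if let Some((_key, result)) = popped {
        return Ok(Some(result));
    }
    // Hash-only actors never reply; fall back to reading the job hash.
    con.hget(format!("hero:job:{job_id}"), "output")
}
```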
Actor (consumption and presence)
- In-crate Rhai actor:
- Queue key construction and BLPOP list:
- [rust.spawn_rhai_actor()](core/actor/src/lib.rs:211)
- Current queue_key at [core/actor/src/lib.rs:220]
- Replace single-queue BLPOP with multi-key BLPOP in priority order:
1) hero:q:work:type:{script_type}:group:{group}:inst:{instance}
2) hero:q:work:type:{script_type}:group:{group}
3) hero:q:work:type:{script_type}
- For migration, optionally include legacy queues last.
- Presence registration (periodic SET with TTL):
- Add at actor startup and refresh on loop tick
- For actors that implement reply queues:
- After finishing job, LPUSH hero:q:reply:{job_id} {result}
- For hash-only actors, continue to call [rust.Job::set_result()](core/job/src/lib.rs:322)
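A completion sketch covering both actor kinds (the "status"/"output" field names are assumptions; hash-only actors would keep calling Job::set_result() instead):

```rust
use redis::Commands;

/// Record the result on the job hash; reply-queue actors additionally LPUSH
/// it so a supervisor blocked on hero:q:reply:{job_id} wakes immediately.
fn finish_job(
    con: &mut redis::Connection,
    job_id: &str,
    result: &str,
    reply_queue_semantics: bool,
) -> redis::RedisResult<()> {
    let hash_key = format!("hero:job:{job_id}");
    let _: () = con.hset_multiple(&hash_key, &[("status", "Finished"), ("output", result)])?;
    if reply_queue_semantics {
        let _: () = con.lpush(format!("hero:q:reply:{job_id}"), result)?;
    }
    Ok(())
}
```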
Shared constants (avoid string drift)
- Introduce constants and helpers in a central crate (hero_job) to build keys consistently (sketched after this list):
- fn job_hash_key(job_id) -> "hero:job:{job_id}"
- fn reply_queue_key(job_id) -> "hero:q:reply:{job_id}"
- fn work_queue_type(script_type) -> "hero:q:work:type:{type}"
- fn work_queue_group(script_type, group) -> "hero:q:work:type:{type}:group:{group}"
- fn work_queue_instance(script_type, group, inst) -> "hero:q:work:type:{type}:group:{group}:inst:{inst}"
- Replace open-coded strings in:
- [rust.Supervisor](core/supervisor/src/lib.rs:1)
- [rust.Actor code](core/actor/src/lib.rs:1)
- Any CLI/TUI or interface components that reference queues
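One possible shape for those helpers (a sketch of the proposed hero_job additions, not existing code):

```rust
pub fn job_hash_key(job_id: &str) -> String {
    format!("hero:job:{job_id}")
}

pub fn reply_queue_key(job_id: &str) -> String {
    format!("hero:q:reply:{job_id}")
}

pub fn work_queue_type(script_type: &str) -> String {
    format!("hero:q:work:type:{script_type}")
}

pub fn work_queue_group(script_type: &str, group: &str) -> String {
    format!("{}:group:{group}", work_queue_type(script_type))
}

pub fn work_queue_instance(script_type: &str, group: &str, instance: u32) -> String {
    format!("{}:inst:{instance}", work_queue_group(script_type, group))
}
```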
Interfaces
- OpenRPC/WebSocket servers do not need to know queue names; they call the Supervisor API. No changes are required beyond following the Supervisor's behavior for “run-and-wait” vs “create+start+get_output” flows.
## 6) Example Scenarios
Scenario A: Single OSIS pool with two instances
- Actors:
- osis group=default inst=1
- osis group=default inst=2
- Incoming job (no targeting):
- LPUSH hero:q:work:type:osis {job_id}
- Actors BLPOP order:
- inst queue
- group queue
- type queue (this one supplies the job)
- Effective result: classic round-robin-like behavior; the two workers share the load.
Scenario B: SAL pool “io” with instance 3; targeted dispatch
- Job sets target group=io and instance=3
- Supervisor LPUSH hero:q:work:type:sal:group:io:inst:3 {job_id}
- Only that instance consumes it, enabling pinning to a specific worker.
Scenario C: Mixed old and new actors (migration window)
- Supervisor pushes to canonical queue(s) and to a legacy queue hero:job:actor_queue:osis
- New actors consume canonical queues
- Legacy actors consume legacy queue
- No job is stuck; both ecosystems coexist until the legacy path is removed.
## 7) Phased Migration Plan
Phase 0 (Docs + helpers)
- Add helpers in hero_job to compute keys (see “Shared constants”)
- Document the new scheme and consumption order (this file)
Phase 1 (Supervisor)
- Update [rust.Supervisor::get_actor_queue_key()](core/supervisor/src/lib.rs:410) and [rust.Supervisor::start_job_using_connection()](core/supervisor/src/lib.rs:599) to use canonical queues
- Keep dual-push to legacy queues behind a feature flag or config for rollout
- Standardize reply queue to hero:q:reply:{job_id} in [rust.Supervisor::run_job_and_await_result()](core/supervisor/src/lib.rs:689)
Phase 2 (Actors)
- Update [rust.spawn_rhai_actor()](core/actor/src/lib.rs:211) to BLPOP from canonical queues in priority order and to register presence keys
- Optionally emit reply to hero:q:reply:{job_id} in addition to hash-based result (feature flag)
Phase 3 (Cleanup)
- After all actors and Supervisor deployments are updated and stable, remove the legacy dual-push and fallback consume paths
## 8) Optional Enhancements
- Priority queues:
- Suffix queues with :prio:{0|1|2}; actors BLPOP [inst prio0, group prio0, type prio0, inst prio1, group prio1, type prio1, ...] (a key-list builder is sketched after this list)
- Rate limiting/back-pressure:
- Use presence metadata to signal busy state or report in-flight jobs; the Supervisor can target instance queues accordingly.
- Resilience:
- Move to Redis Streams for job event logs; lists remain fine for simple FIFO processing.
- Observability:
- hero:meta:actor:* and hero:meta:queue:stats:* to keep simple metrics for dashboards.
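A sketch of the priority expansion (names hypothetical):

```rust
/// Expand the (inst, group, type) queue list across priority levels so one
/// BLPOP call scans all prio-0 queues before any prio-1 queue.
fn blpop_keys_with_prio(base_queues: &[String], levels: u8) -> Vec<String> {
    (0..levels)
        .flat_map(|p| base_queues.iter().map(move |q| format!("{q}:prio:{p}")))
        .collect()
}

// blpop_keys_with_prio(&[inst_q, group_q, type_q], 2) yields
// [inst:prio:0, group:prio:0, type:prio:0, inst:prio:1, group:prio:1, type:prio:1]
```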
## 9) Summary
- Canonicalize to hero:q:work:type:{...} (+ group, + instance), and hero:q:reply:{job_id}
- Actors consume instance → group → type
- Supervisor pushes to most specific queue available, defaulting to type
- Provide helpers to build keys and remove ad-hoc string formatting
- Migrate with a dual-push (canonical + legacy) phase to avoid downtime
Proposed touchpoints to implement (clickable references)
- [rust.Supervisor::get_actor_queue_key()](core/supervisor/src/lib.rs:410)
- [rust.Supervisor::start_job_using_connection()](core/supervisor/src/lib.rs:599)
- [rust.Supervisor::run_job_and_await_result()](core/supervisor/src/lib.rs:689)
- [rust.spawn_rhai_actor()](core/actor/src/lib.rs:211)
- [core/actor/src/lib.rs](core/actor/src/lib.rs:220)
- [rust.Job::set_result()](core/job/src/lib.rs:322)