# Architecture of the `rhai_supervisor` Crate

The `rhai_supervisor` crate provides a Redis-based client library for submitting Rhai scripts to distributed worker services and awaiting their execution results. It implements a request-reply pattern using Redis as the message broker.

## Core Architecture

The client follows a builder pattern design with clear separation of concerns:

```mermaid
graph TD
    A[RhaiSupervisorBuilder] --> B[RhaiSupervisor]
    B --> C[PlayRequestBuilder]
    C --> D[PlayRequest]
    D --> E[Redis Task Queue]
    E --> F[Worker Service]
    F --> G[Redis Reply Queue]
    G --> H[Client Response]

    subgraph "Client Components"
        A
        B
        C
        D
    end

    subgraph "Redis Infrastructure"
        E
        G
    end

    subgraph "External Services"
        F
    end
```
## Key Components

### 1. RhaiSupervisorBuilder

A builder pattern implementation for constructing `RhaiSupervisor` instances with proper configuration validation.

**Responsibilities:**
- Configure the Redis connection URL
- Set the caller ID for task attribution
- Validate the configuration before building the client

**Key Methods:**
- `caller_id(id: &str)` - Sets the caller identifier
- `redis_url(url: &str)` - Configures the Redis connection
- `build()` - Creates the final `RhaiSupervisor` instance
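
A minimal construction sketch using the methods listed above; the `new()` constructor and the error type in the signature are assumptions, not confirmed API:

```rust
use rhai_supervisor::{RhaiSupervisor, RhaiSupervisorBuilder, RhaiSupervisorError};

fn build_supervisor() -> Result<RhaiSupervisor, RhaiSupervisorError> {
    RhaiSupervisorBuilder::new() // constructor name assumed
        .caller_id("my-service")
        .redis_url("redis://127.0.0.1/")
        .build()
}
```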

### 2. RhaiSupervisor

The main client interface that manages Redis connections and provides factory methods for creating play requests.

**Responsibilities:**
- Maintain the Redis connection pool
- Provide factory methods for request builders
- Handle low-level Redis operations
- Manage task status queries

**Key Methods:**
- `new_play_request()` - Creates a new `PlayRequestBuilder`
- `get_task_status(task_id)` - Queries task status from Redis
- Internal methods for Redis operations

### 3. PlayRequestBuilder

A fluent builder for constructing and submitting script execution requests.

**Responsibilities:**
- Configure script execution parameters
- Handle script loading from files or strings
- Manage request timeouts
- Provide submission methods (fire-and-forget vs. await-response)

**Key Methods:**
- `worker_id(id: &str)` - Target worker queue (determines which worker processes the task)
- `context_id(id: &str)` - Target context ID (determines the execution context/circle)
- `script(content: &str)` - Set script content directly
- `script_path(path: &str)` - Load script from a file
- `timeout(duration: Duration)` - Set the execution timeout
- `submit()` - Fire-and-forget submission
- `await_response()` - Submit and wait for the result

**Architecture Note:** The decoupling of `worker_id` and `context_id` allows a single worker to process tasks for multiple contexts (circles), providing greater deployment flexibility.
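
A round-trip sketch built from the methods listed above; the return type of `await_response()` and the async signatures are assumptions:

```rust
use std::time::Duration;
use rhai_supervisor::{RhaiSupervisor, RhaiSupervisorError};

// The output is assumed to come back as a String.
async fn run_script(supervisor: &RhaiSupervisor) -> Result<String, RhaiSupervisorError> {
    supervisor
        .new_play_request()
        .worker_id("worker-1")   // illustrative worker queue
        .context_id("circle-42") // illustrative context/circle
        .script(r#"print("hello from rhai");"#)
        .timeout(Duration::from_secs(30))
        .await_response()        // submit and block on the reply queue
        .await
}
```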

### 4. Data Structures

#### RhaiTaskDetails
Represents the complete state of a task throughout its lifecycle.

```rust
use chrono::{DateTime, Utc};

pub struct RhaiTaskDetails {
    pub task_id: String,
    pub script: String,
    pub status: String, // "pending", "processing", "completed", "error"
    pub output: Option<String>,
    pub error: Option<String>,
    pub created_at: DateTime<Utc>,
    pub updated_at: DateTime<Utc>,
    pub caller_id: String,
}
```

#### RhaiSupervisorError
Comprehensive error handling for the main failure scenarios:
- `RedisError` - Redis connection/operation failures
- `SerializationError` - JSON serialization/deserialization issues
- `Timeout` - Task execution timeouts
- `TaskNotFound` - Missing tasks after submission
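
A plausible shape for this enum; only the variant names come from the list above, the payload types are assumptions:

```rust
#[derive(Debug)]
pub enum RhaiSupervisorError {
    RedisError(redis::RedisError),          // payload type assumed
    SerializationError(serde_json::Error),  // payload type assumed
    Timeout(String),                        // task_id that timed out
    TaskNotFound(String),                   // task_id that could not be found
}
```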

## Communication Protocol

### Task Submission Flow

1. **Task Creation**: Client generates a unique UUID for task identification
2. **Task Storage**: Task details are stored in a Redis hash: `rhailib:<task_id>`
3. **Queue Submission**: Task ID is pushed to the worker queue: `rhailib:<worker_id>`
4. **Reply Queue Setup**: Client listens on `rhailib:reply:<task_id>`

### Redis Key Patterns

- **Task Storage**: `rhailib:<task_id>` (Redis Hash)
- **Worker Queues**: `rhailib:<worker_id>` (Redis List)
- **Reply Queues**: `rhailib:reply:<task_id>` (Redis List)

### Message Flow Diagram

```mermaid
sequenceDiagram
    participant C as Client
    participant R as Redis
    participant W as Worker

    C->>R: HSET rhailib:task_id (task details)
    C->>R: LPUSH rhailib:worker_id task_id
    C->>R: BLPOP rhailib:reply:task_id (blocking)

    W->>R: BRPOP rhailib:worker_id (blocking)
    W->>W: Execute Rhai Script
    W->>R: LPUSH rhailib:reply:task_id (result)

    R->>C: Return result from BLPOP
    C->>R: DEL rhailib:reply:task_id (cleanup)
```
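
A client-side sketch of this exchange against the `redis` crate's async API (exact signatures vary by crate version; the stored field set is abbreviated):

```rust
use redis::AsyncCommands;

async fn submit_and_await(
    con: &mut redis::aio::MultiplexedConnection,
    worker_id: &str,
    task_id: &str,
    script: &str,
    timeout_secs: f64,
) -> redis::RedisResult<Option<String>> {
    // HSET rhailib:<task_id> (field set abbreviated)
    let task_key = format!("rhailib:{task_id}");
    let _: () = con
        .hset_multiple(&task_key, &[("script", script), ("status", "pending")])
        .await?;

    // LPUSH rhailib:<worker_id> task_id
    let _: () = con.lpush(format!("rhailib:{worker_id}"), task_id).await?;

    // BLPOP rhailib:reply:<task_id>; a None reply means the timeout elapsed
    let reply_key = format!("rhailib:reply:{task_id}");
    let reply: Option<(String, String)> = con.blpop(&reply_key, timeout_secs).await?;

    // DEL rhailib:reply:<task_id> (cleanup)
    let _: () = con.del(&reply_key).await?;

    Ok(reply.map(|(_key, result)| result))
}
```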

## Concurrency and Async Design

The client is built on `tokio` for asynchronous operation:

- **Connection Pooling**: Uses Redis multiplexed connections for efficiency
- **Non-blocking Operations**: All Redis operations are async
- **Timeout Handling**: Configurable timeouts with proper cleanup
- **Error Propagation**: Comprehensive error handling with context

## Configuration and Deployment

### Prerequisites
- A Redis server accessible to both client and workers
- Proper network connectivity between components
- Sufficient Redis memory for task storage

### Configuration Options
- **Redis URL**: Connection string for the Redis instance
- **Caller ID**: Unique identifier for the client instance
- **Timeouts**: Per-request timeout configuration
- **Worker Targeting**: Direct worker queue addressing

## Security Considerations

- **Task Isolation**: Each task uses a unique identifier
- **Queue Separation**: Worker-specific queues prevent cross-contamination
- **Cleanup**: Automatic cleanup of reply queues after completion
- **Error Handling**: Secure error propagation without sensitive data leakage

## Performance Characteristics

- **Scalability**: Horizontal scaling through multiple worker instances
- **Throughput**: Limited by Redis performance and network latency
- **Memory Usage**: Efficient with connection pooling and cleanup
- **Latency**: Low latency for local Redis deployments

## Integration Points

The client integrates with:
- **Worker Services**: Via the Redis queue protocol
- **Monitoring Systems**: Through structured logging
- **Application Code**: Via the builder-pattern API
- **Configuration Systems**: Through environment variables and builders
# Hero Supervisor Protocol

This document describes the Redis-based protocol used by the Hero Supervisor for job management and worker communication.

## Overview

The Hero Supervisor uses Redis as a message broker and data store for managing distributed job execution. Jobs are stored as Redis hashes, and communication with workers happens through Redis lists (queues).

## Redis Namespace

All supervisor-related keys use the `hero:` namespace prefix to avoid conflicts with other Redis usage.

## Data Structures

### Job Storage

Jobs are stored as Redis hashes with the following key pattern:
```
hero:job:{job_id}
```

**Job Hash Fields:**
- `id`: Unique job identifier (UUID v4)
- `caller_id`: Identifier of the client that created the job
- `worker_id`: Target worker identifier
- `context_id`: Execution context identifier
- `script`: Script content to execute (Rhai or HeroScript)
- `timeout`: Execution timeout in seconds
- `retries`: Number of retry attempts
- `concurrent`: Whether to execute in a separate thread (true/false)
- `log_path`: Optional path to a log file for job output
- `created_at`: Job creation timestamp (ISO 8601)
- `updated_at`: Job last-update timestamp (ISO 8601)
- `status`: Current job status (dispatched/started/error/finished)
- `env_vars`: Environment variables as a JSON object (optional)
- `prerequisites`: JSON array of job IDs that must complete before this job (optional)
- `dependents`: JSON array of job IDs that depend on this job completing (optional)
- `output`: Job execution result (set by the worker)
- `error`: Error message if the job failed (set by the worker)
- `dependencies`: List of job IDs that this job depends on
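
Sketched as a Rust struct; the field names mirror the list above, but the concrete types are inferred from the descriptions and are assumptions, not the crate's actual definitions:

```rust
use std::collections::HashMap;
use chrono::{DateTime, Utc};

pub struct Job {
    pub id: String,                                // UUID v4
    pub caller_id: String,
    pub worker_id: String,
    pub context_id: String,
    pub script: String,                            // Rhai or HeroScript
    pub timeout: u64,                              // seconds
    pub retries: u32,
    pub concurrent: bool,                          // execute in a separate thread
    pub log_path: Option<String>,
    pub created_at: DateTime<Utc>,                 // ISO 8601 on the wire
    pub updated_at: DateTime<Utc>,
    pub status: JobStatus,
    pub env_vars: Option<HashMap<String, String>>, // JSON object on the wire
    pub prerequisites: Option<Vec<String>>,        // JSON array on the wire
    pub dependents: Option<Vec<String>>,           // JSON array on the wire
    pub output: Option<String>,                    // set by the worker
    pub error: Option<String>,                     // set by the worker
    pub dependencies: Vec<String>,
}

pub enum JobStatus {
    Dispatched,
    Started,
    Error,
    Finished,
}
```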

### Job Dependencies

Jobs can have dependencies on other jobs, which are stored in the `dependencies` field. A job will not be dispatched until all its dependencies have completed successfully.
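
A dispatcher-side sketch of that gate, assuming the `redis` crate's async API and that a completed dependency carries status `"finished"`:

```rust
use redis::AsyncCommands;

async fn ready_to_dispatch(
    con: &mut redis::aio::MultiplexedConnection,
    dependencies: &[String],
) -> redis::RedisResult<bool> {
    for dep_id in dependencies {
        // A dependency counts as complete only once its status is "finished"
        let status: Option<String> = con.hget(format!("hero:job:{dep_id}"), "status").await?;
        if status.as_deref() != Some("finished") {
            return Ok(false);
        }
    }
    Ok(true)
}
```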

### Work Queues

Jobs are queued for execution using Redis lists:
```
hero:work_queue:{worker_id}
```

Workers listen on their specific queue using `BLPOP` for job IDs to process.

### Stop Queues

Job stop requests are sent through dedicated stop queues:
```
hero:stop_queue:{worker_id}
```

Workers monitor these queues to receive stop requests for running jobs.

### Reply Queues

For synchronous job execution, dedicated reply queues are used:
```
hero:reply:{job_id}
```

Workers send results to these queues when jobs complete.

## Job Lifecycle

### 1. Job Creation
```
Client -> Redis: HSET hero:job:{job_id} {job_fields}
```

### 2. Job Submission
```
Client -> Redis: LPUSH hero:work_queue:{worker_id} {job_id}
```

### 3. Job Processing
```
Worker -> Redis: BLPOP hero:work_queue:{worker_id}
Worker -> Redis: HSET hero:job:{job_id} status "started"
Worker: Execute script
Worker -> Redis: HSET hero:job:{job_id} status "finished" output "{result}"
```

### 4. Job Completion (Async)
```
Worker -> Redis: LPUSH hero:reply:{job_id} {result}
```

## API Operations

### List Jobs
```rust
supervisor.list_jobs() -> Vec<String>
```
**Redis Operations:**
- `KEYS hero:job:*` - Get all job keys
- Extract job IDs from the key names

### Stop Job
```rust
supervisor.stop_job(job_id) -> Result<(), SupervisorError>
```
**Redis Operations:**
- `LPUSH hero:stop_queue:{worker_id} {job_id}` - Send the stop request

### Get Job Status
```rust
supervisor.get_job_status(job_id) -> Result<JobStatus, SupervisorError>
```
**Redis Operations:**
- `HGETALL hero:job:{job_id}` - Get job data
- Parse the `status` field

### Get Job Logs
```rust
supervisor.get_job_logs(job_id) -> Result<Option<String>, SupervisorError>
```
**Redis Operations:**
- `HGETALL hero:job:{job_id}` - Get job data
- Read the `log_path` field
- Read the log file from the filesystem
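
A sketch of that lookup; it fetches only the one field with `HGET` rather than `HGETALL`, and the `tokio::fs` call and boxed error type are implementation assumptions:

```rust
use redis::AsyncCommands;

async fn get_job_logs(
    con: &mut redis::aio::MultiplexedConnection,
    job_id: &str,
) -> Result<Option<String>, Box<dyn std::error::Error>> {
    // Read the log_path field from the job hash
    let log_path: Option<String> = con.hget(format!("hero:job:{job_id}"), "log_path").await?;
    match log_path {
        Some(path) => Ok(Some(tokio::fs::read_to_string(path).await?)),
        None => Ok(None),
    }
}
```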

### Run Job and Await Result
```rust
supervisor.run_job_and_await_result(job, worker_id) -> Result<String, SupervisorError>
```
**Redis Operations:**
1. `HSET hero:job:{job_id} {job_fields}` - Store the job
2. `LPUSH hero:work_queue:{worker_id} {job_id}` - Submit the job
3. `BLPOP hero:reply:{job_id} {timeout}` - Wait for the result
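
The three operations expanded as a sketch (field set abbreviated; a `None` from `BLPOP` corresponds to the timeout case described under Error Handling below):

```rust
use redis::AsyncCommands;

async fn run_job_and_await_result(
    con: &mut redis::aio::MultiplexedConnection,
    job_id: &str,
    worker_id: &str,
    script: &str,
    timeout_secs: f64,
) -> redis::RedisResult<Option<String>> {
    // 1. Store the job (field set abbreviated)
    let _: () = con
        .hset_multiple(
            &format!("hero:job:{job_id}"),
            &[("id", job_id), ("script", script), ("status", "dispatched")],
        )
        .await?;

    // 2. Submit the job to the worker's queue
    let _: () = con.lpush(format!("hero:work_queue:{worker_id}"), job_id).await?;

    // 3. Block on the reply queue; None means the timeout elapsed
    let reply: Option<(String, String)> = con
        .blpop(format!("hero:reply:{job_id}"), timeout_secs)
        .await?;
    Ok(reply.map(|(_queue, result)| result))
}
```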

## Worker Protocol

### Job Processing Loop

The core loop, sketched as async Rust against the `redis` crate (exact signatures vary by crate version; `execute_script` stands in for the worker's script engine):

```rust
use std::collections::HashMap;
use redis::AsyncCommands;

async fn worker_loop(
    mut con: redis::aio::MultiplexedConnection,
    worker_id: &str,
) -> redis::RedisResult<()> {
    let work_queue = format!("hero:work_queue:{worker_id}");
    let stop_queue = format!("hero:stop_queue:{worker_id}");

    loop {
        // 1. Wait for a job (a timeout of 0.0 blocks indefinitely)
        let (_queue, job_id): (String, String) = con.blpop(&work_queue, 0.0).await?;
        let job_key = format!("hero:job:{job_id}");

        // 2. Get job details
        let job: HashMap<String, String> = con.hgetall(&job_key).await?;

        // 3. Update status
        let _: () = con.hset(&job_key, "status", "started").await?;

        // 4. Check for stop requests targeting this job
        let stops: Vec<String> = con.lrange(&stop_queue, 0, -1).await?;
        if stops.iter().any(|id| id == &job_id) {
            let _: () = con
                .hset_multiple(&job_key, &[("status", "error"), ("error", "stopped")])
                .await?;
            let _: () = con.lrem(&stop_queue, 1, &job_id).await?;
            continue;
        }

        // 5. Execute the script
        let result = execute_script(job.get("script").map(String::as_str).unwrap_or(""));

        // 6. Update the job with the result
        let _: () = con
            .hset_multiple(&job_key, &[("status", "finished"), ("output", result.as_str())])
            .await?;

        // 7. Send the reply for synchronous callers (pushed unconditionally here;
        //    a waiting client consumes it and then deletes the reply queue)
        let _: () = con.lpush(format!("hero:reply:{job_id}"), &result).await?;
    }
}

// Stand-in for the worker's script engine.
fn execute_script(script: &str) -> String {
    format!("executed: {script}")
}
```

### Stop Request Handling
Workers should periodically check the stop queue during long-running jobs, for example with a helper like this sketch (same crate assumptions as above):
```rust
use redis::AsyncCommands;
use redis::aio::MultiplexedConnection;

async fn check_stop(con: &mut MultiplexedConnection, worker_id: &str, job_id: &str) -> redis::RedisResult<bool> {
    let stop_queue = format!("hero:stop_queue:{worker_id}");
    let stop_requests: Vec<String> = con.lrange(&stop_queue, 0, -1).await?;
    if stop_requests.iter().any(|id| id == job_id) {
        // Stop the current job, record why, and remove the stop request
        let _: () = con
            .hset_multiple(
                &format!("hero:job:{job_id}"),
                &[("status", "error"), ("error", "stopped_by_request")],
            )
            .await?;
        let _: () = con.lrem(&stop_queue, 1, job_id).await?;
        return Ok(true); // caller aborts execution
    }
    Ok(false)
}
```

## Error Handling

### Job Timeouts
- The client sets a timeout when creating the job
- The worker should respect the timeout and stop execution
- If the timeout is exceeded: `HSET hero:job:{job_id} status "error" error "timeout"`

### Worker Failures
- If a worker crashes, the job remains in "started" status
- Monitoring systems can detect stale jobs and retry them
- Jobs can be requeued: `LPUSH hero:work_queue:{worker_id} {job_id}`

### Redis Connection Issues
- Clients should implement retry logic with exponential backoff
- Workers should reconnect and resume processing
- Use Redis persistence to survive Redis restarts

## Monitoring and Observability
### Queue Monitoring
|
||||
```bash
|
||||
# Check work queue length
|
||||
LLEN hero:work_queue:{worker_id}
|
||||
|
||||
# Check stop queue length
|
||||
LLEN hero:stop_queue:{worker_id}
|
||||
|
||||
# List all jobs
|
||||
KEYS hero:job:*
|
||||
|
||||
# Get job details
|
||||
HGETALL hero:job:{job_id}
|
||||
```

### Metrics to Track
- Jobs created per second
- Jobs completed per second
- Average job execution time
- Queue depths
- Worker availability
- Error rates by job type

## Security Considerations

### Redis Security
- Use Redis AUTH for authentication
- Enable TLS for Redis connections
- Restrict Redis network access
- Use Redis ACLs to limit worker permissions

### Job Security
- Validate script content before execution
- Sandbox the script execution environment
- Limit resource usage (CPU, memory, disk)
- Log all job executions for auditing

### Log File Security
- Ensure log paths stay within allowed directories
- Validate log file permissions
- Rotate and archive logs regularly
- Sanitize sensitive data in logs

## Performance Considerations

### Redis Optimization
- Use Redis pipelining for batch operations
- Configure appropriate Redis memory limits
- Use Redis clustering for high availability
- Monitor Redis memory usage and eviction

### Job Optimization
- Keep job payloads small
- Use efficient serialization formats
- Batch similar jobs when possible
- Implement job prioritization if needed

### Worker Optimization
- Pool worker connections to Redis
- Use async I/O for Redis operations
- Implement graceful shutdown handling
- Monitor worker resource usage