# Architecture of the `rhai_supervisor` Crate

The `rhai_supervisor` crate provides a Redis-based client library for submitting Rhai scripts to distributed worker services and awaiting their execution results. It implements a request-reply pattern using Redis as the message broker.

## Core Architecture

The client follows a builder pattern design with clear separation of concerns:

```mermaid
graph TD
    A[RhaiSupervisorBuilder] --> B[RhaiSupervisor]
    B --> C[PlayRequestBuilder]
    C --> D[PlayRequest]
    D --> E[Redis Task Queue]
    E --> F[Worker Service]
    F --> G[Redis Reply Queue]
    G --> H[Client Response]

    subgraph "Client Components"
        A
        B
        C
        D
    end

    subgraph "Redis Infrastructure"
        E
        G
    end

    subgraph "External Services"
        F
    end
```
## Key Components

### 1. RhaiSupervisorBuilder

A builder pattern implementation for constructing `RhaiSupervisor` instances with proper configuration validation.

**Responsibilities:**
- Configure the Redis connection URL
- Set the caller ID for task attribution
- Validate the configuration before building the client

**Key Methods:**
- `caller_id(id: &str)` - Sets the caller identifier
- `redis_url(url: &str)` - Configures the Redis connection
- `build()` - Creates the final `RhaiSupervisor` instance
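
A minimal construction sketch using the methods listed above; the `new()` constructor and the error type in the signature are assumptions, not confirmed API:

```rust
use rhai_supervisor::{RhaiSupervisor, RhaiSupervisorBuilder, RhaiSupervisorError};

fn build_supervisor() -> Result<RhaiSupervisor, RhaiSupervisorError> {
    RhaiSupervisorBuilder::new() // constructor name assumed
        .caller_id("my-service")
        .redis_url("redis://127.0.0.1/")
        .build()
}
```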

### 2. RhaiSupervisor

The main client interface that manages Redis connections and provides factory methods for creating play requests.

**Responsibilities:**
- Maintain the Redis connection pool
- Provide factory methods for request builders
- Handle low-level Redis operations
- Manage task status queries

**Key Methods:**
- `new_play_request()` - Creates a new `PlayRequestBuilder`
- `get_task_status(task_id)` - Queries task status from Redis
- Internal methods for Redis operations

### 3. PlayRequestBuilder

A fluent builder for constructing and submitting script execution requests.

**Responsibilities:**
- Configure script execution parameters
- Handle script loading from files or strings
- Manage request timeouts
- Provide submission methods (fire-and-forget vs. await-response)

**Key Methods:**
- `worker_id(id: &str)` - Target worker queue (determines which worker processes the task)
- `context_id(id: &str)` - Target context ID (determines the execution context/circle)
- `script(content: &str)` - Set script content directly
- `script_path(path: &str)` - Load script from a file
- `timeout(duration: Duration)` - Set the execution timeout
- `submit()` - Fire-and-forget submission
- `await_response()` - Submit and wait for the result

**Architecture Note:** The decoupling of `worker_id` and `context_id` allows a single worker to process tasks for multiple contexts (circles), providing greater deployment flexibility.
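
A round-trip sketch built from the methods listed above; the return type of `await_response()` and the async signatures are assumptions:

```rust
use std::time::Duration;
use rhai_supervisor::{RhaiSupervisor, RhaiSupervisorError};

// The output is assumed to come back as a String.
async fn run_script(supervisor: &RhaiSupervisor) -> Result<String, RhaiSupervisorError> {
    supervisor
        .new_play_request()
        .worker_id("worker-1")   // illustrative worker queue
        .context_id("circle-42") // illustrative context/circle
        .script(r#"print("hello from rhai");"#)
        .timeout(Duration::from_secs(30))
        .await_response()        // submit and block on the reply queue
        .await
}
```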

### 4. Data Structures

#### RhaiTaskDetails
Represents the complete state of a task throughout its lifecycle.

```rust
use chrono::{DateTime, Utc};

pub struct RhaiTaskDetails {
    pub task_id: String,
    pub script: String,
    pub status: String, // "pending", "processing", "completed", "error"
    pub output: Option<String>,
    pub error: Option<String>,
    pub created_at: DateTime<Utc>,
    pub updated_at: DateTime<Utc>,
    pub caller_id: String,
}
```

#### RhaiSupervisorError
Comprehensive error handling for the main failure scenarios:
- `RedisError` - Redis connection/operation failures
- `SerializationError` - JSON serialization/deserialization issues
- `Timeout` - Task execution timeouts
- `TaskNotFound` - Missing tasks after submission
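
A plausible shape for this enum; only the variant names come from the list above, the payload types are assumptions:

```rust
#[derive(Debug)]
pub enum RhaiSupervisorError {
    RedisError(redis::RedisError),          // payload type assumed
    SerializationError(serde_json::Error),  // payload type assumed
    Timeout(String),                        // task_id that timed out
    TaskNotFound(String),                   // task_id that could not be found
}
```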

## Communication Protocol

### Task Submission Flow

1. **Task Creation**: Client generates a unique UUID for task identification
2. **Task Storage**: Task details are stored in a Redis hash: `rhailib:<task_id>`
3. **Queue Submission**: Task ID is pushed to the worker queue: `rhailib:<worker_id>`
4. **Reply Queue Setup**: Client listens on `rhailib:reply:<task_id>`

### Redis Key Patterns

- **Task Storage**: `rhailib:<task_id>` (Redis Hash)
- **Worker Queues**: `rhailib:<worker_id>` (Redis List)
- **Reply Queues**: `rhailib:reply:<task_id>` (Redis List)

### Message Flow Diagram

```mermaid
sequenceDiagram
    participant C as Client
    participant R as Redis
    participant W as Worker

    C->>R: HSET rhailib:task_id (task details)
    C->>R: LPUSH rhailib:worker_id task_id
    C->>R: BLPOP rhailib:reply:task_id (blocking)

    W->>R: BRPOP rhailib:worker_id (blocking)
    W->>W: Execute Rhai Script
    W->>R: LPUSH rhailib:reply:task_id (result)

    R->>C: Return result from BLPOP
    C->>R: DEL rhailib:reply:task_id (cleanup)
```
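
A client-side sketch of this exchange against the `redis` crate's async API (exact signatures vary by crate version; the stored field set is abbreviated):

```rust
use redis::AsyncCommands;

async fn submit_and_await(
    con: &mut redis::aio::MultiplexedConnection,
    worker_id: &str,
    task_id: &str,
    script: &str,
    timeout_secs: f64,
) -> redis::RedisResult<Option<String>> {
    // HSET rhailib:<task_id> (field set abbreviated)
    let task_key = format!("rhailib:{task_id}");
    let _: () = con
        .hset_multiple(&task_key, &[("script", script), ("status", "pending")])
        .await?;

    // LPUSH rhailib:<worker_id> task_id
    let _: () = con.lpush(format!("rhailib:{worker_id}"), task_id).await?;

    // BLPOP rhailib:reply:<task_id>; a None reply means the timeout elapsed
    let reply_key = format!("rhailib:reply:{task_id}");
    let reply: Option<(String, String)> = con.blpop(&reply_key, timeout_secs).await?;

    // DEL rhailib:reply:<task_id> (cleanup)
    let _: () = con.del(&reply_key).await?;

    Ok(reply.map(|(_key, result)| result))
}
```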

## Concurrency and Async Design

The client is built on `tokio` for asynchronous operation:

- **Connection Pooling**: Uses Redis multiplexed connections for efficiency
- **Non-blocking Operations**: All Redis operations are async
- **Timeout Handling**: Configurable timeouts with proper cleanup
- **Error Propagation**: Comprehensive error handling with context

## Configuration and Deployment

### Prerequisites
- A Redis server accessible to both client and workers
- Proper network connectivity between components
- Sufficient Redis memory for task storage

### Configuration Options
- **Redis URL**: Connection string for the Redis instance
- **Caller ID**: Unique identifier for the client instance
- **Timeouts**: Per-request timeout configuration
- **Worker Targeting**: Direct worker queue addressing

## Security Considerations

- **Task Isolation**: Each task uses a unique identifier
- **Queue Separation**: Worker-specific queues prevent cross-contamination
- **Cleanup**: Automatic cleanup of reply queues after completion
- **Error Handling**: Secure error propagation without sensitive data leakage

## Performance Characteristics

- **Scalability**: Horizontal scaling through multiple worker instances
- **Throughput**: Limited by Redis performance and network latency
- **Memory Usage**: Efficient with connection pooling and cleanup
- **Latency**: Low latency for local Redis deployments

## Integration Points

The client integrates with:
- **Worker Services**: Via the Redis queue protocol
- **Monitoring Systems**: Through structured logging
- **Application Code**: Via the builder-pattern API
- **Configuration Systems**: Through environment variables and builders
# Hero Supervisor Protocol

This document describes the Redis-based protocol used by the Hero Supervisor for job management and worker communication.

## Overview

The Hero Supervisor uses Redis as a message broker and data store for managing distributed job execution. Jobs are stored as Redis hashes, and communication with workers happens through Redis lists (queues).

## Redis Namespace

All supervisor-related keys use the `hero:` namespace prefix to avoid conflicts with other Redis usage.

## Data Structures

### Job Storage

Jobs are stored as Redis hashes with the following key pattern:
```
hero:job:{job_id}
```

**Job Hash Fields:**
- `id`: Unique job identifier (UUID v4)
- `caller_id`: Identifier of the client that created the job
- `worker_id`: Target worker identifier
- `context_id`: Execution context identifier
- `script`: Script content to execute (Rhai or HeroScript)
- `timeout`: Execution timeout in seconds
- `retries`: Number of retry attempts
- `concurrent`: Whether to execute in a separate thread (true/false)
- `log_path`: Optional path to a log file for job output
- `created_at`: Job creation timestamp (ISO 8601)
- `updated_at`: Job last-update timestamp (ISO 8601)
- `status`: Current job status (dispatched/started/error/finished)
- `env_vars`: Environment variables as a JSON object (optional)
- `prerequisites`: JSON array of job IDs that must complete before this job (optional)
- `dependents`: JSON array of job IDs that depend on this job completing (optional)
- `output`: Job execution result (set by the worker)
- `error`: Error message if the job failed (set by the worker)
- `dependencies`: List of job IDs that this job depends on
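
Sketched as a Rust struct; the field names mirror the list above, but the concrete types are inferred from the descriptions and are assumptions, not the crate's actual definitions:

```rust
use std::collections::HashMap;
use chrono::{DateTime, Utc};

pub struct Job {
    pub id: String,                                // UUID v4
    pub caller_id: String,
    pub worker_id: String,
    pub context_id: String,
    pub script: String,                            // Rhai or HeroScript
    pub timeout: u64,                              // seconds
    pub retries: u32,
    pub concurrent: bool,                          // execute in a separate thread
    pub log_path: Option<String>,
    pub created_at: DateTime<Utc>,                 // ISO 8601 on the wire
    pub updated_at: DateTime<Utc>,
    pub status: JobStatus,
    pub env_vars: Option<HashMap<String, String>>, // JSON object on the wire
    pub prerequisites: Option<Vec<String>>,        // JSON array on the wire
    pub dependents: Option<Vec<String>>,           // JSON array on the wire
    pub output: Option<String>,                    // set by the worker
    pub error: Option<String>,                     // set by the worker
    pub dependencies: Vec<String>,
}

pub enum JobStatus {
    Dispatched,
    Started,
    Error,
    Finished,
}
```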

### Job Dependencies

Jobs can have dependencies on other jobs, which are stored in the `dependencies` field. A job will not be dispatched until all its dependencies have completed successfully.
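
A dispatcher-side sketch of that gate, assuming the `redis` crate's async API and that a completed dependency carries status `"finished"`:

```rust
use redis::AsyncCommands;

async fn ready_to_dispatch(
    con: &mut redis::aio::MultiplexedConnection,
    dependencies: &[String],
) -> redis::RedisResult<bool> {
    for dep_id in dependencies {
        // A dependency counts as complete only once its status is "finished"
        let status: Option<String> = con.hget(format!("hero:job:{dep_id}"), "status").await?;
        if status.as_deref() != Some("finished") {
            return Ok(false);
        }
    }
    Ok(true)
}
```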

### Work Queues

Jobs are queued for execution using Redis lists:
```
hero:work_queue:{worker_id}
```

Workers listen on their specific queue using `BLPOP` for job IDs to process.

### Stop Queues

Job stop requests are sent through dedicated stop queues:
```
hero:stop_queue:{worker_id}
```

Workers monitor these queues to receive stop requests for running jobs.

### Reply Queues

For synchronous job execution, dedicated reply queues are used:
```
hero:reply:{job_id}
```

Workers send results to these queues when jobs complete.

## Job Lifecycle

### 1. Job Creation
```
Client -> Redis: HSET hero:job:{job_id} {job_fields}
```

### 2. Job Submission
```
Client -> Redis: LPUSH hero:work_queue:{worker_id} {job_id}
```

### 3. Job Processing
```
Worker -> Redis: BLPOP hero:work_queue:{worker_id}
Worker -> Redis: HSET hero:job:{job_id} status "started"
Worker: Execute script
Worker -> Redis: HSET hero:job:{job_id} status "finished" output "{result}"
```

### 4. Job Completion (Async)
```
Worker -> Redis: LPUSH hero:reply:{job_id} {result}
```

## API Operations

### List Jobs
```rust
supervisor.list_jobs() -> Vec<String>
```
**Redis Operations:**
- `KEYS hero:job:*` - Get all job keys
- Extract job IDs from the key names

### Stop Job
```rust
supervisor.stop_job(job_id) -> Result<(), SupervisorError>
```
**Redis Operations:**
- `LPUSH hero:stop_queue:{worker_id} {job_id}` - Send the stop request

### Get Job Status
```rust
supervisor.get_job_status(job_id) -> Result<JobStatus, SupervisorError>
```
**Redis Operations:**
- `HGETALL hero:job:{job_id}` - Get job data
- Parse the `status` field

### Get Job Logs
```rust
supervisor.get_job_logs(job_id) -> Result<Option<String>, SupervisorError>
```
**Redis Operations:**
- `HGETALL hero:job:{job_id}` - Get job data
- Read the `log_path` field
- Read the log file from the filesystem
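
A sketch of that lookup; it fetches only the one field with `HGET` rather than `HGETALL`, and the `tokio::fs` call and boxed error type are implementation assumptions:

```rust
use redis::AsyncCommands;

async fn get_job_logs(
    con: &mut redis::aio::MultiplexedConnection,
    job_id: &str,
) -> Result<Option<String>, Box<dyn std::error::Error>> {
    // Read the log_path field from the job hash
    let log_path: Option<String> = con.hget(format!("hero:job:{job_id}"), "log_path").await?;
    match log_path {
        Some(path) => Ok(Some(tokio::fs::read_to_string(path).await?)),
        None => Ok(None),
    }
}
```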

### Run Job and Await Result
```rust
supervisor.run_job_and_await_result(job, worker_id) -> Result<String, SupervisorError>
```
**Redis Operations:**
1. `HSET hero:job:{job_id} {job_fields}` - Store the job
2. `LPUSH hero:work_queue:{worker_id} {job_id}` - Submit the job
3. `BLPOP hero:reply:{job_id} {timeout}` - Wait for the result
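
The three operations expanded as a sketch (field set abbreviated; a `None` from `BLPOP` corresponds to the timeout case described under Error Handling below):

```rust
use redis::AsyncCommands;

async fn run_job_and_await_result(
    con: &mut redis::aio::MultiplexedConnection,
    job_id: &str,
    worker_id: &str,
    script: &str,
    timeout_secs: f64,
) -> redis::RedisResult<Option<String>> {
    // 1. Store the job (field set abbreviated)
    let _: () = con
        .hset_multiple(
            &format!("hero:job:{job_id}"),
            &[("id", job_id), ("script", script), ("status", "dispatched")],
        )
        .await?;

    // 2. Submit the job to the worker's queue
    let _: () = con.lpush(format!("hero:work_queue:{worker_id}"), job_id).await?;

    // 3. Block on the reply queue; None means the timeout elapsed
    let reply: Option<(String, String)> = con
        .blpop(format!("hero:reply:{job_id}"), timeout_secs)
        .await?;
    Ok(reply.map(|(_queue, result)| result))
}
```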

## Worker Protocol

### Job Processing Loop

The core loop, sketched as async Rust against the `redis` crate (exact signatures vary by crate version; `execute_script` stands in for the worker's script engine):

```rust
use std::collections::HashMap;
use redis::AsyncCommands;

async fn worker_loop(
    mut con: redis::aio::MultiplexedConnection,
    worker_id: &str,
) -> redis::RedisResult<()> {
    let work_queue = format!("hero:work_queue:{worker_id}");
    let stop_queue = format!("hero:stop_queue:{worker_id}");

    loop {
        // 1. Wait for a job (a timeout of 0.0 blocks indefinitely)
        let (_queue, job_id): (String, String) = con.blpop(&work_queue, 0.0).await?;
        let job_key = format!("hero:job:{job_id}");

        // 2. Get job details
        let job: HashMap<String, String> = con.hgetall(&job_key).await?;

        // 3. Update status
        let _: () = con.hset(&job_key, "status", "started").await?;

        // 4. Check for stop requests targeting this job
        let stops: Vec<String> = con.lrange(&stop_queue, 0, -1).await?;
        if stops.iter().any(|id| id == &job_id) {
            let _: () = con
                .hset_multiple(&job_key, &[("status", "error"), ("error", "stopped")])
                .await?;
            let _: () = con.lrem(&stop_queue, 1, &job_id).await?;
            continue;
        }

        // 5. Execute the script
        let result = execute_script(job.get("script").map(String::as_str).unwrap_or(""));

        // 6. Update the job with the result
        let _: () = con
            .hset_multiple(&job_key, &[("status", "finished"), ("output", result.as_str())])
            .await?;

        // 7. Send the reply for synchronous callers (pushed unconditionally here;
        //    a waiting client consumes it and then deletes the reply queue)
        let _: () = con.lpush(format!("hero:reply:{job_id}"), &result).await?;
    }
}

// Stand-in for the worker's script engine.
fn execute_script(script: &str) -> String {
    format!("executed: {script}")
}
```

### Stop Request Handling
Workers should periodically check the stop queue during long-running jobs, for example with a helper like this sketch (same crate assumptions as above):
```rust
use redis::AsyncCommands;
use redis::aio::MultiplexedConnection;

async fn check_stop(con: &mut MultiplexedConnection, worker_id: &str, job_id: &str) -> redis::RedisResult<bool> {
    let stop_queue = format!("hero:stop_queue:{worker_id}");
    let stop_requests: Vec<String> = con.lrange(&stop_queue, 0, -1).await?;
    if stop_requests.iter().any(|id| id == job_id) {
        // Stop the current job, record why, and remove the stop request
        let _: () = con
            .hset_multiple(
                &format!("hero:job:{job_id}"),
                &[("status", "error"), ("error", "stopped_by_request")],
            )
            .await?;
        let _: () = con.lrem(&stop_queue, 1, job_id).await?;
        return Ok(true); // caller aborts execution
    }
    Ok(false)
}
```

## Error Handling

### Job Timeouts
- The client sets a timeout when creating the job
- The worker should respect the timeout and stop execution
- If the timeout is exceeded: `HSET hero:job:{job_id} status "error" error "timeout"`

### Worker Failures
- If a worker crashes, the job remains in "started" status
- Monitoring systems can detect stale jobs and retry them
- Jobs can be requeued: `LPUSH hero:work_queue:{worker_id} {job_id}`

### Redis Connection Issues
- Clients should implement retry logic with exponential backoff
- Workers should reconnect and resume processing
- Use Redis persistence to survive Redis restarts

## Monitoring and Observability
### Queue Monitoring
|
||||
```bash
|
||||
# Check work queue length
|
||||
LLEN hero:work_queue:{worker_id}
|
||||
|
||||
# Check stop queue length
|
||||
LLEN hero:stop_queue:{worker_id}
|
||||
|
||||
# List all jobs
|
||||
KEYS hero:job:*
|
||||
|
||||
# Get job details
|
||||
HGETALL hero:job:{job_id}
|
||||
```

### Metrics to Track
- Jobs created per second
- Jobs completed per second
- Average job execution time
- Queue depths
- Worker availability
- Error rates by job type

## Security Considerations

### Redis Security
- Use Redis AUTH for authentication
- Enable TLS for Redis connections
- Restrict Redis network access
- Use Redis ACLs to limit worker permissions

### Job Security
- Validate script content before execution
- Sandbox the script execution environment
- Limit resource usage (CPU, memory, disk)
- Log all job executions for auditing

### Log File Security
- Ensure log paths stay within allowed directories
- Validate log file permissions
- Rotate and archive logs regularly
- Sanitize sensitive data in logs

## Performance Considerations

### Redis Optimization
- Use Redis pipelining for batch operations
- Configure appropriate Redis memory limits
- Use Redis clustering for high availability
- Monitor Redis memory usage and eviction

### Job Optimization
- Keep job payloads small
- Use efficient serialization formats
- Batch similar jobs when possible
- Implement job prioritization if needed

### Worker Optimization
- Pool worker connections to Redis
- Use async I/O for Redis operations
- Implement graceful shutdown handling
- Monitor worker resource usage