rename worker to actor

Timur Gordon
2025-08-05 15:44:33 +02:00
parent 5283f383b3
commit 89e953ca1d
67 changed files with 1629 additions and 1737 deletions


@@ -1,10 +1,10 @@
# Hero Supervisor Protocol
-This document describes the Redis-based protocol used by the Hero Supervisor for job management and worker communication.
+This document describes the Redis-based protocol used by the Hero Supervisor for job management and actor communication.
## Overview
-The Hero Supervisor uses Redis as a message broker and data store for managing distributed job execution. Jobs are stored as Redis hashes, and communication with workers happens through Redis lists (queues).
+The Hero Supervisor uses Redis as a message broker and data store for managing distributed job execution. Jobs are stored as Redis hashes, and communication with actors happens through Redis lists (queues).
## Redis Namespace
@@ -22,7 +22,7 @@ hero:job:{job_id}
**Job Hash Fields:**
- `id`: Unique job identifier (UUID v4)
- `caller_id`: Identifier of the client that created the job
-- `worker_id`: Target worker identifier
+- `actor_id`: Target actor identifier
- `context_id`: Execution context identifier
- `script`: Script content to execute (Rhai or HeroScript)
- `timeout`: Execution timeout in seconds
@@ -35,8 +35,8 @@ hero:job:{job_id}
- `env_vars`: Environment variables as JSON object (optional)
- `prerequisites`: JSON array of job IDs that must complete before this job (optional)
- `dependents`: JSON array of job IDs that depend on this job completing (optional)
-- `output`: Job execution result (set by worker)
-- `error`: Error message if job failed (set by worker)
+- `output`: Job execution result (set by actor)
+- `error`: Error message if job failed (set by actor)
- `dependencies`: List of job IDs that this job depends on
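The job hash above maps naturally onto a plain field map. A minimal sketch follows; `new_job_fields` and the initial `"created"` status value are illustrative assumptions, not part of this protocol:

```rust
use std::collections::HashMap;

// Sketch: build the field map for `HSET hero:job:{job_id}`.
// Field names come from the list above; the helper name and the
// initial "created" status are assumptions for illustration.
fn new_job_fields(
    id: &str,
    caller_id: &str,
    actor_id: &str,
    context_id: &str,
    script: &str,
    timeout_secs: u64,
) -> HashMap<String, String> {
    let mut f = HashMap::new();
    f.insert("id".to_string(), id.to_string());
    f.insert("caller_id".to_string(), caller_id.to_string());
    f.insert("actor_id".to_string(), actor_id.to_string());
    f.insert("context_id".to_string(), context_id.to_string());
    f.insert("script".to_string(), script.to_string());
    f.insert("timeout".to_string(), timeout_secs.to_string());
    f.insert("status".to_string(), "created".to_string()); // assumed initial status
    f
}

fn main() {
    let fields = new_job_fields("job-1", "cli", "actor-1", "ctx-1", "40 + 2", 30);
    assert_eq!(fields["actor_id"], "actor-1");
    assert_eq!(fields["timeout"], "30");
}
```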
### Job Dependencies
@@ -47,19 +47,19 @@ Jobs can have dependencies on other jobs, which are stored in the `dependencies`
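A job with entries in its `dependencies` field should only be queued once every listed job has finished. A minimal readiness check might look like this (function and variable names are illustrative):

```rust
use std::collections::HashSet;

// Sketch: a job is ready to queue once every job ID in its
// `dependencies` field has reached a finished state.
fn is_ready(dependencies: &[&str], finished: &HashSet<&str>) -> bool {
    dependencies.iter().all(|dep| finished.contains(dep))
}

fn main() {
    let finished: HashSet<&str> = ["job-a"].into_iter().collect();
    assert!(is_ready(&["job-a"], &finished));
    assert!(!is_ready(&["job-a", "job-b"], &finished));
}
```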
Jobs are queued for execution using Redis lists:
```
-hero:work_queue:{worker_id}
+hero:work_queue:{actor_id}
```
-Workers listen on their specific queue using `BLPOP` for job IDs to process.
+Actors listen on their specific queue using `BLPOP` for job IDs to process.
### Stop Queues
Job stop requests are sent through dedicated stop queues:
```
-hero:stop_queue:{worker_id}
+hero:stop_queue:{actor_id}
```
-Workers monitor these queues to receive stop requests for running jobs.
+Actors monitor these queues to receive stop requests for running jobs.
### Reply Queues
@@ -68,7 +68,7 @@ For synchronous job execution, dedicated reply queues are used:
hero:reply:{job_id}
```
-Workers send results to these queues when jobs complete.
+Actors send results to these queues when jobs complete.
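The three queue key patterns can be captured in small helpers. A sketch follows; the function names are illustrative, while the key formats come from this document:

```rust
// Sketch: helpers producing the Redis keys defined in this protocol.
fn job_key(job_id: &str) -> String {
    format!("hero:job:{job_id}")
}
fn work_queue_key(actor_id: &str) -> String {
    format!("hero:work_queue:{actor_id}")
}
fn stop_queue_key(actor_id: &str) -> String {
    format!("hero:stop_queue:{actor_id}")
}
fn reply_key(job_id: &str) -> String {
    format!("hero:reply:{job_id}")
}

fn main() {
    assert_eq!(job_key("42"), "hero:job:42");
    assert_eq!(work_queue_key("actor-1"), "hero:work_queue:actor-1");
}
```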
## Job Lifecycle
@@ -79,20 +79,20 @@ Client -> Redis: HSET hero:job:{job_id} {job_fields}
### 2. Job Submission
```
-Client -> Redis: LPUSH hero:work_queue:{worker_id} {job_id}
+Client -> Redis: LPUSH hero:work_queue:{actor_id} {job_id}
```
### 3. Job Processing
```
-Worker -> Redis: BLPOP hero:work_queue:{worker_id}
-Worker -> Redis: HSET hero:job:{job_id} status "started"
-Worker: Execute script
-Worker -> Redis: HSET hero:job:{job_id} status "finished" output "{result}"
+Actor -> Redis: BLPOP hero:work_queue:{actor_id}
+Actor -> Redis: HSET hero:job:{job_id} status "started"
+Actor: Execute script
+Actor -> Redis: HSET hero:job:{job_id} status "finished" output "{result}"
```
### 4. Job Completion (Async)
```
-Worker -> Redis: LPUSH hero:reply:{job_id} {result}
+Actor -> Redis: LPUSH hero:reply:{job_id} {result}
```
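The lifecycle steps above can be simulated in memory to show the data flow, with a `HashMap` standing in for the job hash and a `VecDeque` for the work queue. This is purely illustrative; no real Redis commands are issued:

```rust
use std::collections::{HashMap, VecDeque};

// Sketch: steps 1-3 of the lifecycle, simulated with in-memory stand-ins.
fn run_lifecycle() -> HashMap<String, String> {
    let mut jobs: HashMap<String, HashMap<String, String>> = HashMap::new();
    let mut work_queue: VecDeque<String> = VecDeque::new();

    // 1. Creation: HSET hero:job:job-1 script "40 + 2"
    jobs.insert(
        "job-1".into(),
        HashMap::from([("script".into(), "40 + 2".into())]),
    );
    // 2. Submission: LPUSH hero:work_queue:actor-1 job-1
    work_queue.push_back("job-1".into());
    // 3. Processing: BLPOP, mark started, execute, mark finished with output
    let job_id = work_queue.pop_front().unwrap();
    let job = jobs.get_mut(&job_id).unwrap();
    job.insert("status".into(), "started".into());
    job.insert("output".into(), "42".into()); // pretend the script evaluated
    job.insert("status".into(), "finished".into());
    jobs.remove("job-1").unwrap()
}

fn main() {
    let job = run_lifecycle();
    assert_eq!(job["status"], "finished");
}
```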
## API Operations
@@ -110,7 +110,7 @@ supervisor.list_jobs() -> Vec<String>
supervisor.stop_job(job_id) -> Result<(), SupervisorError>
```
**Redis Operations:**
-- `LPUSH hero:stop_queue:{worker_id} {job_id}` - Send stop request
+- `LPUSH hero:stop_queue:{actor_id} {job_id}` - Send stop request
### Get Job Status
```rust
@@ -131,20 +131,20 @@ supervisor.get_job_logs(job_id) -> Result<Option<String>, SupervisorError>
### Run Job and Await Result
```rust
-supervisor.run_job_and_await_result(job, worker_id) -> Result<String, SupervisorError>
+supervisor.run_job_and_await_result(job, actor_id) -> Result<String, SupervisorError>
```
**Redis Operations:**
1. `HSET hero:job:{job_id} {job_fields}` - Store job
-2. `LPUSH hero:work_queue:{worker_id} {job_id}` - Submit job
+2. `LPUSH hero:work_queue:{actor_id} {job_id}` - Submit job
3. `BLPOP hero:reply:{job_id} {timeout}` - Wait for result
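Step 3's blocking wait with a deadline can be modelled in-process with a channel, where `recv_timeout` plays the role of `BLPOP`'s timeout argument. This is an analogy, not the Supervisor's actual implementation:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Sketch: block for a reply until a timeout, like BLPOP hero:reply:{job_id}.
// The error string "timeout" is illustrative.
fn await_result(rx: mpsc::Receiver<String>, timeout: Duration) -> Result<String, String> {
    rx.recv_timeout(timeout).map_err(|_| "timeout".to_string())
}

fn main() {
    let (tx, rx) = mpsc::channel();
    // An "actor" thread pushes the result onto the reply channel.
    thread::spawn(move || tx.send("42".to_string()).unwrap());
    assert_eq!(await_result(rx, Duration::from_secs(1)), Ok("42".to_string()));
}
```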
-## Worker Protocol
+## Actor Protocol
### Job Processing Loop
```rust
loop {
// 1. Wait for job
-job_id = BLPOP hero:work_queue:{worker_id}
+job_id = BLPOP hero:work_queue:{actor_id}
// 2. Get job details
job_data = HGETALL hero:job:{job_id}
@@ -153,8 +153,8 @@ loop {
HSET hero:job:{job_id} status "started"
// 4. Check for stop requests
-if LLEN hero:stop_queue:{worker_id} > 0 {
-stop_job_id = LPOP hero:stop_queue:{worker_id}
+if LLEN hero:stop_queue:{actor_id} > 0 {
+stop_job_id = LPOP hero:stop_queue:{actor_id}
if stop_job_id == job_id {
HSET hero:job:{job_id} status "error" error "stopped"
continue
@@ -175,15 +175,15 @@ loop {
```
### Stop Request Handling
-Workers should periodically check the stop queue during long-running jobs:
+Actors should periodically check the stop queue during long-running jobs:
```rust
-if LLEN hero:stop_queue:{worker_id} > 0 {
-stop_requests = LRANGE hero:stop_queue:{worker_id} 0 -1
+if LLEN hero:stop_queue:{actor_id} > 0 {
+stop_requests = LRANGE hero:stop_queue:{actor_id} 0 -1
if stop_requests.contains(current_job_id) {
// Stop current job execution
HSET hero:job:{current_job_id} status "error" error "stopped_by_request"
// Remove stop request
-LREM hero:stop_queue:{worker_id} 1 current_job_id
+LREM hero:stop_queue:{actor_id} 1 current_job_id
return
}
}
@@ -193,17 +193,17 @@ if LLEN hero:stop_queue:{worker_id} > 0 {
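The stop-request scan above can be mirrored in memory, with a `Vec` standing in for the Redis stop list. `LREM ... 1` removes a single matching entry, matched here by `position` plus `remove`; names are illustrative:

```rust
// Sketch: check whether the current job has a pending stop request and,
// if so, consume that request (like LREM hero:stop_queue:{actor_id} 1 id).
fn should_stop(stop_queue: &mut Vec<String>, current_job_id: &str) -> bool {
    if let Some(pos) = stop_queue.iter().position(|id| id == current_job_id) {
        stop_queue.remove(pos);
        true
    } else {
        false
    }
}

fn main() {
    let mut queue = vec!["job-9".to_string(), "job-1".to_string()];
    assert!(should_stop(&mut queue, "job-1"));
    assert_eq!(queue, vec!["job-9".to_string()]);
    assert!(!should_stop(&mut queue, "job-1"));
}
```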
### Job Timeouts
- Client sets timeout when creating job
-- Worker should respect timeout and stop execution
+- Actor should respect timeout and stop execution
- If timeout exceeded: `HSET hero:job:{job_id} status "error" error "timeout"`
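Timeout enforcement on the actor side reduces to comparing elapsed time against the job's `timeout` field. A minimal sketch, assuming `started_at` is recorded when the job moves to "started":

```rust
use std::time::{Duration, Instant};

// Sketch: has the job exceeded its timeout? If so, the actor should set
// status "error" error "timeout" on the job hash.
fn timed_out(started_at: Instant, timeout: Duration) -> bool {
    started_at.elapsed() >= timeout
}

fn main() {
    let started = Instant::now();
    assert!(!timed_out(started, Duration::from_secs(30)));
    assert!(timed_out(started - Duration::from_secs(31), Duration::from_secs(30)));
}
```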
-### Worker Failures
-- If worker crashes, job remains in "started" status
+### Actor Failures
+- If actor crashes, job remains in "started" status
- Monitoring systems can detect stale jobs and retry
-- Jobs can be requeued: `LPUSH hero:work_queue:{worker_id} {job_id}`
+- Jobs can be requeued: `LPUSH hero:work_queue:{actor_id} {job_id}`
### Redis Connection Issues
- Clients should implement retry logic with exponential backoff
-- Workers should reconnect and resume processing
+- Actors should reconnect and resume processing
- Use Redis persistence to survive Redis restarts
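Exponential backoff for reconnects can be sketched as below; the base delay, doubling factor, and cap are illustrative values, not prescribed by this document:

```rust
use std::time::Duration;

// Sketch: capped exponential backoff for Redis reconnect attempts.
// 100 ms base, doubling per attempt, capped at 30 s (all assumed values).
fn backoff_delay(attempt: u32) -> Duration {
    let base_ms: u64 = 100;
    let cap_ms: u64 = 30_000;
    let factor = 2u64.saturating_pow(attempt);
    Duration::from_millis(base_ms.saturating_mul(factor).min(cap_ms))
}

fn main() {
    assert_eq!(backoff_delay(0), Duration::from_millis(100));
    assert_eq!(backoff_delay(3), Duration::from_millis(800));
    assert_eq!(backoff_delay(20), Duration::from_millis(30_000)); // capped
}
```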
## Monitoring and Observability
@@ -211,10 +211,10 @@ if LLEN hero:stop_queue:{worker_id} > 0 {
### Queue Monitoring
```bash
# Check work queue length
-LLEN hero:work_queue:{worker_id}
+LLEN hero:work_queue:{actor_id}
# Check stop queue length
-LLEN hero:stop_queue:{worker_id}
+LLEN hero:stop_queue:{actor_id}
# List all jobs
KEYS hero:job:*
@@ -228,7 +228,7 @@ HGETALL hero:job:{job_id}
- Jobs completed per second
- Average job execution time
- Queue depths
-- Worker availability
+- Actor availability
- Error rates by job type
## Security Considerations
@@ -237,7 +237,7 @@ HGETALL hero:job:{job_id}
- Use Redis AUTH for authentication
- Enable TLS for Redis connections
- Restrict Redis network access
-- Use Redis ACLs to limit worker permissions
+- Use Redis ACLs to limit actor permissions
### Job Security
- Validate script content before execution
@@ -265,8 +265,8 @@ HGETALL hero:job:{job_id}
- Batch similar jobs when possible
- Implement job prioritization if needed
-### Worker Optimization
-- Pool worker connections to Redis
+### Actor Optimization
+- Pool actor connections to Redis
- Use async I/O for Redis operations
- Implement graceful shutdown handling
-- Monitor worker resource usage
+- Monitor actor resource usage