2025-03-09 18:34:58 +01:00
parent e4045ef179
commit 1489b9f44f
2 changed files with 184 additions and 0 deletions
--- a/lib/concurrency_instructions.md
+++ b/lib/concurrency_instructions.md
--- a/lib/core/jobs/model/README.md
+++ b/lib/core/jobs/model/README.md
@@ -0,0 +1,184 @@
+# HeroLib Job Manager
+
+This document explains the job management system in HeroLib, which is designed to coordinate distributed task execution across multiple agents.
+
+## Core Components
+
+### 1. Job System
+
+The job system is the central component that manages tasks to be executed by agents. It consists of:
+
+- **Job**: Represents a task to be executed by an agent. Each job has:
+  - A unique GUID
+  - Target agents (public keys of agents that can execute the job)
+  - Source (public key of the agent requesting the job)
+  - Circle and context (organizational structure)
+  - Actor and action (what needs to be executed)
+  - Parameters (data needed for execution)
+  - Timeout settings
+  - Status information
+  - Dependencies on other jobs
+
+- **JobStatus**: Tracks the state of a job through its lifecycle:
+  - created → scheduled → planned → running → ok or error
+
+- **JobManager**: Handles CRUD operations for jobs, storing them in Redis under the `herorunner:jobs` key.
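+
+The lifecycle listed under **JobStatus** can be read as a small state machine. The following is a minimal sketch, assuming a hypothetical `Status` enum and `can_transition` helper that are not taken from the HeroLib source; it only encodes the order of states described above.
+
+```v
+// Hypothetical enum for illustration; the real JobStatus definition may differ.
+enum Status {
+    created
+    scheduled
+    planned
+    running
+    ok
+    error
+}
+
+// can_transition reports whether moving from `current` to `target`
+// follows the documented order created → scheduled → planned → running → ok or error.
+fn can_transition(current Status, target Status) bool {
+    return match current {
+        .created { target == .scheduled }
+        .scheduled { target == .planned }
+        .planned { target == .running }
+        .running { target in [.ok, .error] }
+        .ok, .error { false } // terminal states
+    }
+}
+
+fn main() {
+    assert can_transition(.created, .scheduled)
+    assert can_transition(.running, .error)
+    assert !can_transition(.created, .running)
+    assert !can_transition(.ok, .running)
+}
+```
+
+A check like this would let a manager reject updates that skip a state, although nothing in this document says the current implementation enforces the order.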
+
+### 2. Agent System
+
+The agent system represents the entities that can execute jobs:
+
+- **Agent**: Represents a service provider that can execute jobs. Each agent has:
+  - A public key (identifier)
+  - Network address and port
+  - Status information
+  - List of services it provides
+  - Cryptographic signature for verification
+
+- **AgentService**: Represents a service provided by an agent, with:
+  - Actor name
+  - Available actions
+  - Status information
+
+- **AgentManager**: Handles CRUD operations for agents, storing them in Redis under the `herorunner:agents` key.
+
+### 3. Service System
+
+The service system defines the capabilities available in the system:
+
+- **Service**: Represents a capability that can be provided by agents. Each service has:
+  - Actor name
+  - Available actions
+  - Status information
+  - Optional access control list
+
+- **ServiceAction**: Represents an action that can be performed by a service, with:
+  - Action name
+  - Parameters
+  - Optional access control list
+
+- **ServiceManager**: Handles CRUD operations for services, storing them in Redis under the `herorunner:services` key.
+
+### 4. Access Control System
+
+The access control system manages permissions:
+
+- **Group**: Represents a collection of members (users or other groups)
+- **ACL**: Access Control List containing multiple ACEs
+- **ACE**: Access Control Entry defining permissions for users or groups
+- **GroupManager**: Handles CRUD operations for groups, storing them in Redis under the `herorunner:groups` key.
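+
+Because a group can contain other groups, a membership check has to walk the nesting. The following is a minimal sketch under assumed types: the `Group` struct and the `is_member` helper are hypothetical and do not reflect the actual HeroLib structs.
+
+```v
+// Hypothetical, simplified group type for illustration only.
+struct Group {
+    name    string
+    members []string // user identifiers or names of other groups
+}
+
+// is_member checks whether `user` belongs to `group_name`, directly or via nested groups.
+// `seen` records visited groups so that groups containing each other do not cause endless recursion.
+fn is_member(groups map[string]Group, group_name string, user string, mut seen map[string]bool) bool {
+    if seen[group_name] {
+        return false
+    }
+    seen[group_name] = true
+    group := groups[group_name] or { return false }
+    for member in group.members {
+        if member == user {
+            return true
+        }
+        if member in groups && is_member(groups, member, user, mut seen) {
+            return true
+        }
+    }
+    return false
+}
+
+fn main() {
+    groups := {
+        'admins': Group{name: 'admins', members: ['alice']}
+        'ops':    Group{name: 'ops', members: ['admins', 'bob']}
+    }
+    mut seen := map[string]bool{}
+    assert is_member(groups, 'ops', 'alice', mut seen) // alice is in ops via admins
+    mut seen2 := map[string]bool{}
+    assert !is_member(groups, 'admins', 'bob', mut seen2)
+}
+```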
+
+### 5. HeroRunner
+
+The `HeroRunner` is the main factory that brings all components together, providing a unified interface to the job management system.
+
+## How It Works
+
+1. **Job Creation and Scheduling**:
+   - A client creates a job with specific actor, action, and parameters
+   - The job is stored in Redis with status "created"
+   - The job can specify dependencies on other jobs
+
+2. **Agent Registration**:
+   - Agents register themselves with their public key, address, and services
+   - Each agent provides a list of services (actors) and actions it can perform
+   - Agents periodically update their status
+
+3. **Service Discovery**:
+   - Services define the capabilities available in the system
+   - Each service has a list of actions it can perform
+   - Services can have access control to restrict who can use them
+
+4. **Job Execution**:
+   - The herorunner process monitors jobs in Redis
+   - When a job is ready (all of its dependencies are satisfied), the herorunner changes the job's status to "scheduled"
+   - The herorunner forwards the job to an appropriate agent
+   - The agent changes job status to "planned", then "running", and finally "ok" or "error"
+   - If an agent fails, the herorunner can retry with another agent
+
+5. **Access Control**:
+   - Users and groups are organized in a hierarchical structure
+   - ACLs define who can access which services and actions
+   - The service manager checks access permissions before allowing job execution
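+
+The scheduling step in the flow above (wait for dependencies, then pick an agent that offers the requested actor and action) can be sketched as follows. The struct fields and the `is_ready` and `pick_agent` helpers are assumptions made for this example, not the HeroLib API, and the sketch ignores target agents and access control.
+
+```v
+// Hypothetical, simplified types; field names are assumptions for this sketch.
+struct Job {
+    guid         string
+    actor        string
+    action       string
+    dependencies []string // GUIDs of jobs that must reach status "ok" first
+}
+
+struct Agent {
+    pubkey  string
+    actions map[string][]string // actor name -> actions the agent offers
+}
+
+// is_ready reports whether every dependency of `job` has status "ok".
+fn is_ready(job Job, status_by_guid map[string]string) bool {
+    for dep in job.dependencies {
+        if status_by_guid[dep] != 'ok' {
+            return false
+        }
+    }
+    return true
+}
+
+// pick_agent returns the pubkey of the first agent offering the job's actor and action.
+fn pick_agent(job Job, agents []Agent) ?string {
+    for agent in agents {
+        if job.action in agent.actions[job.actor] {
+            return agent.pubkey
+        }
+    }
+    return none
+}
+
+fn main() {
+    job := Job{guid: 'job-2', actor: 'vm_manager', action: 'start', dependencies: ['job-1']}
+    agents := [Agent{pubkey: 'agent-456', actions: {'vm_manager': ['start']}}]
+    assert !is_ready(job, {'job-1': 'running'})
+    assert is_ready(job, {'job-1': 'ok'})
+    chosen := pick_agent(job, agents) or { '' }
+    assert chosen == 'agent-456'
+}
+```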
+
+## Data Storage
+
+All data is stored in Redis using the following keys:
+
+- `herorunner:jobs` - Hash map of job GUIDs to job JSON
+- `herorunner:agents` - Hash map of agent public keys to agent JSON
+- `herorunner:services` - Hash map of service actor names to service JSON
+- `herorunner:groups` - Hash map of group GUIDs to group JSON
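+
+To illustrate the layout, the sketch below uses a plain V map as an in-memory stand-in for the `herorunner:jobs` hash: the field is the job GUID and the value is the job encoded as JSON. The `Job` struct here is a hypothetical, trimmed-down version, and no Redis client is involved.
+
+```v
+import json
+
+// Hypothetical, trimmed-down job type for illustration only.
+struct Job {
+    guid   string
+    actor  string
+    action string
+}
+
+fn main() {
+    // In-memory stand-in for the `herorunner:jobs` Redis hash (GUID -> job JSON).
+    mut jobs_hash := map[string]string{}
+
+    job := Job{guid: 'job-123', actor: 'vm_manager', action: 'start'}
+    jobs_hash[job.guid] = json.encode(job)
+
+    // Reading a job back means looking up its GUID and decoding the JSON value.
+    restored := json.decode(Job, jobs_hash['job-123']) or { panic(err) }
+    assert restored.guid == 'job-123'
+    assert restored.actor == 'vm_manager'
+}
+```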
+
+## Potential Issues
+
+1. **Concurrency Management**:
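+   - The current implementation has no explicit locking for concurrent access to Redis
+   - Race conditions can occur when multiple processes update the same job at once: if two processes each read the job JSON, modify it, and write it back, the later write silently overwrites the earlier one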
+
+2. **Error Handling**:
+   - While there are error states, the error handling is minimal
+   - There's no robust mechanism for retrying failed jobs or handling partial failures
+
+3. **Dependency Resolution**:
+   - The code for resolving job dependencies is not fully implemented
+   - It's unclear how circular dependencies would be handled
+
+4. **Security Concerns**:
+   - While there's a signature field in the Agent struct, the verification process is not evident
+   - The ACL system is basic and might not handle complex permission scenarios
+
+5. **Scalability**:
+   - All data is stored in Redis, which could become a bottleneck with a large number of jobs
+   - There's no apparent sharding or partitioning strategy
+
+6. **Monitoring and Observability**:
+   - Limited mechanisms for monitoring the system's health
+   - No built-in logging or metrics collection
+
+## Recommendations
+
+1. Implement proper concurrency control using Redis transactions (`WATCH`/`MULTI`/`EXEC`) or locks
+2. Enhance error handling with more detailed error states and recovery mechanisms
+3. Develop a robust dependency resolution system with cycle detection
+4. Strengthen security by implementing proper signature verification and enhancing the ACL system
+5. Consider a more scalable storage solution for large deployments
+6. Add comprehensive logging and monitoring capabilities
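+
+For the cycle detection mentioned in recommendation 3, one common approach is a depth-first search that tracks which jobs are on the current path. The sketch below is one possible shape, not existing HeroLib code; `has_cycle` and `visit` are hypothetical helpers, and the dependency graph is assumed to be available as a map from job GUID to the GUIDs it depends on.
+
+```v
+// has_cycle reports whether the dependency graph contains a cycle.
+// `deps` maps a job GUID to the GUIDs of the jobs it depends on.
+fn has_cycle(deps map[string][]string) bool {
+    mut state := map[string]int{} // 0 = unvisited, 1 = on current path, 2 = finished
+    for guid, _ in deps {
+        if visit(guid, deps, mut state) {
+            return true
+        }
+    }
+    return false
+}
+
+// visit walks the dependencies of `guid` and returns true if it finds a cycle.
+fn visit(guid string, deps map[string][]string, mut state map[string]int) bool {
+    if state[guid] == 1 {
+        return true // reached a job that is still on the current path
+    }
+    if state[guid] == 2 {
+        return false // already fully explored
+    }
+    state[guid] = 1
+    for dep in deps[guid] {
+        if visit(dep, deps, mut state) {
+            return true
+        }
+    }
+    state[guid] = 2
+    return false
+}
+
+fn main() {
+    acyclic := {'c': ['b'], 'b': ['a'], 'a': []string{}}
+    cyclic := {'a': ['b'], 'b': ['a']}
+    assert !has_cycle(acyclic)
+    assert has_cycle(cyclic)
+}
+```
+
+Running such a check when a job is stored would reject circular dependencies before they can block scheduling.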
+
+## Usage Example
+
+```v
+// Assumes the `model` module from lib/core/jobs/model is already imported.
+// Initialize the HeroRunner
+mut hr := model.new()!
+
+// Create a new job
+mut job := hr.jobs.new()
+job.guid = 'job-123'
+job.actor = 'vm_manager'
+job.action = 'start'
+job.params['id'] = '10'
+hr.jobs.set(job)!
+
+// Register an agent
+mut agent := hr.agents.new()
+agent.pubkey = 'agent-456'
+agent.address = '192.168.1.100'
+agent.services << model.AgentService{
+    actor: 'vm_manager'
+    actions: [
+        model.AgentServiceAction{
+            action: 'start'
+            params: {'id': 'string'}
+        }
+    ]
+}
+hr.agents.set(agent)!
+
+// Define a service
+mut service := hr.services.new()
+service.actor = 'vm_manager'
+service.actions << model.ServiceAction{
+    action: 'start'
+    params: {'id': 'string'}
+}
+hr.services.set(service)!
+```