first commit

2025-10-20 22:24:25 +02:00
commit 097360ad12
48 changed files with 6712 additions and 0 deletions
--- a/docs/specs/osiris-mvp.md
+++ b/docs/specs/osiris-mvp.md
@@ -0,0 +1,525 @@
+# OSIRIS MVP — Minimal Semantic Store over HeroDB
+
+## 0) Purpose
+
+OSIRIS is a Rust-native object layer on top of HeroDB that provides structured storage and retrieval capabilities without any server-side extensions or indexing engines.
+
+It provides:
+- Object CRUD operations
+- Namespace management
+- Simple local field indexing (field:*)
+- Basic keyword scan (substring matching)
+- CLI interface
+- Future: 9P filesystem interface
+
+It does **not** depend on HeroDB's Tantivy FTS, vectors, or relations.
+
+---
+
+## 1) Architecture
+
+```
+HeroDB (unmodified)
+│
+├── KV store + encryption
+└── RESP protocol
+    ↑
+    │
+    └── OSIRIS
+        ├── store/         – object schema + persistence
+        ├── index/         – field index & keyword scanning
+        ├── retrieve/      – query planner + filtering
+        ├── interfaces/    – CLI, 9P (future)
+        └── config/        – namespaces + settings
+```
+
+---
+
+## 2) Data Model
+
+```rust
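+use std::collections::BTreeMap;
+
+use serde::{Deserialize, Serialize};
+use time::OffsetDateTime;
+
+#[derive(Clone, Debug, Serialize, Deserialize)]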
+pub struct OsirisObject {
+    pub id: String,
+    pub ns: String,
+    pub meta: Metadata,
+    pub text: Option<String>,   // optional plain text
+}
+
+#[derive(Clone, Debug, Serialize, Deserialize)]
+pub struct Metadata {
+    pub title: Option<String>,
+    pub mime: Option<String>,
+    pub tags: BTreeMap<String, String>,
+    pub created: OffsetDateTime,
+    pub updated: OffsetDateTime,
+    pub size: Option<u64>,
+}
+```
+
+---
+
+## 3) Keyspace Design
+
+```
+meta:<id>             → serialized OsirisObject (JSON)
+field:tag:<key>=<val> → Set of IDs (for tag filtering)
+field:mime:<type>     → Set of IDs (for MIME type filtering)
+field:title:<title>   → Set of IDs (for title filtering)
+scan:index            → Set of all IDs (for full scan)
+```
+
+**Example:**
+```
+field:tag:project=osiris  → {note_1, note_2}
+field:mime:text/markdown  → {note_1, note_3}
+scan:index                → {note_1, note_2, note_3, ...}
+```
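+
+The key formats above can be centralized in a few helper functions so that the insert, delete, and query paths cannot drift apart. This is a minimal sketch; the function names (`meta_key`, `tag_key`, `mime_key`, `title_key`) and the `SCAN_INDEX` constant are illustrative assumptions, not names taken from the OSIRIS codebase.
+
+```rust
+/// Key of the set listing every object ID in a namespace (hypothetical constant name).
+pub const SCAN_INDEX: &str = "scan:index";
+
+/// `meta:<id>`: holds the JSON-serialized object.
+pub fn meta_key(id: &str) -> String {
+    format!("meta:{id}")
+}
+
+/// `field:tag:<key>=<val>`: set of IDs carrying this tag pair.
+pub fn tag_key(key: &str, value: &str) -> String {
+    format!("field:tag:{key}={value}")
+}
+
+/// `field:mime:<type>`: set of IDs with this MIME type.
+pub fn mime_key(mime: &str) -> String {
+    format!("field:mime:{mime}")
+}
+
+/// `field:title:<title>`: set of IDs with exactly this title.
+pub fn title_key(title: &str) -> String {
+    format!("field:title:{title}")
+}
+```
+
+For example, `tag_key("project", "osiris")` yields the `field:tag:project=osiris` key shown in the example above.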
+
+---
+
+## 4) Index Maintenance
+
+### Insert / Update
+
+```rust
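+// On update, first remove the ID from the index keys of the previous version
+// (see Delete below); otherwise stale field:* entries remain.
+// Store object
+redis.set(format!("meta:{}", obj.id), serde_json::to_string(&obj)?)?;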
+
+// Index tags
+for (k, v) in &obj.meta.tags {
+    redis.sadd(format!("field:tag:{}={}", k, v), &obj.id)?;
+}
+
+// Index MIME type
+if let Some(mime) = &obj.meta.mime {
+    redis.sadd(format!("field:mime:{}", mime), &obj.id)?;
+}
+
+// Index title
+if let Some(title) = &obj.meta.title {
+    redis.sadd(format!("field:title:{}", title), &obj.id)?;
+}
+
+// Add to scan index
+redis.sadd("scan:index", &obj.id)?;
+```
+
+### Delete
+
+```rust
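+// `obj` must be the stored version (fetch meta:<id> first) so that every
+// index key it occupies can be rebuilt.
+// Remove object
+redis.del(format!("meta:{}", obj.id))?;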
+
+// Deindex tags
+for (k, v) in &obj.meta.tags {
+    redis.srem(format!("field:tag:{}={}", k, v), &obj.id)?;
+}
+
+// Deindex MIME type
+if let Some(mime) = &obj.meta.mime {
+    redis.srem(format!("field:mime:{}", mime), &obj.id)?;
+}
+
+// Deindex title
+if let Some(title) = &obj.meta.title {
+    redis.srem(format!("field:title:{}", title), &obj.id)?;
+}
+
+// Remove from scan index
+redis.srem("scan:index", &obj.id)?;
+```
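+
+When an existing object is updated, its old tag, MIME, or title values may differ from the new ones, so the index keys of the previous version need to be cleaned up. One way to do this, sketched below under the keyspace from section 3, is to compute the set of index keys for the old and new versions and apply only the difference. `index_keys` and `index_diff` are hypothetical helper names; the actual SREM/SADD calls are left to the caller.
+
+```rust
+use std::collections::{BTreeMap, BTreeSet};
+
+/// Index keys an object occupies, following the keyspace in section 3.
+pub fn index_keys(
+    tags: &BTreeMap<String, String>,
+    mime: Option<&str>,
+    title: Option<&str>,
+) -> BTreeSet<String> {
+    let mut keys = BTreeSet::new();
+    for (k, v) in tags {
+        keys.insert(format!("field:tag:{k}={v}"));
+    }
+    if let Some(m) = mime {
+        keys.insert(format!("field:mime:{m}"));
+    }
+    if let Some(t) = title {
+        keys.insert(format!("field:title:{t}"));
+    }
+    keys
+}
+
+/// Returns (keys to SREM the ID from, keys to SADD the ID to), both in sorted order.
+pub fn index_diff(old: &BTreeSet<String>, new: &BTreeSet<String>) -> (Vec<String>, Vec<String>) {
+    let stale = old.difference(new).cloned().collect();
+    let fresh = new.difference(old).cloned().collect();
+    (stale, fresh)
+}
+```
+
+Keys present in both versions are untouched, so an update that changes a single tag costs one SREM and one SADD rather than a full deindex and reindex.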
+
+---
+
+## 5) Retrieval
+
+### Query Structure
+
+```rust
+pub struct RetrievalQuery {
+    pub text: Option<String>,                 // keyword substring
+    pub ns: String,
+    pub filters: Vec<(String, String)>,       // field=value
+    pub top_k: usize,
+}
+```
+
+### Execution Steps
+
+1. **Collect candidate IDs** from field:* filters (SMEMBERS + intersection), or from `scan:index` when no filters are given
+2. **If text query is provided**, iterate over candidates:
+   - Fetch `meta:<id>`
+   - Test substring match on `meta.title`, `text`, or `tags`
+   - Compute simple relevance score
+3. **Sort** by score (descending) and **limit** to `top_k`
+
+The text scan is O(N) in the number of candidates, which is acceptable for an MVP or small datasets (<10k objects).
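+
+Steps 1 and 3 can be sketched with plain standard-library collections. The sketch assumes each filter has already been resolved to a set of IDs (one SMEMBERS result per filter) and that scoring has produced `(id, score)` pairs; `intersect_candidates` and `rank` are hypothetical names, and breaking score ties by ID is an added assumption to keep the output deterministic.
+
+```rust
+use std::collections::HashSet;
+
+/// Intersects the ID sets returned for each filter.
+/// Returns an empty set when no sets are given; the caller would then
+/// fall back to the members of `scan:index`.
+pub fn intersect_candidates(mut sets: Vec<HashSet<String>>) -> HashSet<String> {
+    // Start from the smallest set so the working set only shrinks.
+    sets.sort_by_key(|s| s.len());
+    let mut iter = sets.into_iter();
+    let mut acc = match iter.next() {
+        Some(first) => first,
+        None => return HashSet::new(),
+    };
+    for set in iter {
+        acc.retain(|id| set.contains(id));
+    }
+    acc
+}
+
+/// Sorts hits by descending score (ties broken by ID) and keeps the first `top_k`.
+pub fn rank(mut hits: Vec<(String, f32)>, top_k: usize) -> Vec<(String, f32)> {
+    hits.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
+    hits.truncate(top_k);
+    hits
+}
+```
+
+With the example sets from section 3, intersecting `{note_1, note_2}` and `{note_1, note_3}` leaves only `note_1` as a candidate.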
+
+### Scoring Algorithm
+
+```rust
+fn compute_text_score(obj: &OsirisObject, query: &str) -> f32 {
+    // Lowercase the query once so matching is case-insensitive on both sides
+    let query = query.to_lowercase();
+    let query = query.as_str();
+    if query.is_empty() {
+        return 0.0; // an empty pattern would match every field
+    }
+    let mut score = 0.0_f32;
+
+    // Title match
+    if let Some(title) = &obj.meta.title {
+        if title.to_lowercase().contains(query) {
+            score += 0.5;
+        }
+    }
+
+    // Text content match
+    if let Some(text) = &obj.text {
+        let count = text.to_lowercase().matches(query).count();
+        if count > 0 {
+            score += 0.5;
+            // Bonus for multiple occurrences
+            score += (count as f32 - 1.0) * 0.1;
+        }
+    }
+
+    // Tag match
+    for (key, value) in &obj.meta.tags {
+        if key.to_lowercase().contains(query) || value.to_lowercase().contains(query) {
+            score += 0.2;
+        }
+    }
+
+    // Cap the combined score at 1.0
+    score.min(1.0)
+}
+```
+
+---
+
+## 6) CLI
+
+### Commands
+
+```bash
+# Initialize and create namespace
+osiris init --herodb redis://localhost:6379
+osiris ns create notes
+
+# Add and read objects
+osiris put notes/my-note.md ./my-note.md --tags topic=rust,project=osiris
+osiris get notes/my-note.md
+osiris get notes/my-note.md --raw --output /tmp/note.md
+osiris del notes/my-note.md
+
+# Search
+osiris find --ns notes --filter topic=rust
+osiris find "retrieval" --ns notes
+osiris find "rust" --ns notes --filter project=osiris --topk 20
+
+# Namespace management
+osiris ns list
+osiris ns delete notes
+
+# Statistics
+osiris stats
+osiris stats --ns notes
+```
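+
+The `--tags key=value,key=value` and `<namespace>/<name>` argument shapes used above can be parsed with a few lines of standard-library code. This is only a sketch of one possible approach: `parse_tags` and `parse_path` are hypothetical names, and the choices to trim whitespace, skip empty segments, and reject pairs without `=` are assumptions rather than documented CLI behavior.
+
+```rust
+use std::collections::BTreeMap;
+
+/// Parses `topic=rust,project=osiris` into a tag map; rejects pairs without `=`.
+pub fn parse_tags(raw: &str) -> Result<BTreeMap<String, String>, String> {
+    let mut tags = BTreeMap::new();
+    for pair in raw.split(',').filter(|p| !p.trim().is_empty()) {
+        let (k, v) = pair
+            .split_once('=')
+            .ok_or_else(|| format!("expected key=value, got `{pair}`"))?;
+        tags.insert(k.trim().to_string(), v.trim().to_string());
+    }
+    Ok(tags)
+}
+
+/// Splits `notes/my-note.md` into (namespace, name) at the first `/`.
+pub fn parse_path(path: &str) -> Option<(&str, &str)> {
+    match path.split_once('/') {
+        Some((ns, name)) if !ns.is_empty() && !name.is_empty() => Some((ns, name)),
+        _ => None,
+    }
+}
+```
+
+Returning a `BTreeMap` matches the `tags` field of `Metadata`, so the parsed value can be stored without conversion.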
+
+### Examples
+
+```bash
+# Store a note from stdin
+echo "This is a note about Rust programming" | \
+  osiris put notes/rust-intro - \
+  --title "Rust Introduction" \
+  --tags topic=rust,level=beginner \
+  --mime text/plain
+
+# Search for notes about Rust
+osiris find "rust" --ns notes
+
+# Filter by tag
+osiris find --ns notes --filter topic=rust
+
+# Get note as JSON
+osiris get notes/rust-intro
+
+# Get raw content
+osiris get notes/rust-intro --raw
+```
+
+---
+
+## 7) Configuration
+
+### File Location
+
+`~/.config/osiris/config.toml`
+
+### Example
+
+```toml
+[herodb]
+url = "redis://localhost:6379"
+
+[namespaces.notes]
+db_id = 1
+
+[namespaces.calendar]
+db_id = 2
+```
+
+### Structure
+
+```rust
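+use std::collections::HashMap;
+
+use serde::{Deserialize, Serialize};
+
+// Mirrors the layout of config.toml shown above
+#[derive(Clone, Debug, Serialize, Deserialize)]
+pub struct Config {
+    pub herodb: HeroDbConfig,
+    pub namespaces: HashMap<String, NamespaceConfig>,
+}
+
+#[derive(Clone, Debug, Serialize, Deserialize)]
+pub struct HeroDbConfig {
+    pub url: String,
+}
+
+#[derive(Clone, Debug, Serialize, Deserialize)]
+pub struct NamespaceConfig {
+    pub db_id: u16,
+}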
+```
+
+---
+
+## 8) Database Allocation
+
+```
+DB 0  → HeroDB Admin (managed by HeroDB)
+DB 1  → osiris:notes (namespace "notes")
+DB 2  → osiris:calendar (namespace "calendar")
+DB 3+ → Additional namespaces...
+```
+
+Each namespace gets its own isolated HeroDB database.
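+
+Creating a namespace therefore requires picking a database ID that is not yet in use. The spec does not state the allocation policy, so the sketch below assumes the simplest one: take the smallest free ID starting at 1, since DB 0 is reserved for HeroDB administration. `next_db_id` is a hypothetical helper, and the map stands in for the `namespaces` table from the configuration.
+
+```rust
+use std::collections::HashMap;
+
+/// Smallest database ID >= 1 that no namespace uses yet.
+/// Returns `None` only if every ID in 1..=u16::MAX is taken.
+pub fn next_db_id(namespaces: &HashMap<String, u16>) -> Option<u16> {
+    (1..=u16::MAX).find(|id| !namespaces.values().any(|used| used == id))
+}
+```
+
+Under this assumed policy, IDs freed by deleted namespaces are reused; a monotonically increasing counter would be the alternative if reuse is undesirable.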
+
+---
+
+## 9) Dependencies
+
+```toml
+[dependencies]
+anyhow = "1.0"
+redis = { version = "0.24", features = ["aio", "tokio-comp"] }
+serde = { version = "1.0", features = ["derive"] }
+serde_json = "1.0"
+time = { version = "0.3", features = ["serde", "formatting", "parsing", "macros"] }
+tokio = { version = "1.23", features = ["full"] }
+clap = { version = "4.5", features = ["derive"] }
+toml = "0.8"
+uuid = { version = "1.6", features = ["v4", "serde"] }
+tracing = "0.1"
+tracing-subscriber = { version = "0.3", features = ["env-filter"] }
+```
+
+---
+
+## 10) Future Enhancements
+
+| Feature | Added Via | Lives In |
+|---------|-----------|-------------|
+| Dedup / blobs | HeroDB extension | HeroDB |
+| Vector search | HeroDB extension | HeroDB |
+| Full-text search | HeroDB (Tantivy) | HeroDB |
+| Relations / graph | OSIRIS later | OSIRIS |
+| 9P filesystem | OSIRIS later | OSIRIS |
+
+This MVP maintains clean interface boundaries:
+- **HeroDB** remains a plain KV substrate
+- **OSIRIS** builds higher-order meaning on top
+
+---
+
+## 11) Implementation Status
+
+### ✅ Completed
+
+- [x] Project structure and Cargo.toml
+- [x] Core data models (OsirisObject, Metadata)
+- [x] HeroDB client wrapper (RESP protocol)
+- [x] Field indexing (tags, MIME, title)
+- [x] Search engine (substring matching + scoring)
+- [x] Configuration management
+- [x] CLI interface (init, ns, put, get, del, find, stats)
+- [x] Error handling
+- [x] Documentation (README, specs)
+
+### 🚧 Pending
+
+- [ ] 9P filesystem interface
+- [ ] Integration tests
+- [ ] Performance benchmarks
+- [ ] Name resolution (namespace/name → ID mapping)
+
+---
+
+## 12) Quick Start
+
+### Prerequisites
+
+Start HeroDB:
+```bash
+cd /path/to/herodb
+cargo run --release -- --dir ./data --admin-secret mysecret --port 6379
+```
+
+### Build OSIRIS
+
+```bash
+cd /path/to/osiris
+cargo build --release
+```
+
+### Initialize
+
+```bash
+# Create configuration
+./target/release/osiris init --herodb redis://localhost:6379
+
+# Create a namespace
+./target/release/osiris ns create notes
+```
+
+### Usage
+
+```bash
+# Add a note
+echo "OSIRIS is a minimal object store" | \
+  ./target/release/osiris put notes/intro - \
+  --title "Introduction" \
+  --tags topic=osiris,type=doc
+
+# Search
+./target/release/osiris find "object store" --ns notes
+
+# Get the note
+./target/release/osiris get notes/intro
+
+# Show stats
+./target/release/osiris stats --ns notes
+```
+
+---
+
+## 13) Testing
+
+### Unit Tests
+
+```bash
+cargo test
+```
+
+### Integration Tests (requires HeroDB; still pending, see section 11)
+
+```bash
+# Start HeroDB
+cd /path/to/herodb
+cargo run -- --dir /tmp/herodb-test --admin-secret test --port 6379
+
+# Run tests
+cd /path/to/osiris
+cargo test -- --ignored
+```
+
+---
+
+## 14) Performance Characteristics
+
+### Write Performance
+
+- **Object storage**: O(1) - single SET operation
+- **Indexing**: O(T) where T = number of tags/fields
+- **Total**: O(T) per object
+
+### Read Performance
+
+- **Get by ID**: O(1) - single GET operation
+- **Filter by tags**: O(S) where S = total size of the ID sets fetched for the filters (SMEMBERS + client-side intersection)
+- **Text search**: O(N) where N = number of candidates (linear scan)
+
+### Storage Overhead
+
+- **Object**: ~1KB per object (JSON serialized)
+- **Indexes**: ~50 bytes per tag/field entry
+- **Total**: ~1.5KB per object with 10 tags
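+
+The total follows from the two estimates above: 1,000 bytes for the object plus 10 × 50 bytes of index entries. The same arithmetic can be applied to a whole namespace, as in this small sketch. The constants simply restate the rough figures from this section, and `estimate_bytes` is a hypothetical helper that ignores `scan:index` membership and any backend-specific overhead.
+
+```rust
+/// Rough per-object payload size from this section (~1KB of JSON).
+pub const OBJECT_BYTES: u64 = 1_000;
+/// Rough size of one index entry from this section (~50 bytes).
+pub const INDEX_ENTRY_BYTES: u64 = 50;
+
+/// Estimated bytes for `objects` objects with `entries_per_object` index entries each.
+pub fn estimate_bytes(objects: u64, entries_per_object: u64) -> u64 {
+    objects * (OBJECT_BYTES + entries_per_object * INDEX_ENTRY_BYTES)
+}
+```
+
+By this estimate, a namespace of 10,000 objects with 10 tags each needs about 15 MB.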
+
+### Scalability
+
+- **Optimal**: <10,000 objects per namespace
+- **Acceptable**: <100,000 objects per namespace
+- **Beyond**: Consider migrating to Tantivy FTS
+
+---
+
+## 15) Design Decisions
+
+### Why No Tantivy in MVP?
+
+- **Simplicity**: Avoid HeroDB server-side dependencies
+- **Portability**: Works with any Redis-compatible backend
+- **Flexibility**: Easy to migrate to Tantivy later
+
+### Why Substring Matching?
+
+- **Good enough**: For small datasets (<10k objects)
+- **Simple**: No tokenization, stemming, or complex scoring
+- **Fast enough**: An O(N) scan is acceptable at MVP scale
+
+### Why Separate Databases per Namespace?
+
+- **Isolation**: Clear separation of concerns
+- **Performance**: Smaller keyspaces = faster scans
+- **Security**: Can apply different encryption keys per namespace
+
+---
+
+## 16) Migration Path
+
+When ready to scale beyond MVP:
+
+1. **Add Tantivy FTS** (HeroDB extension)
+   - Create FT.* commands in HeroDB
+   - Update OSIRIS to use FT.SEARCH instead of substring scan
+   - Keep field indexes for filtering
+
+2. **Add Vector Search** (HeroDB extension)
+   - Store embeddings in HeroDB
+   - Implement ANN search (HNSW/IVF)
+   - Add hybrid retrieval (BM25 + vector)
+
+3. **Add Relations** (OSIRIS feature)
+   - Store relation graphs in HeroDB
+   - Implement graph traversal
+   - Add relation-based ranking
+
+4. **Add Deduplication** (HeroDB extension)
+   - Content-addressable storage (BLAKE3)
+   - Reference counting
+   - Garbage collection
+
+---
+
+## Summary
+
+**OSIRIS MVP is a minimal object store** that:
+
+- ✅ Works with unmodified HeroDB
+- ✅ Provides structured storage with metadata
+- ✅ Supports field-based filtering
+- ✅ Includes basic text search
+- ✅ Exposes a clean CLI interface
+- ✅ Maintains clear upgrade paths
+
+**Perfect for:**
+- Personal knowledge management
+- Small-scale document storage
+- Prototyping semantic applications
+- Learning Rust + Redis patterns
+
+**Next steps:**
+- Build and test the MVP
+- Gather usage feedback
+- Plan Tantivy/vector integration
+- Design 9P filesystem interface