fixed a few bugs related to vector embedding + added end-to-end documentation showcasing local and external embedders step by step + added an example mock embedder Python script

docs/local_embedder_full_example.md (new file, 988 lines)

@@ -0,0 +1,988 @@

# HeroDB Embedding Models: Complete Tutorial

This tutorial demonstrates how to use embedding models with HeroDB for vector search, covering both local self-hosted models and OpenAI's API.

## Table of Contents

- [Prerequisites](#prerequisites)
- [Scenario 1: Local Embedding Model](#scenario-1-local-embedding-model-testing)
- [Scenario 2: OpenAI API](#scenario-2-openai-api)
- [Scenario 3: Deterministic Test Embedder](#scenario-3-deterministic-test-embedder-no-network)
- [Troubleshooting](#troubleshooting)

---

## Prerequisites

### Start HeroDB Server

Build and start HeroDB with RPC enabled:

```bash
cargo build --release
./target/release/herodb --dir ./data --admin-secret my-admin-secret --enable-rpc --rpc-port 8080
```

This starts:
- Redis-compatible server on port 6379
- JSON-RPC server on port 8080
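
To confirm the Redis-compatible port is up before continuing, a plain `PING` is enough (standard redis-cli behavior; nothing HeroDB-specific is assumed here):

```bash
redis-cli -p 6379 PING
# Expected reply: PONG
```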

### Client Tools

For Redis-like commands:
```bash
redis-cli -p 6379
```

For JSON-RPC calls, use `curl`:
```bash
curl -X POST http://localhost:8080 \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","id":1,"method":"herodb_METHOD","params":[...]}'
```

---

## Scenario 1: Local Embedding Model (Testing)

Run your own embedding service locally for development, testing, or privacy.

### Option A: Python Mock Server (Simplest)

This creates a minimal OpenAI-compatible embedding server for testing.

**1. Create `mock_embedder.py`:**

```python
from flask import Flask, request, jsonify
import numpy as np

app = Flask(__name__)

@app.route('/v1/embeddings', methods=['POST'])
def embeddings():
    """OpenAI-compatible embeddings endpoint"""
    data = request.json
    inputs = data.get('input', [])

    # Handle both single string and array
    if isinstance(inputs, str):
        inputs = [inputs]

    # Generate deterministic 768-dim embeddings (hash-based)
    embeddings = []
    for text in inputs:
        # Simple hash to vector (deterministic)
        vec = np.zeros(768)
        for i, char in enumerate(text[:768]):
            vec[i % 768] += ord(char) / 255.0

        # L2 normalize
        norm = np.linalg.norm(vec)
        if norm > 0:
            vec = vec / norm

        embeddings.append(vec.tolist())

    return jsonify({
        "data": [{"embedding": emb, "index": i} for i, emb in enumerate(embeddings)],
        "model": data.get('model', 'mock-local'),
        "usage": {"total_tokens": sum(len(t) for t in inputs)}
    })

if __name__ == '__main__':
    print("Starting mock embedding server on http://127.0.0.1:8081")
    app.run(host='127.0.0.1', port=8081, debug=False)
```

**2. Install dependencies and run:**

```bash
pip install flask numpy
python mock_embedder.py
```

Output: `Starting mock embedding server on http://127.0.0.1:8081`

**3. Test the server (optional):**

```bash
curl -X POST http://127.0.0.1:8081/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"input":["hello world"],"model":"test"}'
```

You should see a JSON response with a 768-dimensional embedding.
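
Abbreviated, the mock server's reply looks roughly like this (the shape follows from the script above; the embedding values depend on the input text and are truncated here):

```json
{
  "data": [
    { "embedding": [0.04, 0.04, "...766 more values..."], "index": 0 }
  ],
  "model": "test",
  "usage": { "total_tokens": 11 }
}
```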

### End-to-End Example with Local Model

**Step 1: Create a Lance database**

JSON-RPC:
```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "herodb_createDatabase",
  "params": [
    "Lance",
    { "name": "local-vectors", "storage_path": null, "max_size": null, "redis_version": null },
    null
  ]
}
```

Expected response:
```json
{"jsonrpc":"2.0","id":1,"result":1}
```

The database ID is `1`.

**Step 2: Configure embedding for the dataset**

JSON-RPC:
```json
{
  "jsonrpc": "2.0",
  "id": 2,
  "method": "herodb_lanceSetEmbeddingConfig",
  "params": [
    1,
    "products",
    {
      "provider": "openai",
      "model": "mock-local",
      "dim": 768,
      "endpoint": "http://127.0.0.1:8081/v1/embeddings",
      "headers": {
        "Authorization": "Bearer dummy"
      },
      "timeout_ms": 30000
    }
  ]
}
```

Redis-like:
```bash
redis-cli -p 6379
SELECT 1
LANCE.EMBEDDING CONFIG SET products PROVIDER openai MODEL mock-local DIM 768 ENDPOINT http://127.0.0.1:8081/v1/embeddings HEADER Authorization "Bearer dummy" TIMEOUTMS 30000
```

Expected response:
```json
{"jsonrpc":"2.0","id":2,"result":true}
```

**Step 3: Verify configuration**

JSON-RPC:
```json
{
  "jsonrpc": "2.0",
  "id": 3,
  "method": "herodb_lanceGetEmbeddingConfig",
  "params": [1, "products"]
}
```

Redis-like:
```bash
LANCE.EMBEDDING CONFIG GET products
```

Expected: Returns your configuration with provider, model, dim, endpoint, etc.
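
For reference, the response should roughly echo the values set in Step 2 (illustrative; the exact field names and ordering HeroDB returns may differ):

```json
{
  "jsonrpc": "2.0",
  "id": 3,
  "result": {
    "provider": "openai",
    "model": "mock-local",
    "dim": 768,
    "endpoint": "http://127.0.0.1:8081/v1/embeddings",
    "headers": { "Authorization": "Bearer dummy" },
    "timeout_ms": 30000
  }
}
```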

**Step 4: Insert product data**

JSON-RPC (item 1):
```json
{
  "jsonrpc": "2.0",
  "id": 4,
  "method": "herodb_lanceStoreText",
  "params": [
    1,
    "products",
    "item-1",
    "Waterproof hiking boots with ankle support and aggressive tread",
    { "brand": "TrailMax", "category": "footwear", "price": "129.99" }
  ]
}
```

Redis-like:
```bash
LANCE.STORE products ID item-1 TEXT "Waterproof hiking boots with ankle support and aggressive tread" META brand TrailMax category footwear price 129.99
```

JSON-RPC (item 2):
```json
{
  "jsonrpc": "2.0",
  "id": 5,
  "method": "herodb_lanceStoreText",
  "params": [
    1,
    "products",
    "item-2",
    "Lightweight running shoes with breathable mesh upper",
    { "brand": "SpeedFit", "category": "footwear", "price": "89.99" }
  ]
}
```

JSON-RPC (item 3):
```json
{
  "jsonrpc": "2.0",
  "id": 6,
  "method": "herodb_lanceStoreText",
  "params": [
    1,
    "products",
    "item-3",
    "Insulated winter jacket with removable hood and multiple pockets",
    { "brand": "WarmTech", "category": "outerwear", "price": "199.99" }
  ]
}
```

JSON-RPC (item 4):
```json
{
  "jsonrpc": "2.0",
  "id": 7,
  "method": "herodb_lanceStoreText",
  "params": [
    1,
    "products",
    "item-4",
    "Camping tent for 4 people with waterproof rainfly",
    { "brand": "OutdoorPro", "category": "camping", "price": "249.99" }
  ]
}
```

Expected response for each: `{"jsonrpc":"2.0","id":N,"result":true}`

**Step 5: Search by text query**

JSON-RPC:
```json
{
  "jsonrpc": "2.0",
  "id": 8,
  "method": "herodb_lanceSearchText",
  "params": [
    1,
    "products",
    "boots for hiking in wet conditions",
    3,
    null,
    ["brand", "category", "price"]
  ]
}
```

Redis-like:
```bash
LANCE.SEARCH products K 3 QUERY "boots for hiking in wet conditions" RETURN 3 brand category price
```

Expected response:
```json
{
  "jsonrpc": "2.0",
  "id": 8,
  "result": {
    "results": [
      {
        "id": "item-1",
        "score": 0.234,
        "meta": {
          "brand": "TrailMax",
          "category": "footwear",
          "price": "129.99"
        }
      },
      ...
    ]
  }
}
```

**Step 6: Search with metadata filter**

JSON-RPC:
```json
{
  "jsonrpc": "2.0",
  "id": 9,
  "method": "herodb_lanceSearchText",
  "params": [
    1,
    "products",
    "comfortable shoes for running",
    5,
    "category = 'footwear'",
    null
  ]
}
```

Redis-like:
```bash
LANCE.SEARCH products K 5 QUERY "comfortable shoes for running" FILTER "category = 'footwear'"
```

This returns only items where `category` equals `'footwear'`.

**Step 7: List datasets**

JSON-RPC:
```json
{
  "jsonrpc": "2.0",
  "id": 10,
  "method": "herodb_lanceList",
  "params": [1]
}
```

Redis-like:
```bash
LANCE.LIST
```
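
The result should include the dataset created above, for example (illustrative shape):

```json
{"jsonrpc":"2.0","id":10,"result":["products"]}
```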

**Step 8: Get dataset info**

JSON-RPC:
```json
{
  "jsonrpc": "2.0",
  "id": 11,
  "method": "herodb_lanceInfo",
  "params": [1, "products"]
}
```

Redis-like:
```bash
LANCE.INFO products
```

Returns dimension, row count, and other metadata.
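
An illustrative response after the four inserts above (the field names here are assumptions for illustration; HeroDB's actual keys may differ):

```json
{
  "jsonrpc": "2.0",
  "id": 11,
  "result": { "dataset": "products", "dimension": 768, "rows": 4 }
}
```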

---

## Scenario 2: OpenAI API

Use OpenAI's production embedding service for semantic search.

### Setup

**1. Set your API key:**

```bash
export OPENAI_API_KEY="sk-your-actual-openai-key-here"
```
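
Optionally, sanity-check the key directly against OpenAI's embeddings endpoint before involving HeroDB (this is the standard OpenAI API call; the model name matches the one configured below):

```bash
curl https://api.openai.com/v1/embeddings \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"input": "ping", "model": "text-embedding-3-small"}'
```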

**2. Start HeroDB** (same as before):

```bash
./target/release/herodb --dir ./data --admin-secret my-admin-secret --enable-rpc --rpc-port 8080
```

### End-to-End Example with OpenAI

**Step 1: Create a Lance database**

JSON-RPC:
```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "herodb_createDatabase",
  "params": [
    "Lance",
    { "name": "openai-vectors", "storage_path": null, "max_size": null, "redis_version": null },
    null
  ]
}
```

Expected: `{"jsonrpc":"2.0","id":1,"result":1}` (database ID = 1)

**Step 2: Configure OpenAI embeddings**

JSON-RPC:
```json
{
  "jsonrpc": "2.0",
  "id": 2,
  "method": "herodb_lanceSetEmbeddingConfig",
  "params": [
    1,
    "documents",
    {
      "provider": "openai",
      "model": "text-embedding-3-small",
      "dim": 1536,
      "endpoint": null,
      "headers": {},
      "timeout_ms": 30000
    }
  ]
}
```

Redis-like:
```bash
redis-cli -p 6379
SELECT 1
LANCE.EMBEDDING CONFIG SET documents PROVIDER openai MODEL text-embedding-3-small DIM 1536 TIMEOUTMS 30000
```

Notes:
- `endpoint` is `null` (defaults to OpenAI API: https://api.openai.com/v1/embeddings)
- `headers` is empty (Authorization auto-added from OPENAI_API_KEY env var)
- `dim` is 1536 for text-embedding-3-small

Expected: `{"jsonrpc":"2.0","id":2,"result":true}`

**Step 3: Insert documents**

JSON-RPC:
```json
{
  "jsonrpc": "2.0",
  "id": 3,
  "method": "herodb_lanceStoreText",
  "params": [
    1,
    "documents",
    "doc-1",
    "The quick brown fox jumps over the lazy dog",
    { "source": "example", "lang": "en", "topic": "animals" }
  ]
}
```

```json
{
  "jsonrpc": "2.0",
  "id": 4,
  "method": "herodb_lanceStoreText",
  "params": [
    1,
    "documents",
    "doc-2",
    "Machine learning models require large datasets for training and validation",
    { "source": "tech", "lang": "en", "topic": "ai" }
  ]
}
```

```json
{
  "jsonrpc": "2.0",
  "id": 5,
  "method": "herodb_lanceStoreText",
  "params": [
    1,
    "documents",
    "doc-3",
    "Python is a popular programming language for data science and web development",
    { "source": "tech", "lang": "en", "topic": "programming" }
  ]
}
```

Redis-like:
```bash
LANCE.STORE documents ID doc-1 TEXT "The quick brown fox jumps over the lazy dog" META source example lang en topic animals
LANCE.STORE documents ID doc-2 TEXT "Machine learning models require large datasets for training and validation" META source tech lang en topic ai
LANCE.STORE documents ID doc-3 TEXT "Python is a popular programming language for data science and web development" META source tech lang en topic programming
```

Expected for each: `{"jsonrpc":"2.0","id":N,"result":true}`

**Step 4: Semantic search**

JSON-RPC:
```json
{
  "jsonrpc": "2.0",
  "id": 6,
  "method": "herodb_lanceSearchText",
  "params": [
    1,
    "documents",
    "artificial intelligence and neural networks",
    3,
    null,
    ["source", "topic"]
  ]
}
```

Redis-like:
```bash
LANCE.SEARCH documents K 3 QUERY "artificial intelligence and neural networks" RETURN 2 source topic
```

Expected response (doc-2 should rank highest due to semantic similarity):
```json
{
  "jsonrpc": "2.0",
  "id": 6,
  "result": {
    "results": [
      {
        "id": "doc-2",
        "score": 0.123,
        "meta": {
          "source": "tech",
          "topic": "ai"
        }
      },
      {
        "id": "doc-3",
        "score": 0.456,
        "meta": {
          "source": "tech",
          "topic": "programming"
        }
      },
      {
        "id": "doc-1",
        "score": 0.789,
        "meta": {
          "source": "example",
          "topic": "animals"
        }
      }
    ]
  }
}
```

Note: Lower score = better match (L2 distance).

**Step 5: Search with filter**

JSON-RPC:
```json
{
  "jsonrpc": "2.0",
  "id": 7,
  "method": "herodb_lanceSearchText",
  "params": [
    1,
    "documents",
    "programming and software",
    5,
    "topic = 'programming'",
    null
  ]
}
```

Redis-like:
```bash
LANCE.SEARCH documents K 5 QUERY "programming and software" FILTER "topic = 'programming'"
```

This returns only documents where `topic` equals `'programming'`.

---

## Scenario 2 (Additional Example): OpenAI API with an Articles Dataset

A second end-to-end OpenAI walkthrough, this time indexing a small set of articles.

### Setup

**1. Set your OpenAI API key:**

```bash
export OPENAI_API_KEY="sk-your-actual-openai-key-here"
```

**2. Start HeroDB:**

```bash
./target/release/herodb --dir ./data --admin-secret my-admin-secret --enable-rpc --rpc-port 8080
```

### Complete Workflow

**Step 1: Create database**

JSON-RPC:
```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "herodb_createDatabase",
  "params": [
    "Lance",
    { "name": "openai-docs", "storage_path": null, "max_size": null, "redis_version": null },
    null
  ]
}
```

**Step 2: Configure OpenAI embeddings**

JSON-RPC:
```json
{
  "jsonrpc": "2.0",
  "id": 2,
  "method": "herodb_lanceSetEmbeddingConfig",
  "params": [
    1,
    "articles",
    {
      "provider": "openai",
      "model": "text-embedding-3-small",
      "dim": 1536,
      "endpoint": null,
      "headers": {},
      "timeout_ms": 30000
    }
  ]
}
```

Redis-like:
```bash
SELECT 1
LANCE.EMBEDDING CONFIG SET articles PROVIDER openai MODEL text-embedding-3-small DIM 1536
```

**Step 3: Insert articles**

JSON-RPC:
```json
{
  "jsonrpc": "2.0",
  "id": 3,
  "method": "herodb_lanceStoreText",
  "params": [
    1,
    "articles",
    "article-1",
    "Climate change is affecting global weather patterns and ecosystems",
    { "category": "environment", "author": "Jane Smith", "year": "2024" }
  ]
}
```

```json
{
  "jsonrpc": "2.0",
  "id": 4,
  "method": "herodb_lanceStoreText",
  "params": [
    1,
    "articles",
    "article-2",
    "Quantum computing promises to revolutionize cryptography and drug discovery",
    { "category": "technology", "author": "John Doe", "year": "2024" }
  ]
}
```

```json
{
  "jsonrpc": "2.0",
  "id": 5,
  "method": "herodb_lanceStoreText",
  "params": [
    1,
    "articles",
    "article-3",
    "Renewable energy sources like solar and wind are becoming more cost-effective",
    { "category": "environment", "author": "Alice Johnson", "year": "2023" }
  ]
}
```

**Step 4: Semantic search**

JSON-RPC:
```json
{
  "jsonrpc": "2.0",
  "id": 6,
  "method": "herodb_lanceSearchText",
  "params": [
    1,
    "articles",
    "environmental sustainability and green energy",
    2,
    null,
    ["category", "author"]
  ]
}
```

Redis-like:
```bash
LANCE.SEARCH articles K 2 QUERY "environmental sustainability and green energy" RETURN 2 category author
```

Expected: Returns article-1 and article-3 (both environment-related).

**Step 5: Filtered search**

JSON-RPC:
```json
{
  "jsonrpc": "2.0",
  "id": 7,
  "method": "herodb_lanceSearchText",
  "params": [
    1,
    "articles",
    "new technology innovations",
    5,
    "category = 'technology'",
    null
  ]
}
```

---

## Scenario 3: Deterministic Test Embedder (No Network)

For CI/offline development, use the built-in test embedder that requires no external service.

### Configuration

JSON-RPC:
```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "herodb_lanceSetEmbeddingConfig",
  "params": [
    1,
    "testdata",
    {
      "provider": "test",
      "model": "dev",
      "dim": 64,
      "endpoint": null,
      "headers": {},
      "timeout_ms": null
    }
  ]
}
```

Redis-like:
```bash
SELECT 1
LANCE.EMBEDDING CONFIG SET testdata PROVIDER test MODEL dev DIM 64
```

### Usage

Use `lanceStoreText` and `lanceSearchText` as in previous scenarios. The embeddings are:
- Deterministic (same text → same vector)
- Fast (no network)
- Not semantic (hash-based, not ML)

Perfect for testing the vector storage/search mechanics without external dependencies.
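
A minimal round trip against the `testdata` dataset, using the same Redis-like command syntax shown in the earlier scenarios (the ID and metadata values here are arbitrary examples):

```bash
SELECT 1
LANCE.STORE testdata ID t-1 TEXT "hello world" META kind demo
LANCE.SEARCH testdata K 1 QUERY "hello world" RETURN 1 kind
```

Because the test embedder is deterministic, searching for the exact stored text should return `t-1` as the top result.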

---

## Advanced: Custom Headers and Timeouts

### Example: Local model with custom auth

JSON-RPC:
```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "herodb_lanceSetEmbeddingConfig",
  "params": [
    1,
    "secure-data",
    {
      "provider": "openai",
      "model": "custom-model",
      "dim": 512,
      "endpoint": "http://192.168.1.100:9000/embeddings",
      "headers": {
        "Authorization": "Bearer my-local-token",
        "X-Custom-Header": "value"
      },
      "timeout_ms": 60000
    }
  ]
}
```

### Example: OpenAI with explicit API key (not from env)

JSON-RPC:
```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "herodb_lanceSetEmbeddingConfig",
  "params": [
    1,
    "dataset",
    {
      "provider": "openai",
      "model": "text-embedding-3-small",
      "dim": 1536,
      "endpoint": null,
      "headers": {
        "Authorization": "Bearer sk-your-key-here"
      },
      "timeout_ms": 30000
    }
  ]
}
```

---

## Troubleshooting

### Error: "Embedding config not set for dataset"

**Cause:** You tried to use `lanceStoreText` or `lanceSearchText` without configuring an embedder.

**Solution:** Run `lanceSetEmbeddingConfig` first.
### Error: "Embedding dimension mismatch: expected X, got Y"
|
||||
|
||||
**Cause:** The embedding service returned vectors of a different size than configured.
|
||||
|
||||
**Solution:**
|
||||
- For OpenAI text-embedding-3-small, use `dim: 1536`
|
||||
- For your local mock (from this tutorial), use `dim: 768`
|
||||
- Check your embedding service's actual output dimension
|
||||
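
One way to check is to call the service directly and count the values in the returned embedding; this one-liner assumes an OpenAI-compatible response shape like the mock server above:

```bash
curl -s -X POST http://127.0.0.1:8081/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"input":["dimension check"],"model":"test"}' \
  | python3 -c "import json,sys; print(len(json.load(sys.stdin)['data'][0]['embedding']))"
```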

### Error: "Missing API key in env 'OPENAI_API_KEY'"

**Cause:** Using OpenAI provider without setting the API key.

**Solution:**
- Set `export OPENAI_API_KEY="sk-..."` before starting HeroDB, OR
- Pass the key explicitly in headers: `"Authorization": "Bearer sk-..."`
### Error: "HTTP request failed" or "Embeddings API error 404"
|
||||
|
||||
**Cause:** Cannot reach the embedding endpoint.
|
||||
|
||||
**Solution:**
|
||||
- Verify your local server is running: `curl http://127.0.0.1:8081/v1/embeddings`
|
||||
- Check the endpoint URL in your config
|
||||
- Ensure firewall allows the connection
|
||||
|
||||
### Error: "ERR DB backend is not Lance"
|
||||
|
||||
**Cause:** Trying to use LANCE.* commands on a non-Lance database.
|
||||
|
||||
**Solution:** Create the database with backend "Lance" (see Step 1).
|
||||
|
||||
### Error: "write permission denied"
|
||||
|
||||
**Cause:** Database is private and you haven't authenticated.
|
||||
|
||||
**Solution:** Use `SELECT <db_id> KEY <access-key>` or make the database public via RPC.
|
||||
|
||||
---
|
||||
|
||||

## Complete Example Script (Bash + curl)

Save as `test_embeddings.sh`:

```bash
#!/bin/bash

RPC_URL="http://localhost:8080"

# 1. Create Lance database
curl -X POST $RPC_URL -H "Content-Type: application/json" -d '{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "herodb_createDatabase",
  "params": ["Lance", {"name": "test-vectors", "storage_path": null, "max_size": null, "redis_version": null}, null]
}'

echo -e "\n"

# 2. Configure local embedder
curl -X POST $RPC_URL -H "Content-Type: application/json" -d '{
  "jsonrpc": "2.0",
  "id": 2,
  "method": "herodb_lanceSetEmbeddingConfig",
  "params": [1, "products", {
    "provider": "openai",
    "model": "mock",
    "dim": 768,
    "endpoint": "http://127.0.0.1:8081/v1/embeddings",
    "headers": {"Authorization": "Bearer dummy"},
    "timeout_ms": 30000
  }]
}'

echo -e "\n"

# 3. Insert data
curl -X POST $RPC_URL -H "Content-Type: application/json" -d '{
  "jsonrpc": "2.0",
  "id": 3,
  "method": "herodb_lanceStoreText",
  "params": [1, "products", "item-1", "Hiking boots", {"brand": "TrailMax"}]
}'

echo -e "\n"

# 4. Search
curl -X POST $RPC_URL -H "Content-Type: application/json" -d '{
  "jsonrpc": "2.0",
  "id": 4,
  "method": "herodb_lanceSearchText",
  "params": [1, "products", "outdoor footwear", 5, null, null]
}'

echo -e "\n"
```

Run:
```bash
chmod +x test_embeddings.sh
./test_embeddings.sh
```

---

## Summary

| Provider | Use Case | Endpoint | API Key |
|----------|----------|----------|---------|
| `openai` | Production semantic search | Default (OpenAI) or custom URL | OPENAI_API_KEY env or headers |
| `openai` | Local self-hosted gateway | http://127.0.0.1:8081/... | Optional (depends on your service) |
| `test` | CI/offline development | N/A (local hash) | None |
| `image_test` | Image testing | N/A (local hash) | None |

**Notes:**
- The `provider` field is `"openai"` for any OpenAI-compatible service, cloud or local, because it refers to the API shape rather than the vendor
- Use `endpoint` to point to your local service
- Use `headers` for custom authentication
- `dim` must match your embedding service's output dimension
- Once configured, `lanceStoreText` and `lanceSearchText` handle embedding automatically