# HeroDB Embedding Models: Complete Tutorial

This tutorial demonstrates how to use embedding models with HeroDB for vector search, covering both local self-hosted models and OpenAI's API.

## Table of Contents

- [Prerequisites](#prerequisites)
- [Scenario 1: Local Embedding Model](#scenario-1-local-embedding-model-testing)
- [Scenario 2: OpenAI API](#scenario-2-openai-api)
- [Scenario 3: Deterministic Test Embedder](#scenario-3-deterministic-test-embedder-no-network)
- [Troubleshooting](#troubleshooting)

---

## Prerequisites

### Start HeroDB Server

Build and start HeroDB with RPC enabled:

```bash
cargo build --release
./target/release/herodb --dir ./data --admin-secret my-admin-secret --enable-rpc --rpc-port 8080
```

This starts:

- Redis-compatible server on port 6379
- JSON-RPC server on port 8080

### Client Tools

For Redis-like commands:

```bash
redis-cli -p 6379
```

For JSON-RPC calls, use `curl`:

```bash
curl -X POST http://localhost:8080 \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","id":1,"method":"herodb_METHOD","params":[...]}'
```

---

## Scenario 1: Local Embedding Model (Testing)

Run your own embedding service locally for development, testing, or privacy.

### Option A: Python Mock Server (Simplest)

This creates a minimal OpenAI-compatible embedding server for testing.

**1. Create `mock_embedder.py`:**

```python
from flask import Flask, request, jsonify
import numpy as np

app = Flask(__name__)

@app.route('/v1/embeddings', methods=['POST'])
def embeddings():
    """OpenAI-compatible embeddings endpoint"""
    data = request.json
    inputs = data.get('input', [])

    # Handle both single string and array
    if isinstance(inputs, str):
        inputs = [inputs]

    # Generate deterministic 768-dim embeddings (hash-based)
    embeddings = []
    for text in inputs:
        # Simple hash to vector (deterministic)
        vec = np.zeros(768)
        for i, char in enumerate(text[:768]):
            vec[i % 768] += ord(char) / 255.0
        # L2 normalize
        norm = np.linalg.norm(vec)
        if norm > 0:
            vec = vec / norm
        embeddings.append(vec.tolist())

    return jsonify({
        "data": [{"embedding": emb, "index": i} for i, emb in enumerate(embeddings)],
        "model": data.get('model', 'mock-local'),
        "usage": {"total_tokens": sum(len(t) for t in inputs)}
    })

if __name__ == '__main__':
    print("Starting mock embedding server on http://127.0.0.1:8081")
    app.run(host='127.0.0.1', port=8081, debug=False)
```

**2. Install dependencies and run:**

```bash
pip install flask numpy
python mock_embedder.py
```

Output: `Starting mock embedding server on http://127.0.0.1:8081`

**3. Test the server (optional):**

```bash
curl -X POST http://127.0.0.1:8081/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"input":["hello world"],"model":"test"}'
```

You should see a JSON response with a 768-dimensional embedding.

### End-to-End Example with Local Model

**Step 1: Create a Lance database**

JSON-RPC:

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "herodb_createDatabase",
  "params": [
    "Lance",
    {"name": "local-vectors", "storage_path": null, "max_size": null, "redis_version": null},
    null
  ]
}
```

Expected response:

```json
{"jsonrpc":"2.0","id":1,"result":1}
```

The database ID is `1`.
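All of the `curl` calls in this tutorial can also be issued from Python using only the standard library. A minimal sketch, assuming the JSON-RPC endpoint from the Prerequisites section (the `make_payload` and `rpc_call` helper names are illustrative, not part of HeroDB):

```python
import json
import urllib.request

RPC_URL = "http://localhost:8080"  # HeroDB JSON-RPC endpoint (see Prerequisites)

def make_payload(method, params, req_id=1):
    """Build a JSON-RPC 2.0 request body."""
    return {"jsonrpc": "2.0", "id": req_id, "method": method, "params": params}

def rpc_call(method, params, req_id=1):
    """POST a JSON-RPC request to HeroDB and return the parsed response."""
    body = json.dumps(make_payload(method, params, req_id)).encode()
    req = urllib.request.Request(
        RPC_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Example: the createDatabase request from Step 1, as a Python dict
payload = make_payload(
    "herodb_createDatabase",
    ["Lance", {"name": "local-vectors", "storage_path": None,
               "max_size": None, "redis_version": None}, None],
)
```

With the server running, `rpc_call("herodb_createDatabase", [...])` returns the same parsed response shown above, and the remaining steps can reuse the helper with the corresponding method names.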
**Step 2: Configure embedding for the dataset**

JSON-RPC:

```json
{
  "jsonrpc": "2.0",
  "id": 2,
  "method": "herodb_lanceSetEmbeddingConfig",
  "params": [
    1,
    "products",
    {
      "provider": "openai",
      "model": "mock-local",
      "dim": 768,
      "endpoint": "http://127.0.0.1:8081/v1/embeddings",
      "headers": {"Authorization": "Bearer dummy"},
      "timeout_ms": 30000
    }
  ]
}
```

Redis-like:

```bash
redis-cli -p 6379
SELECT 1
LANCE.EMBEDDING CONFIG SET products PROVIDER openai MODEL mock-local DIM 768 ENDPOINT http://127.0.0.1:8081/v1/embeddings HEADER Authorization "Bearer dummy" TIMEOUTMS 30000
```

Expected response:

```json
{"jsonrpc":"2.0","id":2,"result":true}
```

**Step 3: Verify configuration**

JSON-RPC:

```json
{
  "jsonrpc": "2.0",
  "id": 3,
  "method": "herodb_lanceGetEmbeddingConfig",
  "params": [1, "products"]
}
```

Redis-like:

```bash
LANCE.EMBEDDING CONFIG GET products
```

Expected: returns your configuration with provider, model, dim, endpoint, etc.

**Step 4: Insert product data**

JSON-RPC (item 1):

```json
{
  "jsonrpc": "2.0",
  "id": 4,
  "method": "herodb_lanceStoreText",
  "params": [
    1,
    "products",
    "item-1",
    "Waterproof hiking boots with ankle support and aggressive tread",
    {"brand": "TrailMax", "category": "footwear", "price": "129.99"}
  ]
}
```

Redis-like:

```bash
LANCE.STORE products ID item-1 TEXT "Waterproof hiking boots with ankle support and aggressive tread" META brand TrailMax category footwear price 129.99
```

JSON-RPC (item 2):

```json
{
  "jsonrpc": "2.0",
  "id": 5,
  "method": "herodb_lanceStoreText",
  "params": [
    1,
    "products",
    "item-2",
    "Lightweight running shoes with breathable mesh upper",
    {"brand": "SpeedFit", "category": "footwear", "price": "89.99"}
  ]
}
```

JSON-RPC (item 3):

```json
{
  "jsonrpc": "2.0",
  "id": 6,
  "method": "herodb_lanceStoreText",
  "params": [
    1,
    "products",
    "item-3",
    "Insulated winter jacket with removable hood and multiple pockets",
    {"brand": "WarmTech", "category": "outerwear", "price": "199.99"}
  ]
}
```

JSON-RPC (item 4):

```json
{
  "jsonrpc": "2.0",
  "id": 7,
  "method": "herodb_lanceStoreText",
  "params": [
    1,
    "products",
    "item-4",
    "Camping tent for 4 people with waterproof rainfly",
    {"brand": "OutdoorPro", "category": "camping", "price": "249.99"}
  ]
}
```

Expected response for each: `{"jsonrpc":"2.0","id":N,"result":true}`

**Step 5: Search by text query**

JSON-RPC:

```json
{
  "jsonrpc": "2.0",
  "id": 8,
  "method": "herodb_lanceSearchText",
  "params": [
    1,
    "products",
    "boots for hiking in wet conditions",
    3,
    null,
    ["brand", "category", "price"]
  ]
}
```

Redis-like:

```bash
LANCE.SEARCH products K 3 QUERY "boots for hiking in wet conditions" RETURN 3 brand category price
```

Expected response:

```json
{
  "jsonrpc": "2.0",
  "id": 8,
  "result": {
    "results": [
      {
        "id": "item-1",
        "score": 0.234,
        "meta": {"brand": "TrailMax", "category": "footwear", "price": "129.99"}
      },
      ...
    ]
  }
}
```

**Step 6: Search with metadata filter**

JSON-RPC:

```json
{
  "jsonrpc": "2.0",
  "id": 9,
  "method": "herodb_lanceSearchText",
  "params": [
    1,
    "products",
    "comfortable shoes for running",
    5,
    "category = 'footwear'",
    null
  ]
}
```

Redis-like:

```bash
LANCE.SEARCH products K 5 QUERY "comfortable shoes for running" FILTER "category = 'footwear'"
```

This returns only items where `category` equals `'footwear'`.

**Step 7: List datasets**

JSON-RPC:

```json
{
  "jsonrpc": "2.0",
  "id": 10,
  "method": "herodb_lanceList",
  "params": [1]
}
```

Redis-like:

```bash
LANCE.LIST
```

**Step 8: Get dataset info**

JSON-RPC:

```json
{
  "jsonrpc": "2.0",
  "id": 11,
  "method": "herodb_lanceInfo",
  "params": [1, "products"]
}
```

Redis-like:

```bash
LANCE.INFO products
```

Returns dimension, row count, and other metadata.

---

## Scenario 2: OpenAI API

Use OpenAI's production embedding service for semantic search.

### Setup

**1. Set your API key:**

```bash
export OPENAI_API_KEY="sk-your-actual-openai-key-here"
```

**2.
Start HeroDB** (same as before):

```bash
./target/release/herodb --dir ./data --admin-secret my-admin-secret --enable-rpc --rpc-port 8080
```

### End-to-End Example with OpenAI

**Step 1: Create a Lance database**

JSON-RPC:

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "herodb_createDatabase",
  "params": [
    "Lance",
    {"name": "openai-vectors", "storage_path": null, "max_size": null, "redis_version": null},
    null
  ]
}
```

Expected: `{"jsonrpc":"2.0","id":1,"result":1}` (database ID = 1)

**Step 2: Configure OpenAI embeddings**

JSON-RPC:

```json
{
  "jsonrpc": "2.0",
  "id": 2,
  "method": "herodb_lanceSetEmbeddingConfig",
  "params": [
    1,
    "documents",
    {
      "provider": "openai",
      "model": "text-embedding-3-small",
      "dim": 1536,
      "endpoint": null,
      "headers": {},
      "timeout_ms": 30000
    }
  ]
}
```

Redis-like:

```bash
redis-cli -p 6379
SELECT 1
LANCE.EMBEDDING CONFIG SET documents PROVIDER openai MODEL text-embedding-3-small DIM 1536 TIMEOUTMS 30000
```

Notes:

- `endpoint` is `null` (defaults to the OpenAI API: https://api.openai.com/v1/embeddings)
- `headers` is empty (the Authorization header is added automatically from the `OPENAI_API_KEY` env var)
- `dim` is 1536 for text-embedding-3-small

Expected: `{"jsonrpc":"2.0","id":2,"result":true}`

**Step 3: Insert documents**

JSON-RPC:

```json
{
  "jsonrpc": "2.0",
  "id": 3,
  "method": "herodb_lanceStoreText",
  "params": [
    1,
    "documents",
    "doc-1",
    "The quick brown fox jumps over the lazy dog",
    {"source": "example", "lang": "en", "topic": "animals"}
  ]
}
```

```json
{
  "jsonrpc": "2.0",
  "id": 4,
  "method": "herodb_lanceStoreText",
  "params": [
    1,
    "documents",
    "doc-2",
    "Machine learning models require large datasets for training and validation",
    {"source": "tech", "lang": "en", "topic": "ai"}
  ]
}
```

```json
{
  "jsonrpc": "2.0",
  "id": 5,
  "method": "herodb_lanceStoreText",
  "params": [
    1,
    "documents",
    "doc-3",
    "Python is a popular programming language for data science and web development",
    {"source": "tech", "lang": "en", "topic": "programming"}
  ]
}
```

Redis-like:

```bash
LANCE.STORE documents ID doc-1 TEXT "The quick brown fox jumps over the lazy dog" META source example lang en topic animals
LANCE.STORE documents ID doc-2 TEXT "Machine learning models require large datasets for training and validation" META source tech lang en topic ai
LANCE.STORE documents ID doc-3 TEXT "Python is a popular programming language for data science and web development" META source tech lang en topic programming
```

Expected for each: `{"jsonrpc":"2.0","id":N,"result":true}`

**Step 4: Semantic search**

JSON-RPC:

```json
{
  "jsonrpc": "2.0",
  "id": 6,
  "method": "herodb_lanceSearchText",
  "params": [
    1,
    "documents",
    "artificial intelligence and neural networks",
    3,
    null,
    ["source", "topic"]
  ]
}
```

Redis-like:

```bash
LANCE.SEARCH documents K 3 QUERY "artificial intelligence and neural networks" RETURN 2 source topic
```

Expected response (doc-2 should rank highest due to semantic similarity):

```json
{
  "jsonrpc": "2.0",
  "id": 6,
  "result": {
    "results": [
      {"id": "doc-2", "score": 0.123, "meta": {"source": "tech", "topic": "ai"}},
      {"id": "doc-3", "score": 0.456, "meta": {"source": "tech", "topic": "programming"}},
      {"id": "doc-1", "score": 0.789, "meta": {"source": "example", "topic": "animals"}}
    ]
  }
}
```

Note: lower score = better match (L2 distance).

**Step 5: Search with filter**

JSON-RPC:

```json
{
  "jsonrpc": "2.0",
  "id": 7,
  "method": "herodb_lanceSearchText",
  "params": [
    1,
    "documents",
    "programming and software",
    5,
    "topic = 'programming'",
    null
  ]
}
```

Redis-like:

```bash
LANCE.SEARCH documents K 5 QUERY "programming and software" FILTER "topic = 'programming'"
```

This returns only documents where `topic` equals `'programming'`.

---

## Scenario 3: Deterministic Test Embedder (No Network)

For CI/offline development, use the built-in test embedder, which requires no external service.

### Configuration

JSON-RPC:

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "herodb_lanceSetEmbeddingConfig",
  "params": [
    1,
    "testdata",
    {
      "provider": "test",
      "model": "dev",
      "dim": 64,
      "endpoint": null,
      "headers": {},
      "timeout_ms": null
    }
  ]
}
```

Redis-like:

```bash
SELECT 1
LANCE.EMBEDDING CONFIG SET testdata PROVIDER test MODEL dev DIM 64
```

### Usage

Use `lanceStoreText` and `lanceSearchText` exactly as in the previous scenarios. The embeddings are:

- Deterministic (same text → same vector)
- Fast (no network)
- Not semantic (hash-based, not ML)

This makes them perfect for testing the vector storage/search mechanics without external dependencies.
---

## Advanced: Custom Headers and Timeouts

### Example: Local model with custom auth

JSON-RPC:

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "herodb_lanceSetEmbeddingConfig",
  "params": [
    1,
    "secure-data",
    {
      "provider": "openai",
      "model": "custom-model",
      "dim": 512,
      "endpoint": "http://192.168.1.100:9000/embeddings",
      "headers": {
        "Authorization": "Bearer my-local-token",
        "X-Custom-Header": "value"
      },
      "timeout_ms": 60000
    }
  ]
}
```

### Example: OpenAI with explicit API key (not from env)

JSON-RPC:

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "herodb_lanceSetEmbeddingConfig",
  "params": [
    1,
    "dataset",
    {
      "provider": "openai",
      "model": "text-embedding-3-small",
      "dim": 1536,
      "endpoint": null,
      "headers": {"Authorization": "Bearer sk-your-key-here"},
      "timeout_ms": 30000
    }
  ]
}
```

---

## Troubleshooting

### Error: "Embedding config not set for dataset"

**Cause:** You tried to use `lanceStoreText` or `lanceSearchText` without configuring an embedder.

**Solution:** Run `lanceSetEmbeddingConfig` first.

### Error: "Embedding dimension mismatch: expected X, got Y"

**Cause:** The embedding service returned vectors of a different size than configured.

**Solution:**

- For OpenAI text-embedding-3-small, use `dim: 1536`
- For the local mock from this tutorial, use `dim: 768`
- Check your embedding service's actual output dimension

### Error: "Missing API key in env 'OPENAI_API_KEY'"

**Cause:** Using the OpenAI provider without setting the API key.

**Solution:**

- Set `export OPENAI_API_KEY="sk-..."` before starting HeroDB, OR
- Pass the key explicitly in headers: `"Authorization": "Bearer sk-..."`

### Error: "HTTP request failed" or "Embeddings API error 404"

**Cause:** Cannot reach the embedding endpoint.
**Solution:**

- Verify your local server is running: `curl http://127.0.0.1:8081/v1/embeddings`
- Check the endpoint URL in your config
- Ensure the firewall allows the connection

### Error: "ERR DB backend is not Lance"

**Cause:** Trying to use `LANCE.*` commands on a non-Lance database.

**Solution:** Create the database with backend "Lance" (see Step 1).

### Error: "write permission denied"

**Cause:** The database is private and you haven't authenticated.

**Solution:** Use `SELECT KEY ` or make the database public via RPC.

---

## Complete Example Script (Bash + curl)

Save as `test_embeddings.sh`:

```bash
#!/bin/bash
RPC_URL="http://localhost:8080"

# 1. Create Lance database
curl -X POST $RPC_URL -H "Content-Type: application/json" -d '{
  "jsonrpc": "2.0", "id": 1,
  "method": "herodb_createDatabase",
  "params": ["Lance", {"name": "test-vectors", "storage_path": null, "max_size": null, "redis_version": null}, null]
}'
echo -e "\n"

# 2. Configure local embedder
curl -X POST $RPC_URL -H "Content-Type: application/json" -d '{
  "jsonrpc": "2.0", "id": 2,
  "method": "herodb_lanceSetEmbeddingConfig",
  "params": [1, "products", {
    "provider": "openai",
    "model": "mock",
    "dim": 768,
    "endpoint": "http://127.0.0.1:8081/v1/embeddings",
    "headers": {"Authorization": "Bearer dummy"},
    "timeout_ms": 30000
  }]
}'
echo -e "\n"

# 3. Insert data
curl -X POST $RPC_URL -H "Content-Type: application/json" -d '{
  "jsonrpc": "2.0", "id": 3,
  "method": "herodb_lanceStoreText",
  "params": [1, "products", "item-1", "Hiking boots", {"brand": "TrailMax"}]
}'
echo -e "\n"

# 4. Search
curl -X POST $RPC_URL -H "Content-Type: application/json" -d '{
  "jsonrpc": "2.0", "id": 4,
  "method": "herodb_lanceSearchText",
  "params": [1, "products", "outdoor footwear", 5, null, null]
}'
echo -e "\n"
```

Run:

```bash
chmod +x test_embeddings.sh
./test_embeddings.sh
```

---

## Summary

| Provider | Use Case | Endpoint | API Key |
|----------|----------|----------|---------|
| `openai` | Production semantic search | Default (OpenAI) or custom URL | `OPENAI_API_KEY` env or headers |
| `openai` | Local self-hosted gateway | http://127.0.0.1:8081/... | Optional (depends on your service) |
| `test` | CI/offline development | N/A (local hash) | None |
| `image_test` | Image testing | N/A (local hash) | None |

**Notes:**

- The `provider` field is always `"openai"` for OpenAI-compatible services, whether cloud or local, because it selects the OpenAI-compatible API shape
- Use `endpoint` to point to your local service
- Use `headers` for custom authentication
- `dim` must match your embedding service's output dimension
- Once configured, `lanceStoreText` and `lanceSearchText` handle embedding automatically
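Since the dimension mismatch from the Troubleshooting section is one of the easiest errors to hit when scripting your own ingestion, a small guard before each store call can fail fast on the client side. A minimal sketch (the `check_dim` helper is illustrative, not part of HeroDB):

```python
def check_dim(embedding: list[float], expected_dim: int) -> list[float]:
    """Raise early if an embedding's length doesn't match the configured dim."""
    if len(embedding) != expected_dim:
        raise ValueError(
            f"Embedding dimension mismatch: expected {expected_dim}, got {len(embedding)}"
        )
    return embedding

# text-embedding-3-small returns 1536-dim vectors; the tutorial's mock server returns 768
vec = [0.0] * 1536
check_dim(vec, 1536)  # passes silently
```

Validating on the client gives you a stack trace at the offending document instead of a server-side error after the HTTP round trip.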