
HeroDB Embedding Models: Complete Tutorial

This tutorial demonstrates how to use embedding models with HeroDB for vector search, covering local self-hosted models, OpenAI's API, and deterministic test embedders.


Prerequisites

Start HeroDB Server

Build and start HeroDB with RPC enabled:

cargo build --release
./target/release/herodb --dir ./data --admin-secret my-admin-secret --enable-rpc --rpc-port 8080

This starts:

  • Redis-compatible server on port 6379
  • JSON-RPC server on port 8080

Client Tools

For Redis-like commands:

redis-cli -p 6379

For JSON-RPC calls, use curl:

curl -X POST http://localhost:8080 \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","id":1,"method":"herodb_METHOD","params":[...]}'
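If you prefer scripting the RPC calls from Python, a minimal stdlib-only client sketch looks like this (the `make_rpc_payload`/`rpc_call` helper names are our own, not part of HeroDB):

```python
import json
import urllib.request

RPC_URL = "http://localhost:8080"

def make_rpc_payload(method, params, req_id=1):
    """Build a JSON-RPC 2.0 request body for a herodb_* method."""
    return {"jsonrpc": "2.0", "id": req_id, "method": f"herodb_{method}", "params": params}

def rpc_call(method, params, req_id=1):
    """POST the request to the HeroDB RPC server and return the decoded response."""
    body = json.dumps(make_rpc_payload(method, params, req_id)).encode()
    req = urllib.request.Request(
        RPC_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

if __name__ == "__main__":
    # Only attempt a real call when run as a script, with the server up.
    print(rpc_call("lanceList", [1]))
```

The later JSON-RPC examples in this tutorial map directly onto `rpc_call(method, params)` with the `herodb_` prefix stripped from the method name.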

Scenario 1: Local Embedding Model (Testing)

Run your own embedding service locally for development, testing, or privacy.

Option A: Python Mock Server (Simplest)

This creates a minimal OpenAI-compatible embedding server for testing.

1. Create mock_embedder.py:

from flask import Flask, request, jsonify
import numpy as np

app = Flask(__name__)

@app.route('/v1/embeddings', methods=['POST'])
def embeddings():
    """OpenAI-compatible embeddings endpoint"""
    data = request.json
    inputs = data.get('input', [])
    
    # Handle both single string and array
    if isinstance(inputs, str):
        inputs = [inputs]
    
    # Generate deterministic 768-dim embeddings (hash-based)
    embeddings = []
    for text in inputs:
        # Simple hash to vector (deterministic)
        vec = np.zeros(768)
        for i, char in enumerate(text[:768]):
            vec[i % 768] += ord(char) / 255.0
        
        # L2 normalize
        norm = np.linalg.norm(vec)
        if norm > 0:
            vec = vec / norm
        
        embeddings.append(vec.tolist())
    
    return jsonify({
        "data": [{"embedding": emb, "index": i} for i, emb in enumerate(embeddings)],
        "model": data.get('model', 'mock-local'),
        "usage": {"total_tokens": sum(len(t) for t in inputs)}
    })

if __name__ == '__main__':
    print("Starting mock embedding server on http://127.0.0.1:8081")
    app.run(host='127.0.0.1', port=8081, debug=False)

2. Install dependencies and run:

pip install flask numpy
python mock_embedder.py

Output: Starting mock embedding server on http://127.0.0.1:8081

3. Test the server (optional):

curl -X POST http://127.0.0.1:8081/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"input":["hello world"],"model":"test"}'

You should see a JSON response with a 768-dimensional embedding.
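You can also sanity-check the mock's embedding logic without Flask or numpy: the pure-Python sketch below mirrors the hash-based algorithm above and confirms that it is deterministic and L2-normalized.

```python
import math

def mock_embed(text, dim=768):
    """Pure-Python mirror of the mock server's hash embedding (no numpy)."""
    vec = [0.0] * dim
    for i, char in enumerate(text[:dim]):
        vec[i % dim] += ord(char) / 255.0
    # L2 normalize, same as the Flask mock
    norm = math.sqrt(sum(x * x for x in vec))
    if norm > 0:
        vec = [x / norm for x in vec]
    return vec

v1 = mock_embed("hello world")
v2 = mock_embed("hello world")
```

Here `v1 == v2` (same text, same vector), `len(v1)` is 768, and the vector has unit length — the properties the rest of the tutorial relies on.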

End-to-End Example with Local Model

Step 1: Create a Lance database

JSON-RPC:

{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "herodb_createDatabase",
  "params": [
    "Lance",
    { "name": "local-vectors", "storage_path": null, "max_size": null, "redis_version": null },
    null
  ]
}

Expected response:

{"jsonrpc":"2.0","id":1,"result":1}

The database ID is 1.

Step 2: Configure embedding for the dataset

JSON-RPC:

{
  "jsonrpc": "2.0",
  "id": 2,
  "method": "herodb_lanceSetEmbeddingConfig",
  "params": [
    1,
    "products",
    {
      "provider": "openai",
      "model": "mock-local",
      "dim": 768,
      "endpoint": "http://127.0.0.1:8081/v1/embeddings",
      "headers": {
        "Authorization": "Bearer dummy"
      },
      "timeout_ms": 30000
    }
  ]
}

Redis-like:

redis-cli -p 6379
SELECT 1
LANCE.EMBEDDING CONFIG SET products PROVIDER openai MODEL mock-local DIM 768 ENDPOINT http://127.0.0.1:8081/v1/embeddings HEADER Authorization "Bearer dummy" TIMEOUTMS 30000

Expected response:

{"jsonrpc":"2.0","id":2,"result":true}

Step 3: Verify configuration

JSON-RPC:

{
  "jsonrpc": "2.0",
  "id": 3,
  "method": "herodb_lanceGetEmbeddingConfig",
  "params": [1, "products"]
}

Redis-like:

LANCE.EMBEDDING CONFIG GET products

Expected: Returns your configuration with provider, model, dim, endpoint, etc.

Step 4: Insert product data

JSON-RPC (item 1):

{
  "jsonrpc": "2.0",
  "id": 4,
  "method": "herodb_lanceStoreText",
  "params": [
    1,
    "products",
    "item-1",
    "Waterproof hiking boots with ankle support and aggressive tread",
    { "brand": "TrailMax", "category": "footwear", "price": "129.99" }
  ]
}

Redis-like:

LANCE.STORE products ID item-1 TEXT "Waterproof hiking boots with ankle support and aggressive tread" META brand TrailMax category footwear price 129.99

JSON-RPC (item 2):

{
  "jsonrpc": "2.0",
  "id": 5,
  "method": "herodb_lanceStoreText",
  "params": [
    1,
    "products",
    "item-2",
    "Lightweight running shoes with breathable mesh upper",
    { "brand": "SpeedFit", "category": "footwear", "price": "89.99" }
  ]
}

JSON-RPC (item 3):

{
  "jsonrpc": "2.0",
  "id": 6,
  "method": "herodb_lanceStoreText",
  "params": [
    1,
    "products",
    "item-3",
    "Insulated winter jacket with removable hood and multiple pockets",
    { "brand": "WarmTech", "category": "outerwear", "price": "199.99" }
  ]
}

JSON-RPC (item 4):

{
  "jsonrpc": "2.0",
  "id": 7,
  "method": "herodb_lanceStoreText",
  "params": [
    1,
    "products",
    "item-4",
    "Camping tent for 4 people with waterproof rainfly",
    { "brand": "OutdoorPro", "category": "camping", "price": "249.99" }
  ]
}

Expected response for each: {"jsonrpc":"2.0","id":N,"result":true}
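Issuing the four inserts one curl at a time gets tedious; in a script you would typically build the `lanceStoreText` params in a loop and feed each entry to whatever JSON-RPC client you use (a sketch — the `products` list simply restates the items above):

```python
# Each tuple is (id, text, metadata); each entry becomes the params of one
# herodb_lanceStoreText call: [db_id, dataset, id, text, metadata].
products = [
    ("item-1", "Waterproof hiking boots with ankle support and aggressive tread",
     {"brand": "TrailMax", "category": "footwear", "price": "129.99"}),
    ("item-2", "Lightweight running shoes with breathable mesh upper",
     {"brand": "SpeedFit", "category": "footwear", "price": "89.99"}),
    ("item-3", "Insulated winter jacket with removable hood and multiple pockets",
     {"brand": "WarmTech", "category": "outerwear", "price": "199.99"}),
    ("item-4", "Camping tent for 4 people with waterproof rainfly",
     {"brand": "OutdoorPro", "category": "camping", "price": "249.99"}),
]

store_params = [
    [1, "products", item_id, text, meta] for item_id, text, meta in products
]
```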

Step 5: Search by text query

JSON-RPC:

{
  "jsonrpc": "2.0",
  "id": 8,
  "method": "herodb_lanceSearchText",
  "params": [
    1,
    "products",
    "boots for hiking in wet conditions",
    3,
    null,
    ["brand", "category", "price"]
  ]
}

Redis-like:

LANCE.SEARCH products K 3 QUERY "boots for hiking in wet conditions" RETURN 3 brand category price

Expected response:

{
  "jsonrpc": "2.0",
  "id": 8,
  "result": {
    "results": [
      {
        "id": "item-1",
        "score": 0.234,
        "meta": {
          "brand": "TrailMax",
          "category": "footwear",
          "price": "129.99"
        }
      },
      ...
    ]
  }
}
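When consuming the response programmatically, remember that scores are distances, so smaller is better. A small helper makes that explicit (the sample scores below are illustrative, not real output):

```python
# HeroDB returns results ordered already; sorting defensively by score keeps
# the "lower distance = closer match" convention explicit in client code.
def top_ids(search_result, k=3):
    hits = sorted(search_result["results"], key=lambda r: r["score"])
    return [r["id"] for r in hits[:k]]

sample = {
    "results": [
        {"id": "item-1", "score": 0.234, "meta": {"brand": "TrailMax"}},
        {"id": "item-2", "score": 0.611, "meta": {"brand": "SpeedFit"}},
    ]
}
```

`top_ids(sample)` yields `["item-1", "item-2"]` — the hiking boots rank first because their distance is smallest.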

Step 6: Search with metadata filter

JSON-RPC:

{
  "jsonrpc": "2.0",
  "id": 9,
  "method": "herodb_lanceSearchText",
  "params": [
    1,
    "products",
    "comfortable shoes for running",
    5,
    "category = 'footwear'",
    null
  ]
}

Redis-like:

LANCE.SEARCH products K 5 QUERY "comfortable shoes for running" FILTER "category = 'footwear'"

This returns only items where category equals 'footwear'.
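If filter values come from user input, build the filter string rather than concatenating by hand. The sketch below doubles single quotes, which is the standard SQL escaping convention and is assumed (not verified here) to apply to Lance's SQL-like filters:

```python
def eq_filter(column, value):
    """Build a "col = 'value'" filter string, doubling any single quotes in
    the value (standard SQL escaping; assumed to hold for Lance filters)."""
    escaped = value.replace("'", "''")
    return f"{column} = '{escaped}'"
```

For example, `eq_filter("category", "footwear")` produces exactly the filter used above, and a value like `O'Neill` is escaped to `'O''Neill'` instead of breaking the expression.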

Step 7: List datasets

JSON-RPC:

{
  "jsonrpc": "2.0",
  "id": 10,
  "method": "herodb_lanceList",
  "params": [1]
}

Redis-like:

LANCE.LIST

Step 8: Get dataset info

JSON-RPC:

{
  "jsonrpc": "2.0",
  "id": 11,
  "method": "herodb_lanceInfo",
  "params": [1, "products"]
}

Redis-like:

LANCE.INFO products

Returns dimension, row count, and other metadata.


Scenario 2: OpenAI API

Use OpenAI's production embedding service for semantic search.

Setup

1. Set your API key:

export OPENAI_API_KEY="sk-your-actual-openai-key-here"

2. Start HeroDB (same as before):

./target/release/herodb --dir ./data --admin-secret my-admin-secret --enable-rpc --rpc-port 8080

End-to-End Example with OpenAI

Step 1: Create a Lance database

JSON-RPC:

{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "herodb_createDatabase",
  "params": [
    "Lance",
    { "name": "openai-vectors", "storage_path": null, "max_size": null, "redis_version": null },
    null
  ]
}

Expected: {"jsonrpc":"2.0","id":1,"result":1} (database ID = 1)

Step 2: Configure OpenAI embeddings

JSON-RPC:

{
  "jsonrpc": "2.0",
  "id": 2,
  "method": "herodb_lanceSetEmbeddingConfig",
  "params": [
    1,
    "documents",
    {
      "provider": "openai",
      "model": "text-embedding-3-small",
      "dim": 1536,
      "endpoint": null,
      "headers": {},
      "timeout_ms": 30000
    }
  ]
}

Redis-like:

redis-cli -p 6379
SELECT 1
LANCE.EMBEDDING CONFIG SET documents PROVIDER openai MODEL text-embedding-3-small DIM 1536 TIMEOUTMS 30000

Notes:

  • endpoint is null (defaults to OpenAI API: https://api.openai.com/v1/embeddings)
  • headers is empty (Authorization auto-added from OPENAI_API_KEY env var)
  • dim is 1536 for text-embedding-3-small

Expected: {"jsonrpc":"2.0","id":2,"result":true}

Step 3: Insert documents

JSON-RPC:

{
  "jsonrpc": "2.0",
  "id": 3,
  "method": "herodb_lanceStoreText",
  "params": [
    1,
    "documents",
    "doc-1",
    "The quick brown fox jumps over the lazy dog",
    { "source": "example", "lang": "en", "topic": "animals" }
  ]
}
{
  "jsonrpc": "2.0",
  "id": 4,
  "method": "herodb_lanceStoreText",
  "params": [
    1,
    "documents",
    "doc-2",
    "Machine learning models require large datasets for training and validation",
    { "source": "tech", "lang": "en", "topic": "ai" }
  ]
}
{
  "jsonrpc": "2.0",
  "id": 5,
  "method": "herodb_lanceStoreText",
  "params": [
    1,
    "documents",
    "doc-3",
    "Python is a popular programming language for data science and web development",
    { "source": "tech", "lang": "en", "topic": "programming" }
  ]
}

Redis-like:

LANCE.STORE documents ID doc-1 TEXT "The quick brown fox jumps over the lazy dog" META source example lang en topic animals
LANCE.STORE documents ID doc-2 TEXT "Machine learning models require large datasets for training and validation" META source tech lang en topic ai
LANCE.STORE documents ID doc-3 TEXT "Python is a popular programming language for data science and web development" META source tech lang en topic programming

Expected for each: {"jsonrpc":"2.0","id":N,"result":true}

Step 4: Semantic search

JSON-RPC:

{
  "jsonrpc": "2.0",
  "id": 6,
  "method": "herodb_lanceSearchText",
  "params": [
    1,
    "documents",
    "artificial intelligence and neural networks",
    3,
    null,
    ["source", "topic"]
  ]
}

Redis-like:

LANCE.SEARCH documents K 3 QUERY "artificial intelligence and neural networks" RETURN 2 source topic

Expected response (doc-2 should rank highest due to semantic similarity):

{
  "jsonrpc": "2.0",
  "id": 6,
  "result": {
    "results": [
      {
        "id": "doc-2",
        "score": 0.123,
        "meta": {
          "source": "tech",
          "topic": "ai"
        }
      },
      {
        "id": "doc-3",
        "score": 0.456,
        "meta": {
          "source": "tech",
          "topic": "programming"
        }
      },
      {
        "id": "doc-1",
        "score": 0.789,
        "meta": {
          "source": "example",
          "topic": "animals"
        }
      }
    ]
  }
}

Note: Lower score = better match (L2 distance).
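To make the scoring convention concrete, here is the L2 (Euclidean) distance written out — identical vectors score 0, the best possible match, and the score grows as vectors diverge:

```python
import math

def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Identical vectors -> 0.0 (best score); a classic 3-4-5 triangle -> 5.0.
print(l2_distance([1.0, 0.0], [1.0, 0.0]))  # 0.0
print(l2_distance([0.0, 0.0], [3.0, 4.0]))  # 5.0
```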

Step 5: Search with filter

JSON-RPC:

{
  "jsonrpc": "2.0",
  "id": 7,
  "method": "herodb_lanceSearchText",
  "params": [
    1,
    "documents",
    "programming and software",
    5,
    "topic = 'programming'",
    null
  ]
}

Redis-like:

LANCE.SEARCH documents K 5 QUERY "programming and software" FILTER "topic = 'programming'"

This returns only documents where topic equals 'programming'.



Scenario 3: Deterministic Test Embedder (No Network)

For CI/offline development, use the built-in test embedder that requires no external service.

Configuration

JSON-RPC:

{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "herodb_lanceSetEmbeddingConfig",
  "params": [
    1,
    "testdata",
    {
      "provider": "test",
      "model": "dev",
      "dim": 64,
      "endpoint": null,
      "headers": {},
      "timeout_ms": null
    }
  ]
}

Redis-like:

SELECT 1
LANCE.EMBEDDING CONFIG SET testdata PROVIDER test MODEL dev DIM 64

Usage

Use lanceStoreText and lanceSearchText as in previous scenarios. The embeddings are:

  • Deterministic (same text → same vector)
  • Fast (no network)
  • Not semantic (hash-based, not ML)

Perfect for testing the vector storage/search mechanics without external dependencies.
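HeroDB's built-in test embedder algorithm isn't documented here, but a hypothetical hash-based embedder shows the property that matters — the same text always maps to the same fixed-size vector, with no model and no network:

```python
import hashlib
import struct

def hash_embed(text, dim=64):
    """Hypothetical deterministic embedder: expand SHA-256 digests of the
    text into dim floats in [0, 1). Illustrates the determinism property
    only; this is NOT HeroDB's actual test-embedder algorithm."""
    out = []
    counter = 0
    while len(out) < dim:
        digest = hashlib.sha256(f"{counter}:{text}".encode()).digest()
        for i in range(0, len(digest) - 3, 4):
            (n,) = struct.unpack(">I", digest[i:i + 4])
            out.append(n / 2**32)
            if len(out) == dim:
                break
        counter += 1
    return out

a = hash_embed("same text")
b = hash_embed("same text")
```

`a == b` holds on every run and every machine, which is exactly what makes such embedders suitable for CI: search results are reproducible even though they carry no semantic meaning.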


Advanced: Custom Headers and Timeouts

Example: Local model with custom auth

JSON-RPC:

{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "herodb_lanceSetEmbeddingConfig",
  "params": [
    1,
    "secure-data",
    {
      "provider": "openai",
      "model": "custom-model",
      "dim": 512,
      "endpoint": "http://192.168.1.100:9000/embeddings",
      "headers": {
        "Authorization": "Bearer my-local-token",
        "X-Custom-Header": "value"
      },
      "timeout_ms": 60000
    }
  ]
}

Example: OpenAI with explicit API key (not from env)

JSON-RPC:

{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "herodb_lanceSetEmbeddingConfig",
  "params": [
    1,
    "dataset",
    {
      "provider": "openai",
      "model": "text-embedding-3-small",
      "dim": 1536,
      "endpoint": null,
      "headers": {
        "Authorization": "Bearer sk-your-key-here"
      },
      "timeout_ms": 30000
    }
  ]
}

Troubleshooting

Error: "Embedding config not set for dataset"

Cause: You tried to use lanceStoreText or lanceSearchText without configuring an embedder.

Solution: Run lanceSetEmbeddingConfig first.

Error: "Embedding dimension mismatch: expected X, got Y"

Cause: The embedding service returned vectors of a different size than configured.

Solution:

  • For OpenAI text-embedding-3-small, use dim: 1536
  • For your local mock (from this tutorial), use dim: 768
  • Check your embedding service's actual output dimension
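A small lookup of common OpenAI embedding dimensions can prevent this class of mismatch up front (values reflect OpenAI's published defaults at the time of writing — verify against their current docs before relying on them):

```python
# Default output dimensions for common OpenAI embedding models.
OPENAI_DIMS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
    "text-embedding-ada-002": 1536,
}

def expected_dim(model, default=None):
    """Return the known default dimension for a model, or a fallback."""
    return OPENAI_DIMS.get(model, default)
```

For a self-hosted model, pass the dimension you measured from your own service as the fallback.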

Error: "Missing API key in env 'OPENAI_API_KEY'"

Cause: Using OpenAI provider without setting the API key.

Solution:

  • Set export OPENAI_API_KEY="sk-..." before starting HeroDB, OR
  • Pass the key explicitly in headers: "Authorization": "Bearer sk-..."

Error: "HTTP request failed" or "Embeddings API error 404"

Cause: Cannot reach the embedding endpoint.

Solution:

  • Verify your local server is running (the endpoint only accepts POST): curl -X POST http://127.0.0.1:8081/v1/embeddings -H "Content-Type: application/json" -d '{"input":["test"]}'
  • Check the endpoint URL in your config
  • Ensure firewall allows the connection

Error: "ERR DB backend is not Lance"

Cause: Trying to use LANCE.* commands on a non-Lance database.

Solution: Create the database with backend "Lance" (see Step 1).

Error: "write permission denied"

Cause: Database is private and you haven't authenticated.

Solution: Use SELECT <db_id> KEY <access-key> or make the database public via RPC.


Complete Example Script (Bash + curl)

Save as test_embeddings.sh:

#!/bin/bash

RPC_URL="http://localhost:8080"

# 1. Create Lance database
curl -X POST $RPC_URL -H "Content-Type: application/json" -d '{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "herodb_createDatabase",
  "params": ["Lance", {"name": "test-vectors", "storage_path": null, "max_size": null, "redis_version": null}, null]
}'

echo -e "\n"

# 2. Configure local embedder
curl -X POST $RPC_URL -H "Content-Type: application/json" -d '{
  "jsonrpc": "2.0",
  "id": 2,
  "method": "herodb_lanceSetEmbeddingConfig",
  "params": [1, "products", {
    "provider": "openai",
    "model": "mock",
    "dim": 768,
    "endpoint": "http://127.0.0.1:8081/v1/embeddings",
    "headers": {"Authorization": "Bearer dummy"},
    "timeout_ms": 30000
  }]
}'

echo -e "\n"

# 3. Insert data
curl -X POST $RPC_URL -H "Content-Type: application/json" -d '{
  "jsonrpc": "2.0",
  "id": 3,
  "method": "herodb_lanceStoreText",
  "params": [1, "products", "item-1", "Hiking boots", {"brand": "TrailMax"}]
}'

echo -e "\n"

# 4. Search
curl -X POST $RPC_URL -H "Content-Type: application/json" -d '{
  "jsonrpc": "2.0",
  "id": 4,
  "method": "herodb_lanceSearchText",
  "params": [1, "products", "outdoor footwear", 5, null, null]
}'

echo -e "\n"

Run:

chmod +x test_embeddings.sh
./test_embeddings.sh

Summary

| Provider | Use Case | Endpoint | API Key |
|---|---|---|---|
| openai | Production semantic search | Default (OpenAI) or custom URL | OPENAI_API_KEY env or headers |
| openai | Local self-hosted gateway | http://127.0.0.1:8081/... | Optional (depends on your service) |
| test | CI/offline development | N/A (local hash) | None |
| image_test | Image testing | N/A (local hash) | None |

Notes:

  • The provider field is always "openai" for any OpenAI-compatible service, cloud or self-hosted, because HeroDB speaks the OpenAI API request/response shape to it
  • Use endpoint to point to your local service
  • Use headers for custom authentication
  • dim must match your embedding service's output dimension
  • Once configured, lanceStoreText and lanceSearchText handle embedding automatically