Maxime Van Hees df780e20a2 fixed a few bugs related to vector embedding + added additional end to end documentation to showcase local and external embedders step-by-step + added example mock embedder python script

2025-10-16 15:30:45 +02:00

HeroDB Embedding Models: Complete Tutorial

This tutorial demonstrates how to use embedding models with HeroDB for vector search. It covers local self-hosted models, OpenAI's API, and the built-in deterministic test embedder.

Prerequisites
Scenario 1: Local Embedding Model
Scenario 2: OpenAI API
Scenario 3: Deterministic Test Embedder
Troubleshooting

Prerequisites

Start HeroDB Server

Build and start HeroDB with RPC enabled:

cargo build --release
./target/release/herodb --dir ./data --admin-secret my-admin-secret --enable-rpc --rpc-port 8080

This starts:

Redis-compatible server on port 6379
JSON-RPC server on port 8080

Client Tools

For Redis-like commands:

redis-cli -p 6379

For JSON-RPC calls, use curl:

curl -X POST http://localhost:8080 \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","id":1,"method":"herodb_METHOD","params":[...]}'
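The same envelope can be sent from Python. The sketch below is a minimal helper that uses only the standard library; the names `build_request` and `rpc_call` are hypothetical and not part of HeroDB, and the URL assumes the `--rpc-port 8080` setting shown above.

```python
import json
import urllib.request

RPC_URL = "http://localhost:8080"  # assumes --rpc-port 8080 as started above


def build_request(method, params, request_id=1):
    """Return the JSON-RPC 2.0 envelope as a plain dict."""
    return {"jsonrpc": "2.0", "id": request_id, "method": method, "params": params}


def rpc_call(method, params, request_id=1, url=RPC_URL, timeout=30):
    """POST one JSON-RPC request and return the decoded response dict."""
    body = json.dumps(build_request(method, params, request_id)).encode("utf-8")
    req = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server running, `rpc_call("herodb_lanceList", [1])` would send the same request as the curl form with `herodb_lanceList` as the method.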

Scenario 1: Local Embedding Model (Testing)

Run your own embedding service locally for development, testing, or privacy.

Option A: Python Mock Server (Simplest)

This creates a minimal OpenAI-compatible embedding server for testing.

1. Create mock_embedder.py:

from flask import Flask, request, jsonify
import numpy as np

app = Flask(__name__)

@app.route('/v1/embeddings', methods=['POST'])
def embeddings():
    """OpenAI-compatible embeddings endpoint"""
    data = request.json
    inputs = data.get('input', [])
    
    # Handle both single string and array
    if isinstance(inputs, str):
        inputs = [inputs]
    
    # Generate deterministic 768-dim embeddings (character-code based, not semantic)
    embeddings = []
    for text in inputs:
        # Add each character's scaled code point at its position;
        # only the first 768 characters contribute
        vec = np.zeros(768)
        for i, char in enumerate(text[:768]):
            vec[i] += ord(char) / 255.0
        
        # L2 normalize
        norm = np.linalg.norm(vec)
        if norm > 0:
            vec = vec / norm
        
        embeddings.append(vec.tolist())
    
    return jsonify({
        "data": [{"embedding": emb, "index": i} for i, emb in enumerate(embeddings)],
        "model": data.get('model', 'mock-local'),
        "usage": {"total_tokens": sum(len(t) for t in inputs)}
    })

if __name__ == '__main__':
    print("Starting mock embedding server on http://127.0.0.1:8081")
    app.run(host='127.0.0.1', port=8081, debug=False)

2. Install dependencies and run:

pip install flask numpy
python mock_embedder.py

Output: Starting mock embedding server on http://127.0.0.1:8081

3. Test the server (optional):

curl -X POST http://127.0.0.1:8081/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"input":["hello world"],"model":"test"}'

You should see a JSON response with a 768-dimensional embedding.
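To check the response programmatically rather than by eye, a small validator can confirm the vector length and the L2 normalization that the mock applies. This is a sketch under the assumption that the response has the `data[].embedding` and `data[].index` fields produced by `mock_embedder.py`; the function name is hypothetical.

```python
import math


def check_embeddings_response(payload, expected_dim=768, tol=1e-6):
    """Return the vectors in index order, raising ValueError on a bad shape."""
    items = sorted(payload["data"], key=lambda item: item["index"])
    vectors = [item["embedding"] for item in items]
    for pos, vec in enumerate(vectors):
        if len(vec) != expected_dim:
            raise ValueError(
                f"vector {pos}: expected {expected_dim} dims, got {len(vec)}"
            )
        norm = math.sqrt(sum(x * x for x in vec))
        # The mock L2-normalizes non-empty input; empty input stays all-zero.
        if norm > 0 and abs(norm - 1.0) > tol:
            raise ValueError(f"vector {pos}: norm {norm} is not 1")
    return vectors
```

Pass it the decoded JSON body from the curl test above; a mismatch here usually means the `dim` value you plan to configure in HeroDB is wrong.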

End-to-End Example with Local Model

Step 1: Create a Lance database

JSON-RPC:

{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "herodb_createDatabase",
  "params": [
    "Lance",
    { "name": "local-vectors", "storage_path": null, "max_size": null, "redis_version": null },
    null
  ]
}

Expected response:

{"jsonrpc":"2.0","id":1,"result":1}

The database ID is 1.

Step 2: Configure embedding for the dataset

JSON-RPC:

{
  "jsonrpc": "2.0",
  "id": 2,
  "method": "herodb_lanceSetEmbeddingConfig",
  "params": [
    1,
    "products",
    {
      "provider": "openai",
      "model": "mock-local",
      "dim": 768,
      "endpoint": "http://127.0.0.1:8081/v1/embeddings",
      "headers": {
        "Authorization": "Bearer dummy"
      },
      "timeout_ms": 30000
    }
  ]
}

Redis-like:

redis-cli -p 6379
SELECT 1
LANCE.EMBEDDING CONFIG SET products PROVIDER openai MODEL mock-local DIM 768 ENDPOINT http://127.0.0.1:8081/v1/embeddings HEADER Authorization "Bearer dummy" TIMEOUTMS 30000

Expected JSON-RPC response:

{"jsonrpc":"2.0","id":2,"result":true}

Step 3: Verify configuration

JSON-RPC:

{
  "jsonrpc": "2.0",
  "id": 3,
  "method": "herodb_lanceGetEmbeddingConfig",
  "params": [1, "products"]
}

Redis-like:

LANCE.EMBEDDING CONFIG GET products

Expected: Returns your configuration with provider, model, dim, endpoint, etc.
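The Redis-like form from Step 2 can be assembled from the same config object used in the JSON-RPC call. The sketch below only builds the argument list using the keywords shown in this tutorial (`PROVIDER`, `MODEL`, `DIM`, `ENDPOINT`, `HEADER`, `TIMEOUTMS`); the function name is hypothetical, and repeating `HEADER` once per header is an assumption.

```python
def embedding_config_args(dataset, config):
    """Build the LANCE.EMBEDDING CONFIG SET argument list from a config dict."""
    args = [
        "LANCE.EMBEDDING", "CONFIG", "SET", dataset,
        "PROVIDER", config["provider"],
        "MODEL", config["model"],
        "DIM", str(config["dim"]),
    ]
    # Optional fields are omitted when null/empty, as in the OpenAI example.
    if config.get("endpoint"):
        args += ["ENDPOINT", config["endpoint"]]
    # Assumption: one HEADER <name> <value> group per header.
    for name, value in (config.get("headers") or {}).items():
        args += ["HEADER", name, value]
    if config.get("timeout_ms") is not None:
        args += ["TIMEOUTMS", str(config["timeout_ms"])]
    return args
```

A Redis client that accepts raw commands could then send the list as-is, for example via redis-py's `execute_command(*args)` after selecting the database.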

Step 4: Insert product data

JSON-RPC (item 1):

{
  "jsonrpc": "2.0",
  "id": 4,
  "method": "herodb_lanceStoreText",
  "params": [
    1,
    "products",
    "item-1",
    "Waterproof hiking boots with ankle support and aggressive tread",
    { "brand": "TrailMax", "category": "footwear", "price": "129.99" }
  ]
}

Redis-like:

LANCE.STORE products ID item-1 TEXT "Waterproof hiking boots with ankle support and aggressive tread" META brand TrailMax category footwear price 129.99

JSON-RPC (item 2):

{
  "jsonrpc": "2.0",
  "id": 5,
  "method": "herodb_lanceStoreText",
  "params": [
    1,
    "products",
    "item-2",
    "Lightweight running shoes with breathable mesh upper",
    { "brand": "SpeedFit", "category": "footwear", "price": "89.99" }
  ]
}

JSON-RPC (item 3):

{
  "jsonrpc": "2.0",
  "id": 6,
  "method": "herodb_lanceStoreText",
  "params": [
    1,
    "products",
    "item-3",
    "Insulated winter jacket with removable hood and multiple pockets",
    { "brand": "WarmTech", "category": "outerwear", "price": "199.99" }
  ]
}

JSON-RPC (item 4):

{
  "jsonrpc": "2.0",
  "id": 7,
  "method": "herodb_lanceStoreText",
  "params": [
    1,
    "products",
    "item-4",
    "Camping tent for 4 people with waterproof rainfly",
    { "brand": "OutdoorPro", "category": "camping", "price": "249.99" }
  ]
}

Expected response for each: {"jsonrpc":"2.0","id":N,"result":true}
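Writing one request per item by hand gets repetitive. As a sketch, the four payloads above can be generated from a list of `(id, text, meta)` tuples; `store_text_requests` is a hypothetical helper, and it stringifies metadata values only to mirror how this tutorial passes them.

```python
def store_text_requests(db_id, dataset, items, first_id=1):
    """Build one herodb_lanceStoreText request per (item_id, text, meta) tuple."""
    requests = []
    for offset, (item_id, text, meta) in enumerate(items):
        # The tutorial passes metadata values as strings (e.g. "129.99").
        string_meta = {key: str(value) for key, value in meta.items()}
        requests.append({
            "jsonrpc": "2.0",
            "id": first_id + offset,
            "method": "herodb_lanceStoreText",
            "params": [db_id, dataset, item_id, text, string_meta],
        })
    return requests


catalog = [
    ("item-1", "Waterproof hiking boots with ankle support and aggressive tread",
     {"brand": "TrailMax", "category": "footwear", "price": "129.99"}),
    ("item-2", "Lightweight running shoes with breathable mesh upper",
     {"brand": "SpeedFit", "category": "footwear", "price": "89.99"}),
]
payloads = store_text_requests(1, "products", catalog, first_id=4)
```

Each entry in `payloads` can then be posted individually to the JSON-RPC endpoint.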

Step 5: Search by text query

JSON-RPC:

{
  "jsonrpc": "2.0",
  "id": 8,
  "method": "herodb_lanceSearchText",
  "params": [
    1,
    "products",
    "boots for hiking in wet conditions",
    3,
    null,
    ["brand", "category", "price"]
  ]
}

Redis-like:

LANCE.SEARCH products K 3 QUERY "boots for hiking in wet conditions" RETURN 3 brand category price

Expected JSON-RPC response:

{
  "jsonrpc": "2.0",
  "id": 8,
  "result": {
    "results": [
      {
        "id": "item-1",
        "score": 0.234,
        "meta": {
          "brand": "TrailMax",
          "category": "footwear",
          "price": "129.99"
        }
      },
      ...
    ]
  }
}
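A client usually wants just the ids, scores, and metadata from this structure. The following sketch assumes the response shape shown above (`result.results[]` with `id`, `score`, and `meta`) and the standard JSON-RPC 2.0 `error` member; `summarize_hits` is a hypothetical name.

```python
def summarize_hits(response):
    """Return (id, score, meta) tuples ordered by ascending score."""
    if "error" in response:
        raise RuntimeError(f"RPC error: {response['error']}")
    hits = response["result"]["results"]
    # Scores are distances in this tutorial, so smaller values come first.
    ordered = sorted(hits, key=lambda hit: hit["score"])
    return [(hit["id"], hit["score"], hit.get("meta", {})) for hit in ordered]
```

Sorting is defensive here; if the server already returns hits in rank order, the order is unchanged.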

Step 6: Search with metadata filter

JSON-RPC:

{
  "jsonrpc": "2.0",
  "id": 9,
  "method": "herodb_lanceSearchText",
  "params": [
    1,
    "products",
    "comfortable shoes for running",
    5,
    "category = 'footwear'",
    null
  ]
}

Redis-like:

LANCE.SEARCH products K 5 QUERY "comfortable shoes for running" FILTER "category = 'footwear'"

This returns only items where category equals 'footwear'.
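When the filter value comes from user input, it is safer to build the expression with a helper than by string concatenation. The sketch below produces the same `field = 'value'` form used above; doubling embedded single quotes follows the usual SQL convention and is an assumption about how HeroDB parses filters, so verify it against your version.

```python
def equals_filter(field, value):
    """Build a `field = 'value'` filter string with basic quoting."""
    if not field.isidentifier():
        raise ValueError(f"unsafe field name: {field!r}")
    # Assumption: SQL-style escaping, where ' inside a literal becomes ''.
    escaped = value.replace("'", "''")
    return f"{field} = '{escaped}'"


print(equals_filter("category", "footwear"))  # category = 'footwear'
```

The result can be passed as the fifth parameter of `herodb_lanceSearchText` or after `FILTER` in the Redis-like form.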

Step 7: List datasets

JSON-RPC:

{
  "jsonrpc": "2.0",
  "id": 10,
  "method": "herodb_lanceList",
  "params": [1]
}

Redis-like:

LANCE.LIST

Step 8: Get dataset info

JSON-RPC:

{
  "jsonrpc": "2.0",
  "id": 11,
  "method": "herodb_lanceInfo",
  "params": [1, "products"]
}

Redis-like:

LANCE.INFO products

Returns dimension, row count, and other metadata.

Scenario 2: OpenAI API

Use OpenAI's production embedding service for semantic search.

Setup

1. Set your API key:

export OPENAI_API_KEY="sk-your-actual-openai-key-here"

2. Start HeroDB (same as before):

./target/release/herodb --dir ./data --admin-secret my-admin-secret --enable-rpc --rpc-port 8080

End-to-End Example with OpenAI

Step 1: Create a Lance database

JSON-RPC:

{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "herodb_createDatabase",
  "params": [
    "Lance",
    { "name": "openai-vectors", "storage_path": null, "max_size": null, "redis_version": null },
    null
  ]
}

Expected: {"jsonrpc":"2.0","id":1,"result":1} (database ID = 1)

Step 2: Configure OpenAI embeddings

JSON-RPC:

{
  "jsonrpc": "2.0",
  "id": 2,
  "method": "herodb_lanceSetEmbeddingConfig",
  "params": [
    1,
    "documents",
    {
      "provider": "openai",
      "model": "text-embedding-3-small",
      "dim": 1536,
      "endpoint": null,
      "headers": {},
      "timeout_ms": 30000
    }
  ]
}

Redis-like:

redis-cli -p 6379
SELECT 1
LANCE.EMBEDDING CONFIG SET documents PROVIDER openai MODEL text-embedding-3-small DIM 1536 TIMEOUTMS 30000

Notes:

endpoint is null (defaults to OpenAI API: https://api.openai.com/v1/embeddings)
headers is empty (Authorization auto-added from OPENAI_API_KEY env var)
dim is 1536 for text-embedding-3-small

Expected: {"jsonrpc":"2.0","id":2,"result":true}

Step 3: Insert documents

JSON-RPC:

{
  "jsonrpc": "2.0",
  "id": 3,
  "method": "herodb_lanceStoreText",
  "params": [
    1,
    "documents",
    "doc-1",
    "The quick brown fox jumps over the lazy dog",
    { "source": "example", "lang": "en", "topic": "animals" }
  ]
}

{
  "jsonrpc": "2.0",
  "id": 4,
  "method": "herodb_lanceStoreText",
  "params": [
    1,
    "documents",
    "doc-2",
    "Machine learning models require large datasets for training and validation",
    { "source": "tech", "lang": "en", "topic": "ai" }
  ]
}

{
  "jsonrpc": "2.0",
  "id": 5,
  "method": "herodb_lanceStoreText",
  "params": [
    1,
    "documents",
    "doc-3",
    "Python is a popular programming language for data science and web development",
    { "source": "tech", "lang": "en", "topic": "programming" }
  ]
}

Redis-like:

LANCE.STORE documents ID doc-1 TEXT "The quick brown fox jumps over the lazy dog" META source example lang en topic animals
LANCE.STORE documents ID doc-2 TEXT "Machine learning models require large datasets for training and validation" META source tech lang en topic ai
LANCE.STORE documents ID doc-3 TEXT "Python is a popular programming language for data science and web development" META source tech lang en topic programming

Expected for each: {"jsonrpc":"2.0","id":N,"result":true}

Step 4: Semantic search

JSON-RPC:

{
  "jsonrpc": "2.0",
  "id": 6,
  "method": "herodb_lanceSearchText",
  "params": [
    1,
    "documents",
    "artificial intelligence and neural networks",
    3,
    null,
    ["source", "topic"]
  ]
}

Redis-like:

LANCE.SEARCH documents K 3 QUERY "artificial intelligence and neural networks" RETURN 2 source topic

Expected response (doc-2 should rank highest due to semantic similarity):

{
  "jsonrpc": "2.0",
  "id": 6,
  "result": {
    "results": [
      {
        "id": "doc-2",
        "score": 0.123,
        "meta": {
          "source": "tech",
          "topic": "ai"
        }
      },
      {
        "id": "doc-3",
        "score": 0.456,
        "meta": {
          "source": "tech",
          "topic": "programming"
        }
      },
      {
        "id": "doc-1",
        "score": 0.789,
        "meta": {
          "source": "example",
          "topic": "animals"
        }
      }
    ]
  }
}

Note: Lower score = better match (L2 distance).
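To make the "lower is better" rule concrete, here is a small worked example of L2 (Euclidean) distance on two-dimensional unit vectors. The vectors and helper names are made up for illustration; real embeddings have the configured `dim`.

```python
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_by_distance(query, candidates):
    """Return candidate ids ordered from closest to farthest."""
    return sorted(candidates, key=lambda cid: l2_distance(query, candidates[cid]))


query = [1.0, 0.0]
candidates = {"near": [0.8, 0.6], "far": [0.0, 1.0]}
print(rank_by_distance(query, candidates))  # ['near', 'far']
```

Here the squared distance to `near` is 0.04 + 0.36 = 0.4, while the squared distance to `far` is 2, so `near` ranks first. For unit-length vectors the squared L2 distance equals 2 − 2·cos(θ), so ranking by L2 distance and ranking by cosine similarity give the same order.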

Step 5: Search with filter

JSON-RPC:

{
  "jsonrpc": "2.0",
  "id": 7,
  "method": "herodb_lanceSearchText",
  "params": [
    1,
    "documents",
    "programming and software",
    5,
    "topic = 'programming'",
    null
  ]
}

Redis-like:

LANCE.SEARCH documents K 5 QUERY "programming and software" FILTER "topic = 'programming'"

This returns only documents where topic equals 'programming'.

Scenario 2 (Second Example): OpenAI API with an Articles Dataset

Use OpenAI's production embedding service for high-quality semantic search, this time over a small articles dataset.

Setup

1. Set your OpenAI API key:

export OPENAI_API_KEY="sk-your-actual-openai-key-here"

2. Start HeroDB:

./target/release/herodb --dir ./data --admin-secret my-admin-secret --enable-rpc --rpc-port 8080

Complete Workflow

Step 1: Create database

JSON-RPC:

{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "herodb_createDatabase",
  "params": [
    "Lance",
    { "name": "openai-docs", "storage_path": null, "max_size": null, "redis_version": null },
    null
  ]
}

Step 2: Configure OpenAI embeddings

JSON-RPC:

{
  "jsonrpc": "2.0",
  "id": 2,
  "method": "herodb_lanceSetEmbeddingConfig",
  "params": [
    1,
    "articles",
    {
      "provider": "openai",
      "model": "text-embedding-3-small",
      "dim": 1536,
      "endpoint": null,
      "headers": {},
      "timeout_ms": 30000
    }
  ]
}

Redis-like:

SELECT 1
LANCE.EMBEDDING CONFIG SET articles PROVIDER openai MODEL text-embedding-3-small DIM 1536

Step 3: Insert articles

JSON-RPC:

{
  "jsonrpc": "2.0",
  "id": 3,
  "method": "herodb_lanceStoreText",
  "params": [
    1,
    "articles",
    "article-1",
    "Climate change is affecting global weather patterns and ecosystems",
    { "category": "environment", "author": "Jane Smith", "year": "2024" }
  ]
}

{
  "jsonrpc": "2.0",
  "id": 4,
  "method": "herodb_lanceStoreText",
  "params": [
    1,
    "articles",
    "article-2",
    "Quantum computing promises to revolutionize cryptography and drug discovery",
    { "category": "technology", "author": "John Doe", "year": "2024" }
  ]
}

{
  "jsonrpc": "2.0",
  "id": 5,
  "method": "herodb_lanceStoreText",
  "params": [
    1,
    "articles",
    "article-3",
    "Renewable energy sources like solar and wind are becoming more cost-effective",
    { "category": "environment", "author": "Alice Johnson", "year": "2023" }
  ]
}

Step 4: Semantic search

JSON-RPC:

{
  "jsonrpc": "2.0",
  "id": 6,
  "method": "herodb_lanceSearchText",
  "params": [
    1,
    "articles",
    "environmental sustainability and green energy",
    2,
    null,
    ["category", "author"]
  ]
}

Redis-like:

LANCE.SEARCH articles K 2 QUERY "environmental sustainability and green energy" RETURN 2 category author

Expected: Returns article-1 and article-3 (both environment-related).

Step 5: Filtered search

JSON-RPC:

{
  "jsonrpc": "2.0",
  "id": 7,
  "method": "herodb_lanceSearchText",
  "params": [
    1,
    "articles",
    "new technology innovations",
    5,
    "category = 'technology'",
    null
  ]
}

Scenario 3: Deterministic Test Embedder (No Network)

For CI/offline development, use the built-in test embedder that requires no external service.

Configuration

JSON-RPC:

{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "herodb_lanceSetEmbeddingConfig",
  "params": [
    1,
    "testdata",
    {
      "provider": "test",
      "model": "dev",
      "dim": 64,
      "endpoint": null,
      "headers": {},
      "timeout_ms": null
    }
  ]
}

Redis-like:

SELECT 1
LANCE.EMBEDDING CONFIG SET testdata PROVIDER test MODEL dev DIM 64

Usage

Use lanceStoreText and lanceSearchText as in previous scenarios. The embeddings are:

Deterministic (same text → same vector)
Fast (no network)
Not semantic (hash-based, not ML)

Perfect for testing the vector storage/search mechanics without external dependencies.
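To illustrate what "deterministic but not semantic" means, the sketch below derives a fixed-length unit vector from SHA-256 digests of the text. It is an illustrative assumption, not HeroDB's actual test embedder, and `hash_embed` is a hypothetical name.

```python
import hashlib
import math
import struct


def hash_embed(text, dim=64):
    """Map text to a deterministic unit vector of length `dim`."""
    values = []
    counter = 0
    while len(values) < dim:
        digest = hashlib.sha256(f"{counter}:{text}".encode("utf-8")).digest()
        # 32 digest bytes -> eight unsigned 32-bit ints, each mapped to [-1, 1].
        for (word,) in struct.iter_unpack(">I", digest):
            values.append(word / 0xFFFFFFFF * 2.0 - 1.0)
        counter += 1
    values = values[:dim]
    norm = math.sqrt(sum(v * v for v in values))
    return [v / norm for v in values] if norm > 0 else values
```

The same text always yields the same vector, but similar texts such as "boots" and "boot" land at unrelated points, which is why a hash-style embedder suits storage and search plumbing tests rather than relevance tests.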

Advanced: Custom Headers and Timeouts

Example: Local model with custom auth

JSON-RPC:

{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "herodb_lanceSetEmbeddingConfig",
  "params": [
    1,
    "secure-data",
    {
      "provider": "openai",
      "model": "custom-model",
      "dim": 512,
      "endpoint": "http://192.168.1.100:9000/embeddings",
      "headers": {
        "Authorization": "Bearer my-local-token",
        "X-Custom-Header": "value"
      },
      "timeout_ms": 60000
    }
  ]
}

Example: OpenAI with explicit API key (not from env)

JSON-RPC:

{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "herodb_lanceSetEmbeddingConfig",
  "params": [
    1,
    "dataset",
    {
      "provider": "openai",
      "model": "text-embedding-3-small",
      "dim": 1536,
      "endpoint": null,
      "headers": {
        "Authorization": "Bearer sk-your-key-here"
      },
      "timeout_ms": 30000
    }
  ]
}

Troubleshooting

Error: "Embedding config not set for dataset"

Cause: You tried to use lanceStoreText or lanceSearchText without configuring an embedder.

Solution: Run lanceSetEmbeddingConfig first.

Error: "Embedding dimension mismatch: expected X, got Y"

Cause: The embedding service returned vectors of a different size than configured.

Solution:

For OpenAI text-embedding-3-small, use dim: 1536
For your local mock (from this tutorial), use dim: 768
Check your embedding service's actual output dimension

Error: "Missing API key in env 'OPENAI_API_KEY'"

Cause: Using OpenAI provider without setting the API key.

Solution:

Run export OPENAI_API_KEY="sk-..." in the shell before starting HeroDB, or
Pass the key explicitly in headers: "Authorization": "Bearer sk-..."
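The two options can be modeled as a simple resolution rule: use an explicit `Authorization` header if one is configured, otherwise fall back to the environment variable. The sketch below is a client-side model of that behavior, not HeroDB's code; the function name and the precedence of explicit headers over the environment are assumptions.

```python
import os


def resolve_auth_headers(headers=None, env=None):
    """Return request headers with an Authorization entry resolved."""
    env = os.environ if env is None else env
    merged = dict(headers or {})
    # Assumption: an explicitly configured header takes precedence.
    if "Authorization" in merged:
        return merged
    key = env.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("Missing API key in env 'OPENAI_API_KEY'")
    merged["Authorization"] = f"Bearer {key}"
    return merged
```

Running this against your own config before calling `herodb_lanceSetEmbeddingConfig` gives a quick way to see which key, if any, would be used.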

Error: "HTTP request failed" or "Embeddings API error 404"

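Cause: HeroDB could not reach the embedding endpoint ("HTTP request failed"), or it reached a server that does not serve the configured path (404).

Solution:

Verify your local server is running: curl http://127.0.0.1:8081/v1/embeddings (the mock server only accepts POST, so a 405 Method Not Allowed reply still confirms it is up; use the POST request from Option A, step 3 for a full check)
Check the endpoint URL in your config, including the /v1/embeddings path
Ensure your firewall allows the connection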

Error: "ERR DB backend is not Lance"

Cause: Trying to use LANCE.* commands on a non-Lance database.

Solution: Create the database with backend "Lance" (see Step 1).

Error: "write permission denied"

Cause: Database is private and you haven't authenticated.

Solution: Use SELECT <db_id> KEY <access-key> or make the database public via RPC.

Complete Example Script (Bash + curl)

Save as test_embeddings.sh:

#!/bin/bash

RPC_URL="http://localhost:8080"

# 1. Create Lance database
curl -X POST $RPC_URL -H "Content-Type: application/json" -d '{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "herodb_createDatabase",
  "params": ["Lance", {"name": "test-vectors", "storage_path": null, "max_size": null, "redis_version": null}, null]
}'

echo -e "\n"

# 2. Configure local embedder
curl -X POST $RPC_URL -H "Content-Type: application/json" -d '{
  "jsonrpc": "2.0",
  "id": 2,
  "method": "herodb_lanceSetEmbeddingConfig",
  "params": [1, "products", {
    "provider": "openai",
    "model": "mock",
    "dim": 768,
    "endpoint": "http://127.0.0.1:8081/v1/embeddings",
    "headers": {"Authorization": "Bearer dummy"},
    "timeout_ms": 30000
  }]
}'

echo -e "\n"

# 3. Insert data
curl -X POST $RPC_URL -H "Content-Type: application/json" -d '{
  "jsonrpc": "2.0",
  "id": 3,
  "method": "herodb_lanceStoreText",
  "params": [1, "products", "item-1", "Hiking boots", {"brand": "TrailMax"}]
}'

echo -e "\n"

# 4. Search
curl -X POST $RPC_URL -H "Content-Type: application/json" -d '{
  "jsonrpc": "2.0",
  "id": 4,
  "method": "herodb_lanceSearchText",
  "params": [1, "products", "outdoor footwear", 5, null, null]
}'

echo -e "\n"

Run:

chmod +x test_embeddings.sh
./test_embeddings.sh
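The same workflow can be written in Python, which makes it easy to reuse the database ID returned by `herodb_createDatabase` instead of hard-coding `1`. This is a sketch using only the standard library; `http_send` and `run_workflow` are hypothetical names, and the transport is injectable so the flow can be exercised without a running server.

```python
import json
import urllib.request

RPC_URL = "http://localhost:8080"


def http_send(payload, url=RPC_URL, timeout=30):
    """POST one JSON-RPC payload to HeroDB and decode the JSON reply."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def run_workflow(send=http_send):
    """Create a database, configure the mock embedder, store one item, search."""
    def call(request_id, method, params):
        reply = send({"jsonrpc": "2.0", "id": request_id,
                      "method": method, "params": params})
        if "error" in reply:
            raise RuntimeError(f"{method} failed: {reply['error']}")
        return reply["result"]

    # 1. Create the Lance database and keep the ID the server returns.
    db_id = call(1, "herodb_createDatabase", [
        "Lance",
        {"name": "test-vectors", "storage_path": None,
         "max_size": None, "redis_version": None},
        None,
    ])
    # 2. Point the dataset at the local mock embedder.
    call(2, "herodb_lanceSetEmbeddingConfig", [db_id, "products", {
        "provider": "openai", "model": "mock", "dim": 768,
        "endpoint": "http://127.0.0.1:8081/v1/embeddings",
        "headers": {"Authorization": "Bearer dummy"},
        "timeout_ms": 30000,
    }])
    # 3. Insert one item, then 4. search.
    call(3, "herodb_lanceStoreText",
         [db_id, "products", "item-1", "Hiking boots", {"brand": "TrailMax"}])
    return call(4, "herodb_lanceSearchText",
                [db_id, "products", "outdoor footwear", 5, None, None])
```

With HeroDB and the mock embedder running, `print(run_workflow())` performs the four steps of the Bash script and stops at the first RPC error.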

Summary

| Provider | Use Case | Endpoint | API Key |
|---|---|---|---|
| `openai` | Production semantic search | Default (OpenAI) or custom URL | OPENAI_API_KEY env or headers |
| `openai` | Local self-hosted gateway | http://127.0.0.1:8081/... | Optional (depends on your service) |
| `test` | CI/offline development | N/A (local hash) | None |
| `image_test` | Image testing | N/A (local hash) | None |

Notes:

Set provider to "openai" for any service that exposes the OpenAI-compatible embeddings API shape, whether it runs in the cloud or locally.
Use endpoint to point to your local service
Use headers for custom authentication
dim must match your embedding service's output dimension
Once configured, lanceStoreText and lanceSearchText handle embedding automatically
