
Based on your request, here is a copy of the provided code snippets, filtered to those relevant to the Rust ecosystem: snippets written in Rust, shell commands for building or testing Rust code (e.g., using cargo or maturin), and configuration for native debugging tools such as lldb.

======================== CODE SNIPPETS

TITLE: Perform Python development installation DESCRIPTION: These commands navigate into the python directory and perform a development installation of the Lance Python bindings. This allows developers to import and test changes to the Python wrapper directly.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/community/contributing.md#_snippet_1

LANGUAGE: bash CODE:

cd python
maturin develop

TITLE: Install Python bindings build tool DESCRIPTION: This command installs maturin, a tool essential for building Python packages that integrate with Rust code. It's a prerequisite for setting up the Python development environment for Lance.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/community/contributing.md#_snippet_0

LANGUAGE: bash CODE:

pip install maturin

TITLE: Install Linux Perf Tools and Configure Kernel Parameters DESCRIPTION: Installs necessary Linux performance tools (perf) on Ubuntu systems and configures the perf_event_paranoid kernel parameter. This setup is crucial for allowing non-root users to collect performance data using tools like perf and flamegraph.

SOURCE: https://github.com/lancedb/lance/blob/wiki/Debug.md#_snippet_4

LANGUAGE: sh CODE:

sudo apt install linux-tools-common linux-tools-generic linux-tools-`uname -r`
sudo sh -c "echo -1 >  /proc/sys/kernel/perf_event_paranoid"

TITLE: Run Rust unit tests DESCRIPTION: This command executes the unit tests for the Rust core format. Running these tests verifies the correctness of the Rust implementation.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/community/contributing.md#_snippet_6

LANGUAGE: bash CODE:

cargo test

TITLE: Profile a LanceDB benchmark using flamegraph DESCRIPTION: Generates a flamegraph for a specific benchmark using cargo-flamegraph, aiding in performance analysis. It's recommended to run benchmarks once beforehand to avoid setup time being captured in the profile.

SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_14

LANGUAGE: shell CODE:

flamegraph -F 100 --no-inline -- $(which python) \
    -m pytest python/benchmarks \
    --benchmark-min-time=2 \
    -k test_ivf_pq_index_search

TITLE: Install Flamegraph Tool DESCRIPTION: Installs the flamegraph profiling tool using Cargo, Rust's package manager. This tool is essential for visualizing CPU usage and call stacks as flame graphs for performance analysis.

SOURCE: https://github.com/lancedb/lance/blob/wiki/Debug.md#_snippet_3

LANGUAGE: sh CODE:

cargo install flamegraph

TITLE: Install Lance Build Dependencies on Ubuntu DESCRIPTION: This command installs necessary system-level dependencies for building Lance on Ubuntu 22.04, including protobuf, SSL development libraries, and general build tools.

SOURCE: https://github.com/lancedb/lance/blob/main/wiki/How-to-Build.md#_snippet_0

LANGUAGE: bash CODE:

sudo apt install protobuf-compiler libssl-dev build-essential pkg-config gfortran

TITLE: Build Rust core format (release) DESCRIPTION: This command compiles the Rust core format in release mode. The release build is optimized for performance and is suitable for production deployments or benchmarking.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/community/contributing.md#_snippet_5

LANGUAGE: bash CODE:

cargo build -r

TITLE: Debug Python Script with LLDB DESCRIPTION: Demonstrates how to start an LLDB debugging session for a Python script. It involves launching LLDB with the Python interpreter from a virtual environment and then running the target script within the LLDB prompt.

SOURCE: https://github.com/lancedb/lance/blob/wiki/Debug.md#_snippet_2

LANGUAGE: sh CODE:

$ lldb ./venv/bin/python
(lldb) r script.py

TITLE: Install Lance Build Dependencies on Mac DESCRIPTION: This command installs the protobuf compiler using Homebrew, a required dependency for building Lance on macOS.

SOURCE: https://github.com/lancedb/lance/blob/main/wiki/How-to-Build.md#_snippet_1

LANGUAGE: bash CODE:

brew install protobuf

TITLE: Configure LLDB Initialization Settings DESCRIPTION: Sets up basic LLDB initialization settings in the ~/.lldbinit file. This includes configuring the number of source code lines to display before and after a stop, and enabling the loading of .lldbinit files from the current working directory.

SOURCE: https://github.com/lancedb/lance/blob/wiki/Debug.md#_snippet_0

LANGUAGE: lldb CODE:

# ~/.lldbinit
settings set stop-line-count-before 15
settings set stop-line-count-after 15
settings set target.load-cwd-lldbinit true

TITLE: Complete Lance Dataset Write and Read Example in Rust DESCRIPTION: This Rust main function provides a complete example demonstrating the usage of write_dataset and read_dataset functions. It sets up the necessary arrow and lance imports, defines a temporary data path, and orchestrates the writing and subsequent reading of a Lance dataset.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/rust/write_read_dataset.md#_snippet_2

LANGUAGE: Rust CODE:

use arrow::array::UInt32Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::{RecordBatch, RecordBatchIterator};
use futures::StreamExt;
use lance::dataset::{WriteMode, WriteParams};
use lance::Dataset;
use std::sync::Arc;

#[tokio::main]
async fn main() {
    let data_path: &str = "./temp_data.lance";

    write_dataset(data_path).await;
    read_dataset(data_path).await;
}

TITLE: Rust: Main Workflow for WikiText to LanceDB Ingestion DESCRIPTION: This comprehensive example demonstrates the full data ingestion pipeline in Rust. It initializes a Tokio runtime, loads a tokenizer, sets up the Hugging Face API to download WikiText Parquet files, processes them into a WikiTextBatchReader, and finally writes the data to a Lance dataset. It also includes verification of the created dataset.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/rust/llm_dataset_creation.md#_snippet_2

LANGUAGE: Rust CODE:

fn main() -> Result<(), Box<dyn Error + Send + Sync>> {
    let rt = tokio::runtime::Runtime::new()?;
    rt.block_on(async {
        // Load tokenizer
        let tokenizer = load_tokenizer("gpt2")?;

        // Set up Hugging Face API
        // Download from https://huggingface.co/datasets/Salesforce/wikitext/tree/main/wikitext-103-raw-v1
        let api = Api::new()?;
        let repo = api.repo(Repo::with_revision(
            "Salesforce/wikitext".into(),
            RepoType::Dataset,
            "main".into(),
        ));

        // Define the parquet files we want to download
        let train_files = vec![
            "wikitext-103-raw-v1/train-00000-of-00002.parquet",
            "wikitext-103-raw-v1/train-00001-of-00002.parquet",
        ];

        let mut parquet_readers = Vec::new();
        for file in &train_files {
            println!("Downloading file: {}", file);
            let file_path = repo.get(file)?;
            let data = std::fs::read(file_path)?;

            // Create a temporary file in the system temp directory and write the downloaded data to it
            let mut temp_file = NamedTempFile::new()?;
            temp_file.write_all(&data)?;

            // Create the parquet reader builder with a larger batch size
            let builder = ParquetRecordBatchReaderBuilder::try_new(temp_file.into_file())?
                .with_batch_size(8192); // Increase batch size for better performance
            parquet_readers.push(builder);
        }

        if parquet_readers.is_empty() {
            println!("No parquet files found to process.");
            return Ok(());
        }

        // Create batch reader
        let num_samples: u64 = 500_000;
        let batch_reader = WikiTextBatchReader::new(parquet_readers, tokenizer, Some(num_samples))?;

        // Save as Lance dataset
        println!("Writing to Lance dataset...");
        let lance_dataset_path = "rust_wikitext_lance_dataset.lance";

        let write_params = WriteParams::default();
        lance::Dataset::write(batch_reader, lance_dataset_path, Some(write_params)).await?;

        // Verify the dataset
        let ds = lance::Dataset::open(lance_dataset_path).await?;
        let scanner = ds.scan();
        let mut stream = scanner.try_into_stream().await?;

        let mut total_rows = 0;
        while let Some(batch_result) = stream.next().await {
            let batch = batch_result?;
            total_rows += batch.num_rows();
        }

        println!(
            "Lance dataset created successfully with {} rows",
            total_rows
        );
        println!("Dataset location: {}", lance_dataset_path);

        Ok(())
    })
}

TITLE: Build and Test Pylance Python Package DESCRIPTION: These commands set up a Python virtual environment, install maturin for Rust-Python binding, build the Pylance package in debug mode, and then run its associated tests.

SOURCE: https://github.com/lancedb/lance/blob/main/wiki/How-to-Build.md#_snippet_3

LANGUAGE: bash CODE:

cd python
python3 -m venv venv
source venv/bin/activate

pip install maturin

# Build debug build
maturin develop --extras tests

# Run pytest
pytest python/tests/

TITLE: Install Lance using Cargo DESCRIPTION: Installs the Lance Rust library as a command-line tool using the Cargo package manager.

SOURCE: https://github.com/lancedb/lance/blob/main/rust/lance/README.md#_snippet_0

LANGUAGE: shell CODE:

cargo install lance

TITLE: Build pylance in release mode for benchmarks DESCRIPTION: Builds the pylance module in release mode with debug symbols, enabling benchmark execution and profiling. It includes benchmark-specific extras and features for data generation.

SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_10

LANGUAGE: shell CODE:

maturin develop --profile release-with-debug --extras benchmarks --features datagen

TITLE: Query Lance Dataset with Simple SQL in Rust DataFusion DESCRIPTION: This Rust example demonstrates how to register a Lance dataset as a table in DataFusion using LanceTableProvider and execute a simple SQL SELECT query to retrieve the first 10 rows. It shows the basic setup for integrating Lance with DataFusion.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/integrations/datafusion.md#_snippet_0

LANGUAGE: rust CODE:

use datafusion::prelude::SessionContext;
use crate::datafusion::LanceTableProvider;

let ctx = SessionContext::new();

ctx.register_table("dataset",
    Arc::new(LanceTableProvider::new(
    Arc::new(dataset.clone()),
    /* with_row_id */ false,
    /* with_row_addr */ false,
    )))?;

let df = ctx.sql("SELECT * FROM dataset LIMIT 10").await?;
let result = df.collect().await?;
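
A hedged follow-up (not part of the original snippet): the collected batches can be rendered with Arrow's pretty printer, which DataFusion re-exports under datafusion::arrow.

use datafusion::arrow::util::pretty::pretty_format_batches;

// Render all collected RecordBatches as an ASCII table.
println!("{}", pretty_format_batches(&result)?);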

TITLE: Run LanceDB code formatters DESCRIPTION: Applies code formatting rules to the entire project. Specific commands like make format-python or cargo fmt can be used for language-specific formatting.

SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_4

LANGUAGE: shell CODE:

make format

TITLE: Build and Search HNSW Index for Vector Similarity in Rust DESCRIPTION: This Rust code provides a complete example for vector similarity search. It defines a ground_truth function for L2 distance calculation, create_test_vector_dataset to generate synthetic fixed-size list vectors, and a main function that orchestrates the process. The main function generates or loads a dataset, builds an HNSW index using lance_index::vector::hnsw, and then performs vector searches, measuring construction and search times, and calculating recall against ground truth.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/rust/hnsw.md#_snippet_0

LANGUAGE: Rust CODE:

use std::collections::HashSet;
use std::sync::Arc;

use arrow::array::{types::Float32Type, Array, FixedSizeListArray};
use arrow::array::{AsArray, FixedSizeListBuilder, Float32Builder};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use arrow::record_batch::RecordBatchIterator;
use arrow_select::concat::concat;
use futures::stream::StreamExt;
use lance::Dataset;
use lance_index::vector::v3::subindex::IvfSubIndex;
use lance_index::vector::{
    flat::storage::FlatFloatStorage,
    hnsw::{builder::HnswBuildParams, HNSW},
};
use lance_linalg::distance::DistanceType;

fn ground_truth(fsl: &FixedSizeListArray, query: &[f32], k: usize) -> HashSet<u32> {
    let mut dists = vec![];
    for i in 0..fsl.len() {
        let dist = lance_linalg::distance::l2_distance(
            query,
            fsl.value(i).as_primitive::<Float32Type>().values(),
        );
        dists.push((dist, i as u32));
    }
    dists.sort_by(|a, b| a.0.partial_cmp(&b.0).unwrap());
    dists.truncate(k);
    dists.into_iter().map(|(_, i)| i).collect()
}

pub async fn create_test_vector_dataset(output: &str, num_rows: usize, dim: i32) {
    let schema = Arc::new(Schema::new(vec![Field::new(
        "vector",
        DataType::FixedSizeList(Arc::new(Field::new("item", DataType::Float32, true)), dim),
        false,
    )]));

    let mut batches = Vec::new();

    // Create a few batches
    for _ in 0..2 {
        let v_builder = Float32Builder::new();
        let mut list_builder = FixedSizeListBuilder::new(v_builder, dim);

        for _ in 0..num_rows {
            for _ in 0..dim {
                list_builder.values().append_value(rand::random::<f32>());
            }
            list_builder.append(true);
        }
        let array = Arc::new(list_builder.finish());
        let batch = RecordBatch::try_new(schema.clone(), vec![array]).unwrap();
        batches.push(batch);
    }
    let batch_reader = RecordBatchIterator::new(batches.into_iter().map(Ok), schema.clone());
    println!("Writing dataset to {}", output);
    Dataset::write(batch_reader, output, None).await.unwrap();
}

#[tokio::main]
async fn main() {
    let uri: Option<String> = None; // None means generate test data
    let column = "vector";
    let ef = 100;
    let max_edges = 30;
    let max_level = 7;

    // 1. Generate a synthetic test data of specified dimensions
    let dataset = if uri.is_none() {
        println!("No uri is provided, generating test dataset...");
        let output = "test_vectors.lance";
        create_test_vector_dataset(output, 1000, 64).await;
        Dataset::open(output).await.expect("Failed to open dataset")
    } else {
        Dataset::open(uri.as_ref().unwrap())
            .await
            .expect("Failed to open dataset")
    };

    println!("Dataset schema: {:#?}", dataset.schema());
    let batches = dataset
        .scan()
        .project(&[column])
        .unwrap()
        .try_into_stream()
        .await
        .unwrap()
        .then(|batch| async move { batch.unwrap().column_by_name(column).unwrap().clone() })
        .collect::<Vec<_>>()
        .await;
    let arrs = batches.iter().map(|b| b.as_ref()).collect::<Vec<_>>();
    let fsl = concat(&arrs).unwrap().as_fixed_size_list().clone();
    println!("Loaded {:?} batches", fsl.len());

    let vector_store = Arc::new(FlatFloatStorage::new(fsl.clone(), DistanceType::L2));

    let q = fsl.value(0);
    let k = 10;
    let gt = ground_truth(&fsl, q.as_primitive::<Float32Type>().values(), k);

    for ef_construction in [15, 30, 50] {
        let now = std::time::Instant::now();
        // 2. Build a hierarchical graph structure for efficient vector search using Lance API
        let hnsw = HNSW::index_vectors(
            vector_store.as_ref(),
            HnswBuildParams::default()
                .max_level(max_level)
                .num_edges(max_edges)
                .ef_construction(ef_construction),
        )
        .unwrap();
        let construct_time = now.elapsed().as_secs_f32();
        let now = std::time::Instant::now();
        // 3. Perform vector search with different parameters and compute the ground truth using L2 distance search
        let results: HashSet<u32> = hnsw
            .search_basic(q.clone(), k, ef, None, vector_store.as_ref())
            .unwrap()
            .iter()
            .map(|node| node.id)
            .collect();
        let search_time = now.elapsed().as_micros();
        println!(
            "level={}, ef_construct={}, ef={} recall={}: construct={:.3}s search={:.3} us",
            max_level,
            ef_construction,
            ef,
            results.intersection(&gt).count() as f32 / k as f32,
            construct_time,
            search_time
        );
    }
}

TITLE: Compare LanceDB benchmarks against previous version DESCRIPTION: Provides a sequence of commands to compare the performance of the current version against the main branch. This involves saving a baseline from main and then comparing the current branch's performance against it.

SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_15

LANGUAGE: shell CODE:

CURRENT_BRANCH=$(git branch --show-current)

LANGUAGE: shell CODE:

git checkout main

LANGUAGE: shell CODE:

maturin develop --profile release-with-debug  --features datagen

LANGUAGE: shell CODE:

pytest --benchmark-save=baseline python/benchmarks -m "not slow"

LANGUAGE: shell CODE:

COMPARE_ID=$(ls .benchmarks/*/ | tail -1 | cut -c1-4)

LANGUAGE: shell CODE:

git checkout $CURRENT_BRANCH

LANGUAGE: shell CODE:

maturin develop --profile release-with-debug  --features datagen

LANGUAGE: shell CODE:

pytest --benchmark-compare=$COMPARE_ID python/benchmarks -m "not slow"

TITLE: Build Rust core format (debug) DESCRIPTION: This command compiles the Rust core format in debug mode. The debug build includes debugging information and is suitable for development and testing, though it is not optimized for performance.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/community/contributing.md#_snippet_4

LANGUAGE: bash CODE:

cargo build

TITLE: Format and lint Rust code DESCRIPTION: These commands are used to automatically format Rust code according to community standards (cargo fmt) and to perform static analysis for potential issues (cargo clippy). This ensures code quality and consistency.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/community/contributing.md#_snippet_3

LANGUAGE: bash CODE:

cargo fmt --all
cargo clippy --all-features --tests --benches

TITLE: Run LanceDB code linters DESCRIPTION: Executes code linters to check for style violations and potential issues. Language-specific linting can be performed with make lint-python or make lint-rust.

SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_5

LANGUAGE: shell CODE:

make lint

TITLE: Clean LanceDB build artifacts DESCRIPTION: Removes all generated build artifacts and temporary files from the project directory, useful for a clean rebuild.

SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_9

LANGUAGE: shell CODE:

make clean

TITLE: Rust: Load Tokenizer from Hugging Face Hub DESCRIPTION: This function provides a utility to load a tokenizer from the Hugging Face Hub. It takes a model name, creates an API client, retrieves the tokenizer file from the specified repository, and constructs a Tokenizer object from it. This is a common pattern for integrating Hugging Face models.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/rust/llm_dataset_creation.md#_snippet_3

LANGUAGE: Rust CODE:

fn load_tokenizer(model_name: &str) -> Result<Tokenizer, Box<dyn Error + Send + Sync>> {
    let api = Api::new()?;
    let repo = api.repo(Repo::with_revision(
        model_name.into(),
        RepoType::Model,
        "main".into(),
    ));

    let tokenizer_path = repo.get("tokenizer.json")?;
    let tokenizer = Tokenizer::from_file(tokenizer_path)?;

    Ok(tokenizer)
}

TITLE: Build MacOS x86_64 Wheels DESCRIPTION: This command builds release-mode wheels specifically for x86_64 MacOS. It uses maturin to compile the project for the x86_64-apple-darwin target, storing the resulting wheels in the 'wheels' directory.

SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_26

LANGUAGE: Shell CODE:

maturin build --release \
    --target x86_64-apple-darwin \
    --out wheels

TITLE: Build and Test Lance Rust Package DESCRIPTION: These commands clone the Lance repository, navigate to the Rust directory, and then build, test, and benchmark the core Rust components of Lance.

SOURCE: https://github.com/lancedb/lance/blob/main/wiki/How-to-Build.md#_snippet_2

LANGUAGE: bash CODE:

git clone https://github.com/lancedb/lance.git

# Build rust package
cd rust
cargo build

# Run test
cargo test

# Run benchmarks
cargo bench

TITLE: Build LanceDB in development mode DESCRIPTION: Builds the Rust native module in place using maturin. This command needs to be re-run whenever Rust code changes, but is not required for Python code modifications.

SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_0

LANGUAGE: shell CODE:

maturin develop

TITLE: Download Lindera Language Model DESCRIPTION: Command-line instruction to download a specific Lindera language model (e.g., ipadic, ko-dic, unidic) for LanceDB. Note that lindera-cli must be installed beforehand as Lindera models require compilation.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/tokenizer.md#_snippet_4

LANGUAGE: bash CODE:

python -m lance.download lindera -l [ipadic|ko-dic|unidic]

TITLE: Decorate Rust Unit Test for Tracing DESCRIPTION: To enable tracing for a Rust unit test, decorate it with the #[lance_test_macros::test] attribute. This macro wraps any existing test attributes, allowing tracing information to be collected during test execution.

SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_16

LANGUAGE: Rust CODE:

#[lance_test_macros::test(tokio::test)]
async fn test() {
    ...
}

TITLE: Add Rust Toolchain Targets for Cross-Compilation DESCRIPTION: To build manylinux wheels for different Linux architectures, you must first add the corresponding Rust toolchain targets. These commands add the x86_64 and aarch64 GNU targets, enabling cross-compilation.

SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_22

LANGUAGE: Shell CODE:

rustup target add x86_64-unknown-linux-gnu
rustup target add aarch64-unknown-linux-gnu

TITLE: Build MacOS ARM64 Wheels DESCRIPTION: This command builds release-mode wheels specifically for ARM64 (aarch64) MacOS. It uses maturin to compile the project for the aarch64-apple-darwin target, storing the resulting wheels in the 'wheels' directory.

SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_25

LANGUAGE: Shell CODE:

maturin build --release \
    --target aarch64-apple-darwin \
    --out wheels

TITLE: Rust: WikiTextBatchReader Next Batch Logic DESCRIPTION: This snippet shows the core logic for the next method of the WikiTextBatchReader. It attempts to build and retrieve the next Parquet reader from a list of available readers. If a reader is successfully built, it's used; otherwise, it handles errors or indicates that no more readers are available.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/rust/llm_dataset_creation.md#_snippet_1

LANGUAGE: Rust CODE:

// Continuation of Iterator::next for WikiTextBatchReader (the full implementation
// appears in the "Rust Implementation of WikiTextBatchReader" snippet below):
// build the next Parquet reader if one remains, surfacing builder failures as
// ArrowError::ExternalError.
                if let Some(builder) = self.parquet_readers[self.current_reader_idx].take() {
                    match builder.build() {
                        Ok(reader) => {
                            self.current_reader = Some(Box::new(reader));
                            self.current_reader_idx += 1;
                            continue;
                        }
                        Err(e) => {
                            return Some(Err(arrow::error::ArrowError::ExternalError(Box::new(e))))
                        }
                    }
                }
            }

            // No more readers available
            return None;
        }

TITLE: Run Rust Unit Test with Tracing Verbosity DESCRIPTION: Execute a Rust unit test with tracing enabled by setting the LANCE_TESTING environment variable to a desired verbosity level (e.g., 'debug', 'info'). This command will generate a JSON trace file in your working directory, which can be viewed in Chrome or Perfetto.

SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_17

LANGUAGE: Bash CODE:

LANCE_TESTING=debug cargo test dataset::tests::test_create_dataset

TITLE: Build Linux x86_64 Manylinux Wheels DESCRIPTION: This command builds release-mode manylinux wheels for x86_64 Linux. It utilizes maturin with zig for cross-compilation, targeting manylinux2014 compatibility, and outputs the generated wheels to the 'wheels' directory.

SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_23

LANGUAGE: Shell CODE:

maturin build --release --zig \
    --target x86_64-unknown-linux-gnu \
    --compatibility manylinux2014 \
    --out wheels

TITLE: Build Linux ARM64 Manylinux Wheels DESCRIPTION: This command builds release-mode manylinux wheels for ARM64 (aarch64) Linux. It uses maturin with zig for cross-compilation, targeting manylinux2014 compatibility, and places the output wheels in the 'wheels' directory.

SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_24

LANGUAGE: Shell CODE:

maturin build --release --zig \
    --target aarch64-unknown-linux-gnu \
    --compatibility manylinux2014 \
    --out wheels

TITLE: Join Multiple Lance Datasets with SQL in Rust DataFusion DESCRIPTION: This Rust example illustrates how to register multiple Lance datasets (e.g., 'orders' and 'customers') as separate tables in DataFusion. It then performs a SQL JOIN operation between these tables to combine data based on a common key, demonstrating more complex query capabilities.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/integrations/datafusion.md#_snippet_1

LANGUAGE: rust CODE:

use datafusion::prelude::SessionContext;
use crate::datafusion::LanceTableProvider;

let ctx = SessionContext::new();

ctx.register_table("orders",
    Arc::new(LanceTableProvider::new(
    Arc::new(orders_dataset.clone()),
    /* with_row_id */ false,
    /* with_row_addr */ false,
    )))?;

ctx.register_table("customers",
    Arc::new(LanceTableProvider::new(
    Arc::new(customers_dataset.clone()),
    /* with_row_id */ false,
    /* with_row_addr */ false,
    )))?;

let df = ctx.sql("
    SELECT o.order_id, o.amount, c.customer_name 
    FROM orders o 
    JOIN customers c ON o.customer_id = c.customer_id
    LIMIT 10
").await?;

let result = df.collect().await?;

TITLE: Generate Flame Graph from Process ID DESCRIPTION: Generates a flame graph for a running process using its Process ID (PID). This command is used to capture and visualize CPU profiles, helping to identify performance bottlenecks in an application.

SOURCE: https://github.com/lancedb/lance/blob/wiki/Debug.md#_snippet_5

LANGUAGE: sh CODE:

flamegraph -p <PID>

TITLE: Clone LanceDB GitHub Repository DESCRIPTION: Instructions to clone the LanceDB project repository from GitHub to a local machine. This is the first step for setting up the development environment.

SOURCE: https://github.com/lancedb/lance/blob/main/java/README.md#_snippet_11

LANGUAGE: shell CODE:

git clone https://github.com/lancedb/lance.git

TITLE: Rust Implementation of WikiTextBatchReader DESCRIPTION: This Rust code defines WikiTextBatchReader, a custom implementation of arrow::record_batch::RecordBatchReader. It's designed to read text data from Parquet files, tokenize it using a Tokenizer from the tokenizers crate, and transform it into Arrow RecordBatches. The process_batch method handles tokenization, limits the number of samples, and shuffles the tokenized IDs before creating the final RecordBatch.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/rust/llm_dataset_creation.md#_snippet_0

LANGUAGE: rust CODE:

use arrow::array::{Array, Int64Builder, ListBuilder, UInt32Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use arrow::record_batch::RecordBatchReader;
use futures::StreamExt;
use hf_hub::{api::sync::Api, Repo, RepoType};
use lance::dataset::WriteParams;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use rand::seq::SliceRandom;
use rand::SeedableRng;
use std::error::Error;
use std::fs::File;
use std::io::Write;
use std::sync::Arc;
use tempfile::NamedTempFile;
use tokenizers::Tokenizer;

// Implement a custom stream batch reader
struct WikiTextBatchReader {
    schema: Arc<Schema>,
    parquet_readers: Vec<Option<ParquetRecordBatchReaderBuilder<File>>>,
    current_reader_idx: usize,
    current_reader: Option<Box<dyn RecordBatchReader + Send>>,
    tokenizer: Tokenizer,
    num_samples: u64,
    cur_samples_cnt: u64,
}

impl WikiTextBatchReader {
    fn new(
        parquet_readers: Vec<ParquetRecordBatchReaderBuilder<File>>,
        tokenizer: Tokenizer,
        num_samples: Option<u64>,
    ) -> Result<Self, Box<dyn Error + Send + Sync>> {
        let schema = Arc::new(Schema::new(vec![Field::new(
            "input_ids",
            DataType::List(Arc::new(Field::new("item", DataType::Int64, true))),
            false,
        )]));

        Ok(Self {
            schema,
            parquet_readers: parquet_readers.into_iter().map(Some).collect(),
            current_reader_idx: 0,
            current_reader: None,
            tokenizer,
            num_samples: num_samples.unwrap_or(100_000),
            cur_samples_cnt: 0,
        })
    }

    fn process_batch(
        &mut self,
        input_batch: &RecordBatch,
    ) -> Result<RecordBatch, arrow::error::ArrowError> {
        let num_rows = input_batch.num_rows();
        let mut token_builder = ListBuilder::new(Int64Builder::with_capacity(num_rows * 1024)); // Pre-allocate space
        let mut should_break = false;

        let column = input_batch.column_by_name("text").unwrap();
        let string_array = column
            .as_any()
            .downcast_ref::<arrow::array::StringArray>()
            .unwrap();
        for i in 0..num_rows {
            if self.cur_samples_cnt >= self.num_samples {
                should_break = true;
                break;
            }
            if !Array::is_null(string_array, i) {
                let text = string_array.value(i);
                // Split paragraph into lines
                for line in text.split('\n') {
                    if let Ok(encoding) = self.tokenizer.encode(line, true) {
                        let tb_values = token_builder.values();
                        for &id in encoding.get_ids() {
                            tb_values.append_value(id as i64);
                        }
                        token_builder.append(true);
                        self.cur_samples_cnt += 1;
                        if self.cur_samples_cnt % 5000 == 0 {
                            println!("Processed {} rows", self.cur_samples_cnt);
                        }
                        if self.cur_samples_cnt >= self.num_samples {
                            should_break = true;
                            break;
                        }
                    }
                }
            }
        }

        // Create array and shuffle it
        let input_ids_array = token_builder.finish();

        // Create shuffled array by randomly sampling indices
        let mut rng = rand::rngs::StdRng::seed_from_u64(1337);
        let len = input_ids_array.len();
        let mut indices: Vec<u32> = (0..len as u32).collect();
        indices.shuffle(&mut rng);

        // Take values in shuffled order
        let indices_array = UInt32Array::from(indices);
        let shuffled = arrow::compute::take(&input_ids_array, &indices_array, None)?;

        let batch = RecordBatch::try_new(self.schema.clone(), vec![Arc::new(shuffled)]);
        if should_break {
            println!("Stop at {} rows", self.cur_samples_cnt);
            self.parquet_readers.clear();
            self.current_reader = None;
        }

        batch
    }
}

impl RecordBatchReader for WikiTextBatchReader {
    fn schema(&self) -> Arc<Schema> {
        self.schema.clone()
    }
}

impl Iterator for WikiTextBatchReader {
    type Item = Result<RecordBatch, arrow::error::ArrowError>;
    fn next(&mut self) -> Option<Self::Item> {
        loop {
            // If we have a current reader, try to get next batch
            if let Some(reader) = &mut self.current_reader {
                if let Some(batch_result) = reader.next() {
                    return Some(batch_result.and_then(|batch| self.process_batch(&batch)));
                }
            }

            // If no current reader or current reader is exhausted, try to get next reader
            if self.current_reader_idx < self.parquet_readers.len() {
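                // ... continued in the "Rust: WikiTextBatchReader Next Batch Logic"
                // snippet above: the next Parquet reader is built there, or None is
                // returned once all readers are exhausted.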

TITLE: Set DYLD_LIBRARY_PATH for Lance Python Debugging in LLDB DESCRIPTION: Configures the DYLD_LIBRARY_PATH environment variable specifically for debugging Lance Python projects within LLDB. This ensures that the dynamic linker can find necessary shared libraries located in the third-party distribution directory.

SOURCE: https://github.com/lancedb/lance/blob/wiki/Debug.md#_snippet_1

LANGUAGE: lldb CODE:

# /path/to/lance/python/.lldbinit
env DYLD_LIBRARY_PATH=/path/to/thirdparty/dist/lib:${DYLD_LIBRARY_PATH}

TITLE: Download and extract MeCab Ipadic model DESCRIPTION: This snippet downloads the gzipped tarball of the MeCab Ipadic model from GitHub and then extracts its contents using tar. This is the first step in preparing the dictionary for building.

SOURCE: https://github.com/lancedb/lance/blob/main/python/python/tests/models/lindera/README.md#_snippet_0

LANGUAGE: bash CODE:

curl -L -o mecab-ipadic-2.7.0-20070801.tar.gz "https://github.com/lindera-morphology/mecab-ipadic/archive/refs/tags/2.7.0-20070801.tar.gz"
tar xvf mecab-ipadic-2.7.0-20070801.tar.gz

TITLE: Build user dictionary with Lindera DESCRIPTION: This command demonstrates how to build a custom user dictionary using 'lindera build'. It takes a CSV file as input and creates a new user dictionary, which can be used to extend the base language model.

SOURCE: https://github.com/lancedb/lance/blob/main/python/python/tests/models/lindera/README.md#_snippet_2

LANGUAGE: bash CODE:

lindera build --build-user-dictionary --dictionary-kind=ipadic user_dict/userdict.csv user_dict2

TITLE: Download Jieba Language Model DESCRIPTION: Command-line instruction to download the Jieba language model for use with LanceDB. The model will be automatically stored in the default Jieba model directory within the configured language model home.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/tokenizer.md#_snippet_1

LANGUAGE: bash CODE:

python -m lance.download jieba

TITLE: Read and Inspect Lance Dataset in Rust DESCRIPTION: This Rust function read_dataset shows how to open an existing Lance dataset from a given path. It uses a scanner to create a batch_stream and then iterates through each RecordBatch, printing its number of rows, columns, schema, and the entire batch content.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/rust/write_read_dataset.md#_snippet_1

LANGUAGE: Rust CODE:

// Reads the dataset from the given path and, for each record batch, prints its size, schema, and contents
async fn read_dataset(data_path: &str) {
    let dataset = Dataset::open(data_path).await.unwrap();
    let scanner = dataset.scan();

    let mut batch_stream = scanner.try_into_stream().await.unwrap().map(|b| b.unwrap());

    while let Some(batch) = batch_stream.next().await {
        println!("Batch size: {}, {}", batch.num_rows(), batch.num_columns()); // print size of batch
        println!("Schema: {:?}", batch.schema()); // print schema of recordbatch

        println!("Batch: {:?}", batch); // print the entire recordbatch (schema and data)
    }
} // End read dataset

TITLE: Create a Lance Dataset from Arrow RecordBatches in Rust DESCRIPTION: Demonstrates how to write a collection of Arrow RecordBatches and an Arrow Schema into a new Lance Dataset. It uses default write parameters and an iterator for the batches.

SOURCE: https://github.com/lancedb/lance/blob/main/rust/lance/README.md#_snippet_1

LANGUAGE: rust CODE:

use lance::{dataset::WriteParams, Dataset};
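// Assumed to be in scope (not shown in the README snippet):
// `batches: Vec<RecordBatch>`, `schema: SchemaRef`, and `uri: &str`.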

let write_params = WriteParams::default();
let mut reader = RecordBatchIterator::new(
    batches.into_iter().map(Ok),
    schema
);
Dataset::write(reader, &uri, Some(write_params)).await.unwrap();

TITLE: Build Ipadic language model with Lindera DESCRIPTION: This command uses the 'lindera build' tool to compile the Ipadic dictionary. It specifies the dictionary kind as 'ipadic' and points to the extracted model directory to create the main dictionary.

SOURCE: https://github.com/lancedb/lance/blob/main/python/python/tests/models/lindera/README.md#_snippet_1

LANGUAGE: bash CODE:

lindera build --dictionary-kind=ipadic mecab-ipadic-2.7.0-20070801 main

TITLE: Write Lance Dataset in Rust DESCRIPTION: This Rust function write_dataset demonstrates how to create and write a Lance dataset to a specified path. It defines a schema with UInt32 fields, creates a RecordBatch with sample data, and uses WriteParams to set the write mode to Overwrite before writing the dataset to disk.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/rust/write_read_dataset.md#_snippet_0

LANGUAGE: Rust CODE:

// Writes sample dataset to the given path
async fn write_dataset(data_path: &str) {
    // Define new schema
    let schema = Arc::new(Schema::new(vec![
        Field::new("key", DataType::UInt32, false),
        Field::new("value", DataType::UInt32, false),
    ]));

    // Create new record batches
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![
            Arc::new(UInt32Array::from(vec![1, 2, 3, 4, 5, 6])),
            Arc::new(UInt32Array::from(vec![6, 7, 8, 9, 10, 11])),
        ],
    )
    .unwrap();

    let batches = RecordBatchIterator::new([Ok(batch)], schema.clone());

    // Define write parameters (e.g. overwrite dataset)
    let write_params = WriteParams {
        mode: WriteMode::Overwrite,
        ..Default::default()
    };

    Dataset::write(batches, data_path, Some(write_params))
        .await
        .unwrap();
} // End write dataset
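
For comparison, a hedged sketch (not from the source docs) of appending to an existing dataset rather than overwriting it; WriteMode::Append is a variant of the same enum used above.

// Append new batches to an existing dataset instead of replacing it.
let write_params = WriteParams {
    mode: WriteMode::Append,
    ..Default::default()
};
Dataset::write(batches, data_path, Some(write_params))
    .await
    .unwrap();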

TITLE: Build LanceDB Rust JNI Module DESCRIPTION: Specifies the command to build only the Rust-based JNI (Java Native Interface) module of LanceDB. This is useful for developers focusing on the native components without rebuilding the entire Java project.

SOURCE: https://github.com/lancedb/lance/blob/main/java/README.md#_snippet_10

LANGUAGE: shell CODE:

cargo build

TITLE: Read a Lance Dataset and Collect RecordBatches in Rust DESCRIPTION: Opens an existing Lance Dataset from a specified path, scans its content, and collects all resulting RecordBatches into a vector. Error handling is included.

SOURCE: https://github.com/lancedb/lance/blob/main/rust/lance/README.md#_snippet_2

LANGUAGE: rust CODE:

let dataset = Dataset::open(path).await.unwrap();
let mut scanner = dataset.scan();
let batches: Vec<RecordBatch> = scanner
    .try_into_stream()
    .await
    .unwrap()
    .map(|b| b.unwrap())
    .collect::<Vec<RecordBatch>>()
    .await;

TITLE: Create a Vector Index on a Lance Dataset in Rust DESCRIPTION: Demonstrates how to create a vector index on a specified column (e.g., 'embeddings') within a Lance Dataset. It configures vector index parameters like the number of partitions and sub-vectors, noting that list_size / num_sub_vectors must meet SIMD alignment (for example, 128-dimensional embeddings with num_sub_vectors = 16 give sub-vectors of 8 floats).

SOURCE: https://github.com/lancedb/lance/blob/main/rust/lance/README.md#_snippet_4

LANGUAGE: rust CODE:

use ::lance::index::vector::VectorIndexParams;

let mut params = VectorIndexParams::default();
params.num_partitions = 256;
params.num_sub_vectors = 16;

// this will Err if list_size(embeddings) / num_sub_vectors does not meet simd alignment
dataset.create_index(&["embeddings"], IndexType::Vector, None, &params, true).await;
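
A hedged follow-up sketch of querying the indexed column with the scanner's nearest-neighbor search. The `nearest(column, query, k)` call and the stream-collect pattern follow the lance scanner API used elsewhere in this document; the query values and dimension (128) are illustrative.

use arrow::array::Float32Array;
use futures::StreamExt;

// Illustrative query vector; its length must match the indexed embedding dimension.
let query = Float32Array::from(vec![0.0_f32; 128]);

let mut scanner = dataset.scan();
scanner.nearest("embeddings", &query, 10).unwrap();
let results: Vec<RecordBatch> = scanner
    .try_into_stream()
    .await
    .unwrap()
    .map(|b| b.unwrap())
    .collect::<Vec<RecordBatch>>()
    .await;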

TITLE: Retrieve Specific Records from a Lance Dataset in Rust DESCRIPTION: Retrieves specific records from a Lance Dataset based on their indices and a projection. The result is a RecordBatch containing the requested data.

SOURCE: https://github.com/lancedb/lance/blob/main/rust/lance/README.md#_snippet_3

LANGUAGE: rust CODE:

let values: Result<RecordBatch> = dataset.take(&[200, 199, 39, 40, 100], &projection).await;
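
One hedged way the `projection` above might be built, assuming the lance Schema::project helper; the column names are illustrative, not from the source.

// Derive a reduced schema from the dataset's own schema (column names are hypothetical).
let projection = dataset.schema().project(&["id", "name"]).unwrap();
let values: Result<RecordBatch> = dataset.take(&[200, 199, 39, 40, 100], &projection).await;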