This document is a Rust-focused extract of the Lance code-snippet collection: Rust source examples, shell commands for building and testing Rust code (e.g., with `cargo` or `maturin`), and configuration for native development tools such as `lldb`.
========================
CODE SNIPPETS
========================
TITLE: Perform Python development installation
DESCRIPTION: These commands navigate into the `python` directory and perform a development installation of the Lance Python bindings. This allows developers to import and test changes to the Python wrapper directly.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/community/contributing.md#_snippet_1
LANGUAGE: bash
CODE:
```
cd python
maturin develop
```
----------------------------------------
TITLE: Install Python bindings build tool
DESCRIPTION: This command installs `maturin`, a tool essential for building Python packages that integrate with Rust code. It's a prerequisite for setting up the Python development environment for Lance.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/community/contributing.md#_snippet_0
LANGUAGE: bash
CODE:
```
pip install maturin
```
----------------------------------------
TITLE: Install Linux Perf Tools and Configure Kernel Parameters
DESCRIPTION: Installs necessary Linux performance tools (`perf`) on Ubuntu systems and configures the `perf_event_paranoid` kernel parameter. This setup is crucial for allowing non-root users to collect performance data using tools like `perf` and `flamegraph`.
SOURCE: https://github.com/lancedb/lance/blob/__wiki__/Debug.md#_snippet_4
LANGUAGE: sh
CODE:
```
sudo apt install linux-tools-common linux-tools-generic linux-tools-`uname -r`
sudo sh -c "echo -1 > /proc/sys/kernel/perf_event_paranoid"
```
----------------------------------------
TITLE: Run Rust unit tests
DESCRIPTION: This command executes the unit tests for the Rust core format. Running these tests verifies the correctness of the Rust implementation.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/community/contributing.md#_snippet_6
LANGUAGE: bash
CODE:
```
cargo test
```
----------------------------------------
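TITLE: Run a filtered subset of Rust tests
DESCRIPTION: Not from the Lance docs; standard `cargo test` usage added here as a convenience sketch. A substring filter narrows the run to matching test names, `-p` selects a single workspace crate, and `-- --nocapture` shows test stdout. The crate and filter names are illustrative.
LANGUAGE: bash
CODE:
```
# Run only tests whose names contain the filter string, in the `lance` crate
cargo test -p lance dataset::tests
# Show test output instead of capturing it
cargo test -- --nocapture
```
----------------------------------------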
TITLE: Profile a LanceDB benchmark using flamegraph
DESCRIPTION: Generates a flamegraph for a specific benchmark using `cargo-flamegraph`, aiding in performance analysis. It's recommended to run benchmarks once beforehand to avoid setup time being captured in the profile.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_14
LANGUAGE: shell
CODE:
```
flamegraph -F 100 --no-inline -- $(which python) \
-m pytest python/benchmarks \
--benchmark-min-time=2 \
-k test_ivf_pq_index_search
```
----------------------------------------
TITLE: Install Flamegraph Tool
DESCRIPTION: Installs the `flamegraph` profiling tool using Cargo, Rust's package manager. This tool is essential for visualizing CPU usage and call stacks as flame graphs for performance analysis.
SOURCE: https://github.com/lancedb/lance/blob/__wiki__/Debug.md#_snippet_3
LANGUAGE: sh
CODE:
```
cargo install flamegraph
```
----------------------------------------
TITLE: Install Lance Build Dependencies on Ubuntu
DESCRIPTION: This command installs necessary system-level dependencies for building Lance on Ubuntu 22.04, including protobuf, SSL development libraries, and general build tools.
SOURCE: https://github.com/lancedb/lance/blob/main/__wiki__/How-to-Build.md#_snippet_0
LANGUAGE: bash
CODE:
```
sudo apt install protobuf-compiler libssl-dev build-essential pkg-config gfortran
```
----------------------------------------
TITLE: Build Rust core format (release)
DESCRIPTION: This command compiles the Rust core format in release mode. The release build is optimized for performance and is suitable for production deployments or benchmarking.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/community/contributing.md#_snippet_5
LANGUAGE: bash
CODE:
```
cargo build -r
```
----------------------------------------
TITLE: Debug Python Script with LLDB
DESCRIPTION: Demonstrates how to start an LLDB debugging session for a Python script. It involves launching LLDB with the Python interpreter from a virtual environment and then running the target script within the LLDB prompt.
SOURCE: https://github.com/lancedb/lance/blob/__wiki__/Debug.md#_snippet_2
LANGUAGE: sh
CODE:
```
$ lldb ./venv/bin/python
(lldb) r script.py
```
----------------------------------------
TITLE: Install Lance Build Dependencies on Mac
DESCRIPTION: This command installs the protobuf compiler using Homebrew, a required dependency for building Lance on macOS.
SOURCE: https://github.com/lancedb/lance/blob/main/__wiki__/How-to-Build.md#_snippet_1
LANGUAGE: bash
CODE:
```
brew install protobuf
```
----------------------------------------
TITLE: Configure LLDB Initialization Settings
DESCRIPTION: Sets up basic LLDB initialization settings in the `~/.lldbinit` file. This includes configuring the number of source code lines to display before and after a stop, and enabling the loading of `.lldbinit` files from the current working directory.
SOURCE: https://github.com/lancedb/lance/blob/__wiki__/Debug.md#_snippet_0
LANGUAGE: lldb
CODE:
```
# ~/.lldbinit
settings set stop-line-count-before 15
settings set stop-line-count-after 15
settings set target.load-cwd-lldbinit true
```
----------------------------------------
TITLE: Complete Lance Dataset Write and Read Example in Rust
DESCRIPTION: This Rust `main` function provides a complete example demonstrating the usage of `write_dataset` and `read_dataset` functions. It sets up the necessary `arrow` and `lance` imports, defines a temporary data path, and orchestrates the writing and subsequent reading of a Lance dataset.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/rust/write_read_dataset.md#_snippet_2
LANGUAGE: Rust
CODE:
```
use arrow::array::UInt32Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::{RecordBatch, RecordBatchIterator};
use futures::StreamExt;
use lance::dataset::{WriteMode, WriteParams};
use lance::Dataset;
use std::sync::Arc;

#[tokio::main]
async fn main() {
    let data_path: &str = "./temp_data.lance";

    write_dataset(data_path).await;
    read_dataset(data_path).await;
}
```
----------------------------------------
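TITLE: Example Cargo.toml dependencies for the Rust snippets
DESCRIPTION: Not taken from the Lance repository; a minimal sketch of the crate dependencies the write/read example above assumes. Versions are placeholders — pin `lance` to the release you target and keep `arrow` in sync with the version that release is built against.
LANGUAGE: toml
CODE:
```
[dependencies]
# Placeholder versions; align these with your chosen lance release
lance = "*"
arrow = "*"
futures = "0.3"
tokio = { version = "1", features = ["rt-multi-thread", "macros"] }
```
----------------------------------------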
TITLE: Rust: Main Workflow for WikiText to LanceDB Ingestion
DESCRIPTION: This comprehensive example demonstrates the full data ingestion pipeline in Rust. It initializes a Tokio runtime, loads a tokenizer, sets up the Hugging Face API to download WikiText Parquet files, processes them into a `WikiTextBatchReader`, and finally writes the data to a Lance dataset. It also includes verification of the created dataset.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/rust/llm_dataset_creation.md#_snippet_2
LANGUAGE: Rust
CODE:
```
fn main() -> Result<(), Box<dyn Error + Send + Sync>> {
    let rt = tokio::runtime::Runtime::new()?;
    rt.block_on(async {
        // Load tokenizer
        let tokenizer = load_tokenizer("gpt2")?;

        // Set up Hugging Face API
        // Download from https://huggingface.co/datasets/Salesforce/wikitext/tree/main/wikitext-103-raw-v1
        let api = Api::new()?;
        let repo = api.repo(Repo::with_revision(
            "Salesforce/wikitext".into(),
            RepoType::Dataset,
            "main".into(),
        ));

        // Define the parquet files we want to download
        let train_files = vec![
            "wikitext-103-raw-v1/train-00000-of-00002.parquet",
            "wikitext-103-raw-v1/train-00001-of-00002.parquet",
        ];

        let mut parquet_readers = Vec::new();
        for file in &train_files {
            println!("Downloading file: {}", file);
            let file_path = repo.get(file)?;
            let data = std::fs::read(file_path)?;

            // Create a temporary file in the system temp directory and write the downloaded data to it
            let mut temp_file = NamedTempFile::new()?;
            temp_file.write_all(&data)?;

            // Create the parquet reader builder with a larger batch size
            let builder = ParquetRecordBatchReaderBuilder::try_new(temp_file.into_file())?
                .with_batch_size(8192); // Increase batch size for better performance
            parquet_readers.push(builder);
        }

        if parquet_readers.is_empty() {
            println!("No parquet files found to process.");
            return Ok(());
        }

        // Create batch reader
        let num_samples: u64 = 500_000;
        let batch_reader = WikiTextBatchReader::new(parquet_readers, tokenizer, Some(num_samples))?;

        // Save as Lance dataset
        println!("Writing to Lance dataset...");
        let lance_dataset_path = "rust_wikitext_lance_dataset.lance";
        let write_params = WriteParams::default();
        lance::Dataset::write(batch_reader, lance_dataset_path, Some(write_params)).await?;

        // Verify the dataset
        let ds = lance::Dataset::open(lance_dataset_path).await?;
        let scanner = ds.scan();
        let mut stream = scanner.try_into_stream().await?;
        let mut total_rows = 0;
        while let Some(batch_result) = stream.next().await {
            let batch = batch_result?;
            total_rows += batch.num_rows();
        }

        println!(
            "Lance dataset created successfully with {} rows",
            total_rows
        );
        println!("Dataset location: {}", lance_dataset_path);

        Ok(())
    })
}
```
----------------------------------------
TITLE: Build and Test Pylance Python Package
DESCRIPTION: These commands set up a Python virtual environment, install maturin for Rust-Python binding, build the Pylance package in debug mode, and then run its associated tests.
SOURCE: https://github.com/lancedb/lance/blob/main/__wiki__/How-to-Build.md#_snippet_3
LANGUAGE: bash
CODE:
```
cd python
python3 -m venv venv
source venv/bin/activate
pip install maturin
# Build debug build
maturin develop --extras tests
# Run pytest
pytest python/tests/
```
----------------------------------------
TITLE: Install Lance using Cargo
DESCRIPTION: Installs the Lance Rust library as a command-line tool using the Cargo package manager.
SOURCE: https://github.com/lancedb/lance/blob/main/rust/lance/README.md#_snippet_0
LANGUAGE: shell
CODE:
```
cargo install lance
```
----------------------------------------
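TITLE: Add Lance as a library dependency
DESCRIPTION: Standard Cargo usage, not a Lance-specific command: most Rust projects consume Lance as a library crate rather than as an installed binary, and `cargo add` wires the dependency into `Cargo.toml` directly.
LANGUAGE: shell
CODE:
```
cargo add lance
```
----------------------------------------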
TITLE: Build pylance in release mode for benchmarks
DESCRIPTION: Builds the `pylance` module in release mode with debug symbols, enabling benchmark execution and profiling. It includes benchmark-specific extras and features for data generation.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_10
LANGUAGE: shell
CODE:
```
maturin develop --profile release-with-debug --extras benchmarks --features datagen
```
----------------------------------------
TITLE: Query Lance Dataset with Simple SQL in Rust DataFusion
DESCRIPTION: This Rust example demonstrates how to register a Lance dataset as a table in DataFusion using `LanceTableProvider` and execute a simple SQL `SELECT` query to retrieve the first 10 rows. It shows the basic setup for integrating Lance with DataFusion.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/integrations/datafusion.md#_snippet_0
LANGUAGE: rust
CODE:
```
use std::sync::Arc;

use datafusion::prelude::SessionContext;
use crate::datafusion::LanceTableProvider;

let ctx = SessionContext::new();
ctx.register_table(
    "dataset",
    Arc::new(LanceTableProvider::new(
        Arc::new(dataset.clone()),
        /* with_row_id */ false,
        /* with_row_addr */ false,
    )),
)?;

let df = ctx.sql("SELECT * FROM dataset LIMIT 10").await?;
let result = df.collect().await?;
```
----------------------------------------
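TITLE: Run a filtered SQL query against a registered Lance table
DESCRIPTION: A sketch building on the snippet above, not from the Lance docs: once the table is registered, any SQL that DataFusion supports can run against it. The column name `value` is hypothetical, and pretty-printing assumes the arrow crate's `prettyprint` feature is enabled.
LANGUAGE: rust
CODE:
```
// Assumes `ctx` has a "dataset" table registered as shown above
let df = ctx
    .sql("SELECT COUNT(*) AS n FROM dataset WHERE value > 100") // `value` is a hypothetical column
    .await?;
let result = df.collect().await?;
// Requires the arrow "prettyprint" feature
arrow::util::pretty::print_batches(&result)?;
```
----------------------------------------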
TITLE: Run LanceDB code formatters
DESCRIPTION: Applies code formatting rules to the entire project. Specific commands like `make format-python` or `cargo fmt` can be used for language-specific formatting.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_4
LANGUAGE: shell
CODE:
```
make format
```
----------------------------------------
TITLE: Build and Search HNSW Index for Vector Similarity in Rust
DESCRIPTION: This Rust code provides a complete example for vector similarity search. It defines a `ground_truth` function for L2 distance calculation, `create_test_vector_dataset` to generate synthetic fixed-size list vectors, and a `main` function that orchestrates the process. The `main` function generates or loads a dataset, builds an HNSW index using `lance_index::vector::hnsw`, and then performs vector searches, measuring construction and search times, and calculating recall against ground truth.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/rust/hnsw.md#_snippet_0
LANGUAGE: Rust
CODE:
```
use std::collections::HashSet;
use std::sync::Arc;
use arrow::array::{types::Float32Type, Array, FixedSizeListArray};
use arrow::array::{AsArray, FixedSizeListBuilder, Float32Builder};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use arrow::record_batch::RecordBatchIterator;
use arrow_select::concat::concat;
use futures::stream::StreamExt;
use lance::Dataset;
use lance_index::vector::v3::subindex::IvfSubIndex;
use lance_index::vector::{
    flat::storage::FlatFloatStorage,
    hnsw::{builder::HnswBuildParams, HNSW},
};
use lance_linalg::distance::DistanceType;

fn ground_truth(fsl: &FixedSizeListArray, query: &[f32], k: usize) -> HashSet<u32> {
    let mut dists = vec![];
    for i in 0..fsl.len() {
        let dist = lance_linalg::distance::l2_distance(
            query,
            fsl.value(i).as_primitive::<Float32Type>().values(),
        );
        dists.push((dist, i as u32));
    }
    dists.sort_by(|a, b| a.0.partial_cmp(&b.0).unwrap());
    dists.truncate(k);
    dists.into_iter().map(|(_, i)| i).collect()
}

pub async fn create_test_vector_dataset(output: &str, num_rows: usize, dim: i32) {
    let schema = Arc::new(Schema::new(vec![Field::new(
        "vector",
        DataType::FixedSizeList(Arc::new(Field::new("item", DataType::Float32, true)), dim),
        false,
    )]));
    let mut batches = Vec::new();
    // Create a few batches
    for _ in 0..2 {
        let v_builder = Float32Builder::new();
        let mut list_builder = FixedSizeListBuilder::new(v_builder, dim);
        for _ in 0..num_rows {
            for _ in 0..dim {
                list_builder.values().append_value(rand::random::<f32>());
            }
            list_builder.append(true);
        }
        let array = Arc::new(list_builder.finish());
        let batch = RecordBatch::try_new(schema.clone(), vec![array]).unwrap();
        batches.push(batch);
    }
    let batch_reader = RecordBatchIterator::new(batches.into_iter().map(Ok), schema.clone());
    println!("Writing dataset to {}", output);
    Dataset::write(batch_reader, output, None).await.unwrap();
}

#[tokio::main]
async fn main() {
    let uri: Option<String> = None; // None means generate test data
    let column = "vector";
    let ef = 100;
    let max_edges = 30;
    let max_level = 7;

    // 1. Generate a synthetic test dataset of the specified dimensions
    let dataset = if uri.is_none() {
        println!("No uri is provided, generating test dataset...");
        let output = "test_vectors.lance";
        create_test_vector_dataset(output, 1000, 64).await;
        Dataset::open(output).await.expect("Failed to open dataset")
    } else {
        Dataset::open(uri.as_ref().unwrap())
            .await
            .expect("Failed to open dataset")
    };
    println!("Dataset schema: {:#?}", dataset.schema());

    let batches = dataset
        .scan()
        .project(&[column])
        .unwrap()
        .try_into_stream()
        .await
        .unwrap()
        .then(|batch| async move { batch.unwrap().column_by_name(column).unwrap().clone() })
        .collect::<Vec<_>>()
        .await;
    let arrs = batches.iter().map(|b| b.as_ref()).collect::<Vec<_>>();
    let fsl = concat(&arrs).unwrap().as_fixed_size_list().clone();
    println!("Loaded {:?} batches", fsl.len());

    let vector_store = Arc::new(FlatFloatStorage::new(fsl.clone(), DistanceType::L2));

    let q = fsl.value(0);
    let k = 10;
    let gt = ground_truth(&fsl, q.as_primitive::<Float32Type>().values(), k);

    for ef_construction in [15, 30, 50] {
        let now = std::time::Instant::now();
        // 2. Build a hierarchical graph structure for efficient vector search using the Lance API
        let hnsw = HNSW::index_vectors(
            vector_store.as_ref(),
            HnswBuildParams::default()
                .max_level(max_level)
                .num_edges(max_edges)
                .ef_construction(ef_construction),
        )
        .unwrap();
        let construct_time = now.elapsed().as_secs_f32();

        let now = std::time::Instant::now();
        // 3. Perform vector search and compare against the L2-distance ground truth
        let results: HashSet<u32> = hnsw
            .search_basic(q.clone(), k, ef, None, vector_store.as_ref())
            .unwrap()
            .iter()
            .map(|node| node.id)
            .collect();
        let search_time = now.elapsed().as_micros();

        println!(
            "level={}, ef_construct={}, ef={} recall={}: construct={:.3}s search={:.3} us",
            max_level,
            ef_construction,
            ef,
            results.intersection(&gt).count() as f32 / k as f32,
            construct_time,
            search_time
        );
    }
}
```
----------------------------------------
TITLE: Compare LanceDB benchmarks against previous version
DESCRIPTION: Provides a sequence of commands to compare the performance of the current version against the `main` branch. This involves saving a baseline from `main` and then comparing the current branch's performance against it.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_15
LANGUAGE: shell
CODE:
```
CURRENT_BRANCH=$(git branch --show-current)
```
LANGUAGE: shell
CODE:
```
git checkout main
```
LANGUAGE: shell
CODE:
```
maturin develop --profile release-with-debug --features datagen
```
LANGUAGE: shell
CODE:
```
pytest --benchmark-save=baseline python/benchmarks -m "not slow"
```
LANGUAGE: shell
CODE:
```
COMPARE_ID=$(ls .benchmarks/*/ | tail -1 | cut -c1-4)
```
LANGUAGE: shell
CODE:
```
git checkout $CURRENT_BRANCH
```
LANGUAGE: shell
CODE:
```
maturin develop --profile release-with-debug --features datagen
```
LANGUAGE: shell
CODE:
```
pytest --benchmark-compare=$COMPARE_ID python/benchmarks -m "not slow"
```
----------------------------------------
TITLE: Build Rust core format (debug)
DESCRIPTION: This command compiles the Rust core format in debug mode. The debug build includes debugging information and is suitable for development and testing, though it is not optimized for performance.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/community/contributing.md#_snippet_4
LANGUAGE: bash
CODE:
```
cargo build
```
----------------------------------------
TITLE: Format and lint Rust code
DESCRIPTION: These commands are used to automatically format Rust code according to community standards (`cargo fmt`) and to perform static analysis for potential issues (`cargo clippy`). This ensures code quality and consistency.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/community/contributing.md#_snippet_3
LANGUAGE: bash
CODE:
```
cargo fmt --all
cargo clippy --all-features --tests --benches
```
----------------------------------------
TITLE: Run LanceDB code linters
DESCRIPTION: Executes code linters to check for style violations and potential issues. Language-specific linting can be performed with `make lint-python` or `make lint-rust`.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_5
LANGUAGE: shell
CODE:
```
make lint
```
----------------------------------------
TITLE: Clean LanceDB build artifacts
DESCRIPTION: Removes all generated build artifacts and temporary files from the project directory, useful for a clean rebuild.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_9
LANGUAGE: shell
CODE:
```
make clean
```
----------------------------------------
TITLE: Rust: Load Tokenizer from Hugging Face Hub
DESCRIPTION: This function provides a utility to load a tokenizer from the Hugging Face Hub. It takes a model name, creates an API client, retrieves the tokenizer file from the specified repository, and constructs a `Tokenizer` object from it. This is a common pattern for integrating Hugging Face models.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/rust/llm_dataset_creation.md#_snippet_3
LANGUAGE: Rust
CODE:
```
fn load_tokenizer(model_name: &str) -> Result<Tokenizer, Box<dyn Error + Send + Sync>> {
    let api = Api::new()?;
    let repo = api.repo(Repo::with_revision(
        model_name.into(),
        RepoType::Model,
        "main".into(),
    ));

    let tokenizer_path = repo.get("tokenizer.json")?;
    let tokenizer = Tokenizer::from_file(tokenizer_path)?;

    Ok(tokenizer)
}
```
----------------------------------------
TITLE: Build MacOS x86_64 Wheels
DESCRIPTION: This command builds release-mode wheels specifically for x86_64 MacOS. It uses `maturin` to compile the project for the `x86_64-apple-darwin` target, storing the resulting wheels in the 'wheels' directory.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_26
LANGUAGE: Shell
CODE:
```
maturin build --release \
--target x86_64-apple-darwin \
--out wheels
```
----------------------------------------
TITLE: Build and Test Lance Rust Package
DESCRIPTION: These commands clone the Lance repository, navigate to the Rust directory, and then build, test, and benchmark the core Rust components of Lance.
SOURCE: https://github.com/lancedb/lance/blob/main/__wiki__/How-to-Build.md#_snippet_2
LANGUAGE: bash
CODE:
```
git clone https://github.com/lancedb/lance.git
cd lance
# Build rust package
cd rust
cargo build
# Run test
cargo test
# Run benchmarks
cargo bench
```
----------------------------------------
TITLE: Build LanceDB in development mode
DESCRIPTION: Builds the Rust native module in place using `maturin`. This command needs to be re-run whenever Rust code changes, but is not required for Python code modifications.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_0
LANGUAGE: shell
CODE:
```
maturin develop
```
----------------------------------------
TITLE: Download Lindera Language Model
DESCRIPTION: Command-line instruction to download a specific Lindera language model (e.g., ipadic, ko-dic, unidic) for LanceDB. Note that `lindera-cli` must be installed beforehand as Lindera models require compilation.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/tokenizer.md#_snippet_4
LANGUAGE: bash
CODE:
```
python -m lance.download lindera -l [ipadic|ko-dic|unidic]
```
----------------------------------------
TITLE: Decorate Rust Unit Test for Tracing
DESCRIPTION: To enable tracing for a Rust unit test, decorate it with the `#[lance_test_macros::test]` attribute. This macro wraps any existing test attributes, allowing tracing information to be collected during test execution.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_16
LANGUAGE: Rust
CODE:
```
#[lance_test_macros::test(tokio::test)]
async fn test() {
    ...
}
```
----------------------------------------
TITLE: Add Rust Toolchain Targets for Cross-Compilation
DESCRIPTION: To build manylinux wheels for different Linux architectures, you must first add the corresponding Rust toolchain targets. These commands add the x86_64 and aarch64 GNU targets, enabling cross-compilation.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_22
LANGUAGE: Shell
CODE:
```
rustup target add x86_64-unknown-linux-gnu
rustup target add aarch64-unknown-linux-gnu
```
----------------------------------------
TITLE: Build MacOS ARM64 Wheels
DESCRIPTION: This command builds release-mode wheels specifically for ARM64 (aarch64) MacOS. It uses `maturin` to compile the project for the `aarch64-apple-darwin` target, storing the resulting wheels in the 'wheels' directory.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_25
LANGUAGE: Shell
CODE:
```
maturin build --release \
--target aarch64-apple-darwin \
--out wheels
```
----------------------------------------
TITLE: Rust: WikiTextBatchReader Next Batch Logic
DESCRIPTION: This snippet shows the core logic for the `next` method of the `WikiTextBatchReader`. It attempts to build and retrieve the next Parquet reader from a list of available readers. If a reader is successfully built, it's used; otherwise, it handles errors or indicates that no more readers are available.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/rust/llm_dataset_creation.md#_snippet_1
LANGUAGE: Rust
CODE:
```
// Inside the `loop` of `WikiTextBatchReader::next` (full implementation below):
// once the current reader is exhausted, build the next parquet reader if any remain.
if self.current_reader_idx < self.parquet_readers.len() {
    if let Some(builder) = self.parquet_readers[self.current_reader_idx].take() {
        match builder.build() {
            Ok(reader) => {
                self.current_reader = Some(Box::new(reader));
                self.current_reader_idx += 1;
                continue;
            }
            Err(e) => {
                return Some(Err(arrow::error::ArrowError::ExternalError(Box::new(e))))
            }
        }
    }
}

// No more readers available
return None;
```
----------------------------------------
TITLE: Run Rust Unit Test with Tracing Verbosity
DESCRIPTION: Execute a Rust unit test with tracing enabled by setting the `LANCE_TESTING` environment variable to a desired verbosity level (e.g., 'debug', 'info'). This command will generate a JSON trace file in your working directory, which can be viewed in Chrome or Perfetto.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_17
LANGUAGE: Bash
CODE:
```
LANCE_TESTING=debug cargo test dataset::tests::test_create_dataset
```
----------------------------------------
TITLE: Build Linux x86_64 Manylinux Wheels
DESCRIPTION: This command builds release-mode manylinux wheels for x86_64 Linux. It utilizes `maturin` with `zig` for cross-compilation, targeting `manylinux2014` compatibility, and outputs the generated wheels to the 'wheels' directory.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_23
LANGUAGE: Shell
CODE:
```
maturin build --release --zig \
--target x86_64-unknown-linux-gnu \
--compatibility manylinux2014 \
--out wheels
```
----------------------------------------
TITLE: Build Linux ARM64 Manylinux Wheels
DESCRIPTION: This command builds release-mode manylinux wheels for ARM64 (aarch64) Linux. It uses `maturin` with `zig` for cross-compilation, targeting `manylinux2014` compatibility, and places the output wheels in the 'wheels' directory.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_24
LANGUAGE: Shell
CODE:
```
maturin build --release --zig \
--target aarch64-unknown-linux-gnu \
--compatibility manylinux2014 \
--out wheels
```
----------------------------------------
TITLE: Join Multiple Lance Datasets with SQL in Rust DataFusion
DESCRIPTION: This Rust example illustrates how to register multiple Lance datasets (e.g., 'orders' and 'customers') as separate tables in DataFusion. It then performs a SQL `JOIN` operation between these tables to combine data based on a common key, demonstrating more complex query capabilities.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/integrations/datafusion.md#_snippet_1
LANGUAGE: rust
CODE:
```
use std::sync::Arc;

use datafusion::prelude::SessionContext;
use crate::datafusion::LanceTableProvider;

let ctx = SessionContext::new();
ctx.register_table(
    "orders",
    Arc::new(LanceTableProvider::new(
        Arc::new(orders_dataset.clone()),
        /* with_row_id */ false,
        /* with_row_addr */ false,
    )),
)?;
ctx.register_table(
    "customers",
    Arc::new(LanceTableProvider::new(
        Arc::new(customers_dataset.clone()),
        /* with_row_id */ false,
        /* with_row_addr */ false,
    )),
)?;

let df = ctx.sql("
    SELECT o.order_id, o.amount, c.customer_name
    FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
    LIMIT 10
").await?;
let result = df.collect().await?;
```
----------------------------------------
TITLE: Generate Flame Graph from Process ID
DESCRIPTION: Generates a flame graph for a running process using its Process ID (PID). This command is used to capture and visualize CPU profiles, helping to identify performance bottlenecks in an application.
SOURCE: https://github.com/lancedb/lance/blob/__wiki__/Debug.md#_snippet_5
LANGUAGE: sh
CODE:
```
flamegraph -p <PID>
```
----------------------------------------
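TITLE: Locate a process ID for flame graph capture
DESCRIPTION: Standard shell usage, not from the Lance wiki: `pgrep` resolves the PID of a running process by name so it can be passed directly to `flamegraph -p`. The process name is a placeholder.
LANGUAGE: sh
CODE:
```
# "my_benchmark" is a placeholder for the target process name
flamegraph -p $(pgrep -f my_benchmark)
```
----------------------------------------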
TITLE: Clone LanceDB GitHub Repository
DESCRIPTION: Instructions to clone the LanceDB project repository from GitHub to a local machine. This is the first step for setting up the development environment.
SOURCE: https://github.com/lancedb/lance/blob/main/java/README.md#_snippet_11
LANGUAGE: shell
CODE:
```
git clone https://github.com/lancedb/lance.git
```
----------------------------------------
TITLE: Rust Implementation of WikiTextBatchReader
DESCRIPTION: This Rust code defines `WikiTextBatchReader`, a custom implementation of `arrow::record_batch::RecordBatchReader`. It's designed to read text data from Parquet files, tokenize it using a `Tokenizer` from the `tokenizers` crate, and transform it into Arrow `RecordBatch`es. The `process_batch` method handles tokenization, limits the number of samples, and shuffles the tokenized IDs before creating the final `RecordBatch`.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/rust/llm_dataset_creation.md#_snippet_0
LANGUAGE: rust
CODE:
```
use arrow::array::{Array, Int64Builder, ListBuilder, UInt32Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use arrow::record_batch::RecordBatchReader;
use futures::StreamExt;
use hf_hub::{api::sync::Api, Repo, RepoType};
use lance::dataset::WriteParams;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use rand::seq::SliceRandom;
use rand::SeedableRng;
use std::error::Error;
use std::fs::File;
use std::io::Write;
use std::sync::Arc;
use tempfile::NamedTempFile;
use tokenizers::Tokenizer;

// Implement a custom stream batch reader
struct WikiTextBatchReader {
    schema: Arc<Schema>,
    parquet_readers: Vec<Option<ParquetRecordBatchReaderBuilder<File>>>,
    current_reader_idx: usize,
    current_reader: Option<Box<dyn RecordBatchReader + Send>>,
    tokenizer: Tokenizer,
    num_samples: u64,
    cur_samples_cnt: u64,
}

impl WikiTextBatchReader {
    fn new(
        parquet_readers: Vec<ParquetRecordBatchReaderBuilder<File>>,
        tokenizer: Tokenizer,
        num_samples: Option<u64>,
    ) -> Result<Self, Box<dyn Error + Send + Sync>> {
        let schema = Arc::new(Schema::new(vec![Field::new(
            "input_ids",
            DataType::List(Arc::new(Field::new("item", DataType::Int64, true))),
            false,
        )]));

        Ok(Self {
            schema,
            parquet_readers: parquet_readers.into_iter().map(Some).collect(),
            current_reader_idx: 0,
            current_reader: None,
            tokenizer,
            num_samples: num_samples.unwrap_or(100_000),
            cur_samples_cnt: 0,
        })
    }

    fn process_batch(
        &mut self,
        input_batch: &RecordBatch,
    ) -> Result<RecordBatch, arrow::error::ArrowError> {
        let num_rows = input_batch.num_rows();
        let mut token_builder = ListBuilder::new(Int64Builder::with_capacity(num_rows * 1024)); // Pre-allocate space
        let mut should_break = false;

        let column = input_batch.column_by_name("text").unwrap();
        let string_array = column
            .as_any()
            .downcast_ref::<arrow::array::StringArray>()
            .unwrap();

        for i in 0..num_rows {
            if self.cur_samples_cnt >= self.num_samples {
                should_break = true;
                break;
            }
            if !Array::is_null(string_array, i) {
                let text = string_array.value(i);
                // Split paragraph into lines
                for line in text.split('\n') {
                    if let Ok(encoding) = self.tokenizer.encode(line, true) {
                        let tb_values = token_builder.values();
                        for &id in encoding.get_ids() {
                            tb_values.append_value(id as i64);
                        }
                        token_builder.append(true);
                        self.cur_samples_cnt += 1;
                        if self.cur_samples_cnt % 5000 == 0 {
                            println!("Processed {} rows", self.cur_samples_cnt);
                        }
                        if self.cur_samples_cnt >= self.num_samples {
                            should_break = true;
                            break;
                        }
                    }
                }
            }
        }

        // Create array and shuffle it
        let input_ids_array = token_builder.finish();
        // Create shuffled array by randomly sampling indices
        let mut rng = rand::rngs::StdRng::seed_from_u64(1337);
        let len = input_ids_array.len();
        let mut indices: Vec<u32> = (0..len as u32).collect();
        indices.shuffle(&mut rng);

        // Take values in shuffled order
        let indices_array = UInt32Array::from(indices);
        let shuffled = arrow::compute::take(&input_ids_array, &indices_array, None)?;

        let batch = RecordBatch::try_new(self.schema.clone(), vec![Arc::new(shuffled)]);

        if should_break {
            println!("Stop at {} rows", self.cur_samples_cnt);
            self.parquet_readers.clear();
            self.current_reader = None;
        }

        batch
    }
}

impl RecordBatchReader for WikiTextBatchReader {
    fn schema(&self) -> Arc<Schema> {
        self.schema.clone()
    }
}

impl Iterator for WikiTextBatchReader {
    type Item = Result<RecordBatch, arrow::error::ArrowError>;

    fn next(&mut self) -> Option<Self::Item> {
        loop {
            // If we have a current reader, try to get next batch
            if let Some(reader) = &mut self.current_reader {
                if let Some(batch_result) = reader.next() {
                    return Some(batch_result.and_then(|batch| self.process_batch(&batch)));
                }
            }

            // If no current reader or current reader is exhausted, try to get next reader
            // (this branch is the "Next Batch Logic" snippet shown earlier)
            if self.current_reader_idx < self.parquet_readers.len() {
                if let Some(builder) = self.parquet_readers[self.current_reader_idx].take() {
                    match builder.build() {
                        Ok(reader) => {
                            self.current_reader = Some(Box::new(reader));
                            self.current_reader_idx += 1;
                            continue;
                        }
                        Err(e) => {
                            return Some(Err(arrow::error::ArrowError::ExternalError(
                                Box::new(e),
                            )))
                        }
                    }
                }
            }

            // No more readers available
            return None;
        }
    }
}
```
----------------------------------------
TITLE: Set DYLD_LIBRARY_PATH for Lance Python Debugging in LLDB
DESCRIPTION: Configures the `DYLD_LIBRARY_PATH` environment variable specifically for debugging Lance Python projects within LLDB. This ensures that the dynamic linker can find necessary shared libraries located in the third-party distribution directory.
SOURCE: https://github.com/lancedb/lance/blob/__wiki__/Debug.md#_snippet_1
LANGUAGE: lldb
CODE:
```
# /path/to/lance/python/.lldbinit
env DYLD_LIBRARY_PATH=/path/to/thirdparty/dist/lib:${DYLD_LIBRARY_PATH}
```
----------------------------------------
TITLE: Download and extract MeCab Ipadic model
DESCRIPTION: This snippet downloads the gzipped tarball of the MeCab Ipadic model from GitHub and then extracts its contents using tar. This is the first step in preparing the dictionary for building.
SOURCE: https://github.com/lancedb/lance/blob/main/python/python/tests/models/lindera/README.md#_snippet_0
LANGUAGE: bash
CODE:
```
curl -L -o mecab-ipadic-2.7.0-20070801.tar.gz "https://github.com/lindera-morphology/mecab-ipadic/archive/refs/tags/2.7.0-20070801.tar.gz"
tar xvf mecab-ipadic-2.7.0-20070801.tar.gz
```
----------------------------------------
TITLE: Build user dictionary with Lindera
DESCRIPTION: This command demonstrates how to build a custom user dictionary using 'lindera build'. It takes a CSV file as input and creates a new user dictionary, which can be used to extend the base language model.
SOURCE: https://github.com/lancedb/lance/blob/main/python/python/tests/models/lindera/README.md#_snippet_2
LANGUAGE: bash
CODE:
```
lindera build --build-user-dictionary --dictionary-kind=ipadic user_dict/userdict.csv user_dict2
```
----------------------------------------
TITLE: Download Jieba Language Model
DESCRIPTION: Command-line instruction to download the Jieba language model for use with LanceDB. The model will be automatically stored in the default Jieba model directory within the configured language model home.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/tokenizer.md#_snippet_1
LANGUAGE: bash
CODE:
```
python -m lance.download jieba
```
----------------------------------------
TITLE: Read and Inspect Lance Dataset in Rust
DESCRIPTION: This Rust function `read_dataset` shows how to open an existing Lance dataset from a given path. It uses a `scanner` to create a `batch_stream` and then iterates through each `RecordBatch`, printing its number of rows, columns, schema, and the entire batch content.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/rust/write_read_dataset.md#_snippet_1
LANGUAGE: Rust
CODE:
```
// Reads the dataset from the given path and prints the size, schema, and contents of each record batch
async fn read_dataset(data_path: &str) {
    let dataset = Dataset::open(data_path).await.unwrap();
    let scanner = dataset.scan();

    let mut batch_stream = scanner.try_into_stream().await.unwrap().map(|b| b.unwrap());
    while let Some(batch) = batch_stream.next().await {
        println!("Batch size: {}, {}", batch.num_rows(), batch.num_columns()); // print size of batch
        println!("Schema: {:?}", batch.schema()); // print schema of record batch
        println!("Batch: {:?}", batch); // print the entire record batch (schema and data)
    }
} // End read dataset
```
----------------------------------------
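TITLE: Scan a Lance Dataset with projection and filter in Rust
DESCRIPTION: A sketch, not from the Lance docs, extending the read example above: the scanner can push down a column projection and an SQL-style filter before the stream is created. The column names match the `key`/`value` schema written by `write_dataset`; the filter string is assumed to follow the scanner's SQL-like filter syntax.
LANGUAGE: Rust
CODE:
```
// Reads only the projected columns for rows matching the filter
async fn read_filtered(data_path: &str) {
    let dataset = Dataset::open(data_path).await.unwrap();
    let mut scanner = dataset.scan();
    scanner
        .project(&["key", "value"])
        .unwrap()
        .filter("value > 8")
        .unwrap();

    let mut batch_stream = scanner.try_into_stream().await.unwrap();
    while let Some(batch) = batch_stream.next().await {
        println!("Filtered batch: {:?}", batch.unwrap());
    }
}
```
----------------------------------------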
TITLE: Create a Lance Dataset from Arrow RecordBatches in Rust
DESCRIPTION: Demonstrates how to write a collection of Arrow RecordBatches and an Arrow Schema into a new Lance Dataset. It uses default write parameters and an iterator for the batches.
SOURCE: https://github.com/lancedb/lance/blob/main/rust/lance/README.md#_snippet_1
LANGUAGE: rust
CODE:
```
use arrow::record_batch::RecordBatchIterator;
use lance::{dataset::WriteParams, Dataset};

let write_params = WriteParams::default();
let mut reader = RecordBatchIterator::new(
    batches.into_iter().map(Ok),
    schema
);
Dataset::write(reader, &uri, Some(write_params)).await.unwrap();
```
----------------------------------------
TITLE: Build Ipadic language model with Lindera
DESCRIPTION: This command uses the 'lindera build' tool to compile the Ipadic dictionary. It specifies the dictionary kind as 'ipadic' and points to the extracted model directory to create the main dictionary.
SOURCE: https://github.com/lancedb/lance/blob/main/python/python/tests/models/lindera/README.md#_snippet_1
LANGUAGE: bash
CODE:
```
lindera build --dictionary-kind=ipadic mecab-ipadic-2.7.0-20070801 main
```
----------------------------------------
TITLE: Write Lance Dataset in Rust
DESCRIPTION: This Rust function `write_dataset` demonstrates how to create and write a Lance dataset to a specified path. It defines a schema with `UInt32` fields, creates a `RecordBatch` with sample data, and uses `WriteParams` to set the write mode to `Overwrite` before writing the dataset to disk.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/rust/write_read_dataset.md#_snippet_0
LANGUAGE: Rust
CODE:
```
// Writes sample dataset to the given path
async fn write_dataset(data_path: &str) {
    // Define new schema
    let schema = Arc::new(Schema::new(vec![
        Field::new("key", DataType::UInt32, false),
        Field::new("value", DataType::UInt32, false),
    ]));

    // Create new record batches
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![
            Arc::new(UInt32Array::from(vec![1, 2, 3, 4, 5, 6])),
            Arc::new(UInt32Array::from(vec![6, 7, 8, 9, 10, 11])),
        ],
    )
    .unwrap();

    let batches = RecordBatchIterator::new([Ok(batch)], schema.clone());

    // Define write parameters (e.g. overwrite dataset)
    let write_params = WriteParams {
        mode: WriteMode::Overwrite,
        ..Default::default()
    };
    Dataset::write(batches, data_path, Some(write_params))
        .await
        .unwrap();
} // End write dataset
```
----------------------------------------
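TITLE: Append to an existing Lance Dataset in Rust
DESCRIPTION: A sketch, not from the Lance docs: the same write path appends to an existing dataset when `WriteMode::Append` is used instead of `Overwrite`, assuming the new batches share the dataset's schema. `more_batches` stands in for any record batch reader like the one built in `write_dataset`.
LANGUAGE: Rust
CODE:
```
// `more_batches` is a placeholder for a RecordBatchIterator over new data
let append_params = WriteParams {
    mode: WriteMode::Append,
    ..Default::default()
};
Dataset::write(more_batches, data_path, Some(append_params))
    .await
    .unwrap();
```
----------------------------------------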
TITLE: Build LanceDB Rust JNI Module
DESCRIPTION: Specifies the command to build only the Rust-based JNI (Java Native Interface) module of LanceDB. This is useful for developers focusing on the native components without rebuilding the entire Java project.
SOURCE: https://github.com/lancedb/lance/blob/main/java/README.md#_snippet_10
LANGUAGE: shell
CODE:
```
cargo build
```
----------------------------------------
TITLE: Read a Lance Dataset and Collect RecordBatches in Rust
DESCRIPTION: Opens an existing Lance Dataset from a specified path, scans its content, and collects all resulting RecordBatches into a vector. Error handling is included.
SOURCE: https://github.com/lancedb/lance/blob/main/rust/lance/README.md#_snippet_2
LANGUAGE: rust
CODE:
```
let dataset = Dataset::open(path).await.unwrap();
let mut scanner = dataset.scan();
let batches: Vec<RecordBatch> = scanner
    .try_into_stream()
    .await
    .unwrap()
    .map(|b| b.unwrap())
    .collect::<Vec<RecordBatch>>()
    .await;
```
----------------------------------------
TITLE: Create a Vector Index on a Lance Dataset in Rust
DESCRIPTION: Demonstrates how to create a vector index on a specified column (e.g., 'embeddings') within a Lance Dataset. It configures vector index parameters like the number of partitions and sub-vectors, noting potential alignment requirements.
SOURCE: https://github.com/lancedb/lance/blob/main/rust/lance/README.md#_snippet_4
LANGUAGE: rust
CODE:
```
use ::lance::index::vector::VectorIndexParams;

let mut params = VectorIndexParams::default();
params.num_partitions = 256;
params.num_sub_vectors = 16;

// this will Err if list_size(embeddings) / num_sub_vectors does not meet simd alignment
dataset.create_index(&["embeddings"], IndexType::Vector, None, &params, true).await;
```
----------------------------------------
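TITLE: Run a nearest-neighbor query on an indexed Lance Dataset in Rust
DESCRIPTION: A sketch, not taken from the Lance README: once a vector index exists, the scanner can issue an approximate nearest-neighbor query over the indexed column. The query vector contents and dimensionality are placeholders and must match the `embeddings` column.
LANGUAGE: rust
CODE:
```
use arrow::array::Float32Array;
use futures::StreamExt;

// Placeholder query vector; its length must equal the embedding dimension
let query = Float32Array::from(vec![0.0_f32; 128]);
let mut scanner = dataset.scan();
scanner.nearest("embeddings", &query, 10).unwrap();
let results: Vec<RecordBatch> = scanner
    .try_into_stream()
    .await
    .unwrap()
    .map(|b| b.unwrap())
    .collect()
    .await;
```
----------------------------------------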
TITLE: Retrieve Specific Records from a Lance Dataset in Rust
DESCRIPTION: Retrieves specific records from a Lance Dataset based on their indices and a projection. The result is a RecordBatch containing the requested data.
SOURCE: https://github.com/lancedb/lance/blob/main/rust/lance/README.md#_snippet_3
LANGUAGE: rust
CODE:
```
let values: Result<RecordBatch> = dataset.take(&[200, 199, 39, 40, 100], &projection).await;
```