
========================
CODE SNIPPETS
========================
TITLE: Run LanceDB documentation examples tests
DESCRIPTION: Checks the documentation examples for correctness and consistency, ensuring they function as expected.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_3
LANGUAGE: shell
CODE:
```
make doctest
```
----------------------------------------
TITLE: Install documentation website requirements
DESCRIPTION: This command installs the necessary Python packages for building the main documentation website, which is powered by `mkdocs-material`. It ensures all dependencies are met before serving the docs.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/community/contributing.md#_snippet_7
LANGUAGE: bash
CODE:
```
pip install -r docs/requirements.txt
```
----------------------------------------
TITLE: Build and serve documentation website locally
DESCRIPTION: These commands navigate to the `docs` directory and start a local development server for the documentation website. This allows contributors to preview changes to the documentation in real-time.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/community/contributing.md#_snippet_8
LANGUAGE: bash
CODE:
```
cd docs
mkdocs serve
```
----------------------------------------
TITLE: Perform Python development installation
DESCRIPTION: These commands navigate into the `python` directory and perform a development installation of the Lance Python bindings. This allows developers to import and test changes to the Python wrapper directly.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/community/contributing.md#_snippet_1
LANGUAGE: bash
CODE:
```
cd python
maturin develop
```
----------------------------------------
TITLE: Example output of git commit with pre-commit hooks
DESCRIPTION: Demonstrates the console output when committing changes after pre-commit hooks are installed, showing the execution and status of linters like black, isort, and ruff.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_8
LANGUAGE: shell
CODE:
```
git commit -m"Changed some python files"
black....................................................................Passed
isort (python)...........................................................Passed
ruff.....................................................................Passed
[main daf91ed] Changed some python files
1 file changed, 1 insertion(+), 1 deletion(-)
```
----------------------------------------
TITLE: Install LanceDB test dependencies
DESCRIPTION: Installs the necessary Python packages for running tests, including optional test dependencies specified in the project's setup.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_1
LANGUAGE: shell
CODE:
```
pip install '.[tests]'
```
----------------------------------------
TITLE: Install pre-commit tool for LanceDB
DESCRIPTION: Installs the `pre-commit` tool, which enables running formatters and linters automatically before each Git commit.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_6
LANGUAGE: shell
CODE:
```
pip install pre-commit
```
----------------------------------------
TITLE: Download and Extract SIFT 1M Dataset
DESCRIPTION: This snippet provides shell commands to download and extract the SIFT 1M dataset, which is used as a large-scale example for vector search demonstrations. It includes commands to clean up previous downloads and extract the compressed archive.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/quickstart.md#_snippet_11
LANGUAGE: bash
CODE:
```
rm -rf sift* vec_data.lance
wget ftp://ftp.irisa.fr/local/texmex/corpus/sift.tar.gz
tar -xzf sift.tar.gz
```
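Later quickstart snippets load a `vec_data.lance` dataset built from these files, but the conversion step is not included in this extract. A minimal sketch of that step, assuming the standard TEXMEX `.fvecs` layout (each record is an int32 dimension followed by that many float32 values); the `read_fvecs` helper is hypothetical and not part of the original quickstart:
```
import numpy as np
import pyarrow as pa
import lance

def read_fvecs(path: str) -> np.ndarray:
    # Hypothetical helper: .fvecs stores int32 dim, then `dim` float32 values, per vector
    raw = np.fromfile(path, dtype=np.float32)
    dim = raw[:1].view(np.int32)[0]
    return raw.reshape(-1, dim + 1)[:, 1:]

vectors = read_fvecs("sift/sift_base.fvecs")  # expected shape: (1_000_000, 128)
values = pa.array(vectors.ravel(), type=pa.float32())
fsl = pa.FixedSizeListArray.from_arrays(values, vectors.shape[1])
lance.write_dataset(pa.table({"vector": fsl}), "vec_data.lance")
```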
----------------------------------------
TITLE: Create Pandas DataFrame
DESCRIPTION: This code demonstrates how to create a simple Pandas DataFrame. This DataFrame serves as a basic example for subsequent operations, such as writing data to a Lance dataset.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/quickstart.md#_snippet_1
LANGUAGE: python
CODE:
```
df = pd.DataFrame({"a": [5]})
df
```
----------------------------------------
TITLE: TPCH Benchmark Setup and Execution
DESCRIPTION: This snippet outlines how to set up the dataset and run the TPCH Q1 benchmark comparing Lance and Parquet: navigate to the benchmark directory, create a dataset folder, download the required Parquet file under the expected name, and execute the benchmark script. Note: the "generate lance file" step is part of the benchmark process itself; no explicit command is provided for it (a conversion sketch follows the commands below).
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/tpch/README.md#_snippet_0
LANGUAGE: Shell
CODE:
```
cd lance/benchmarks/tpch
mkdir dataset && cd dataset
wget https://github.com/cwida/duckdb-data/releases/download/v1.0/lineitemsf1.snappy.parquet -O lineitem_sf1.parquet
cd ..
```
LANGUAGE: Shell
CODE:
```
python3 benchmark.py q1
```
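Since no explicit command is given for the "generate lance file" step, here is a minimal sketch of how the downloaded lineitem Parquet file could be converted to Lance with pylance. The paths and output name are assumptions; the benchmark script may perform this conversion itself.
```
import pyarrow.dataset as ds
import lance

# Assumed paths relative to lance/benchmarks/tpch; benchmark.py may derive these itself
lineitem = ds.dataset("dataset/lineitem_sf1.parquet", format="parquet")
lance.write_dataset(lineitem, "dataset/lineitem_sf1.lance")
```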
----------------------------------------
TITLE: Install LanceDB pre-commit hooks
DESCRIPTION: Installs the pre-commit hooks defined in the project's configuration, activating automatic linting and formatting on commit attempts.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_7
LANGUAGE: shell
CODE:
```
pre-commit install
```
----------------------------------------
TITLE: Install Python bindings build tool
DESCRIPTION: This command installs `maturin`, a tool essential for building Python packages that integrate with Rust code. It's a prerequisite for setting up the Python development environment for Lance.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/community/contributing.md#_snippet_0
LANGUAGE: bash
CODE:
```
pip install maturin
```
----------------------------------------
TITLE: Start Local Services for S3 Integration Tests
DESCRIPTION: Before running S3 integration tests, you need to start local Minio and DynamoDB services. This command uses Docker Compose to bring up these required services, ensuring the test environment is ready.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_20
LANGUAGE: Shell
CODE:
```
docker compose up
```
----------------------------------------
TITLE: Install preview pylance Python SDK via pip
DESCRIPTION: Install the preview version of the pylance Python SDK to access the latest features and bug fixes. This uses a specific extra index URL for LanceDB's PyPI.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/install.md#_snippet_1
LANGUAGE: Bash
CODE:
```
pip install --pre --extra-index-url https://pypi.fury.io/lancedb/ pylance
```
----------------------------------------
TITLE: Access Specific Lance Dataset Version
DESCRIPTION: This example demonstrates how to load and query a specific historical version of a Lance dataset. By specifying the `version` parameter, users can access data as it existed at a particular point in time, enabling historical analysis or rollbacks.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/quickstart.md#_snippet_8
LANGUAGE: python
CODE:
```
# Version 1
lance.dataset('/tmp/test.lance', version=1).to_table().to_pandas()
# Version 2
lance.dataset('/tmp/test.lance', version=2).to_table().to_pandas()
```
----------------------------------------
TITLE: Install stable pylance Python SDK via pip
DESCRIPTION: Install the stable and recommended version of the pylance Python SDK using the pip package manager.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/install.md#_snippet_0
LANGUAGE: Bash
CODE:
```
pip install pylance
```
----------------------------------------
TITLE: Run all LanceDB tests
DESCRIPTION: Executes the full test suite for the LanceDB project using the `make test` command.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_2
LANGUAGE: shell
CODE:
```
make test
```
----------------------------------------
TITLE: Install Linux Perf Tools and Configure Kernel Parameters
DESCRIPTION: Installs necessary Linux performance tools (`perf`) on Ubuntu systems and configures the `perf_event_paranoid` kernel parameter. This setup is crucial for allowing non-root users to collect performance data using tools like `perf` and `flamegraph`.
SOURCE: https://github.com/lancedb/lance/blob/main/__wiki__/Debug.md#_snippet_4
LANGUAGE: sh
CODE:
```
sudo apt install linux-tools-common linux-tools-generic linux-tools-`uname -r`
sudo sh -c "echo -1 > /proc/sys/kernel/perf_event_paranoid"
```
----------------------------------------
TITLE: Load Lance Vector Dataset
DESCRIPTION: This snippet shows how to load a previously created Lance vector dataset. This step is essential before performing any vector search queries or other operations on the dataset.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/quickstart.md#_snippet_13
LANGUAGE: python
CODE:
```
uri = "vec_data.lance"
sift1m = lance.dataset(uri)
```
----------------------------------------
TITLE: Prepare Parquet File from Pandas DataFrame
DESCRIPTION: This code prepares a Parquet file from a Pandas DataFrame using PyArrow. It cleans up any existing Parquet or Lance files to ensure a fresh start, then converts the DataFrame to a PyArrow Table and writes it as a Parquet dataset.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/quickstart.md#_snippet_3
LANGUAGE: python
CODE:
```
shutil.rmtree("/tmp/test.parquet", ignore_errors=True)
shutil.rmtree("/tmp/test.lance", ignore_errors=True)
tbl = pa.Table.from_pandas(df)
pa.dataset.write_dataset(tbl, "/tmp/test.parquet", format='parquet')
parquet = pa.dataset.dataset("/tmp/test.parquet")
parquet.to_table().to_pandas()
```
----------------------------------------
TITLE: Install required Python libraries
DESCRIPTION: Installs necessary Python packages for data handling, OpenAI API interaction, rate limiting, and LanceDB. The `--quiet` flag suppresses verbose output during installation.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/youtube_transcript_search.ipynb#_snippet_0
LANGUAGE: python
CODE:
```
pip install --quiet openai tqdm ratelimiter retry datasets pylance
```
----------------------------------------
TITLE: Run Rust unit tests
DESCRIPTION: This command executes the unit tests for the Rust core format. Running these tests verifies the correctness of the Rust implementation.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/community/contributing.md#_snippet_6
LANGUAGE: bash
CODE:
```
cargo test
```
----------------------------------------
TITLE: Profile a LanceDB benchmark using flamegraph
DESCRIPTION: Generates a flamegraph for a specific benchmark using `cargo-flamegraph`, aiding in performance analysis. It's recommended to run benchmarks once beforehand to avoid setup time being captured in the profile.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_14
LANGUAGE: shell
CODE:
```
flamegraph -F 100 --no-inline -- $(which python) \
-m pytest python/benchmarks \
--benchmark-min-time=2 \
-k test_ivf_pq_index_search
```
----------------------------------------
TITLE: Install Flamegraph Tool
DESCRIPTION: Installs the `flamegraph` profiling tool using Cargo, Rust's package manager. This tool is essential for visualizing CPU usage and call stacks as flame graphs for performance analysis.
SOURCE: https://github.com/lancedb/lance/blob/main/__wiki__/Debug.md#_snippet_3
LANGUAGE: sh
CODE:
```
cargo install flamegraph
```
----------------------------------------
TITLE: Set up BigANN Benchmark Environment
DESCRIPTION: This snippet provides commands to set up a Python virtual environment, clone the 'big-ann-benchmarks' repository, and install its required dependencies. It prepares the system for running BigANN benchmarks by ensuring all necessary tools and libraries are in place.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/bigann/README.md#_snippet_0
LANGUAGE: bash
CODE:
```
python -m venv venv
. ./venv/bin/activate
git clone https://github.com/harsha-simhadri/big-ann-benchmarks.git
cd big-ann-benchmarks
pip install -r requirements_py3.10.txt
```
----------------------------------------
TITLE: List Lance Dataset Versions
DESCRIPTION: This code shows how to retrieve a list of all available versions for a Lance dataset. This functionality is crucial for understanding the history of changes and for accessing specific historical states of the data.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/quickstart.md#_snippet_7
LANGUAGE: python
CODE:
```
dataset.versions()
```
----------------------------------------
TITLE: Install Lance Build Dependencies on Ubuntu
DESCRIPTION: This command installs necessary system-level dependencies for building Lance on Ubuntu 22.04, including protobuf, SSL development libraries, and general build tools.
SOURCE: https://github.com/lancedb/lance/blob/main/__wiki__/How-to-Build.md#_snippet_0
LANGUAGE: bash
CODE:
```
sudo apt install protobuf-compiler libssl-dev build-essential pkg-config gfortran
```
----------------------------------------
TITLE: Build Rust core format (release)
DESCRIPTION: This command compiles the Rust core format in release mode. The release build is optimized for performance and is suitable for production deployments or benchmarking.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/community/contributing.md#_snippet_5
LANGUAGE: bash
CODE:
```
cargo build -r
```
----------------------------------------
TITLE: Debug Python Script with LLDB
DESCRIPTION: Demonstrates how to start an LLDB debugging session for a Python script. It involves launching LLDB with the Python interpreter from a virtual environment and then running the target script within the LLDB prompt.
SOURCE: https://github.com/lancedb/lance/blob/main/__wiki__/Debug.md#_snippet_2
LANGUAGE: sh
CODE:
```
$ lldb ./venv/bin/python
(lldb) r script.py
```
----------------------------------------
TITLE: Install Lance Build Dependencies on Mac
DESCRIPTION: This command installs the protobuf compiler using Homebrew, a required dependency for building Lance on macOS.
SOURCE: https://github.com/lancedb/lance/blob/main/__wiki__/How-to-Build.md#_snippet_1
LANGUAGE: bash
CODE:
```
brew install protobuf
```
----------------------------------------
TITLE: Configure LLDB Initialization Settings
DESCRIPTION: Sets up basic LLDB initialization settings in the `~/.lldbinit` file. This includes configuring the number of source code lines to display before and after a stop, and enabling the loading of `.lldbinit` files from the current working directory.
SOURCE: https://github.com/lancedb/lance/blob/main/__wiki__/Debug.md#_snippet_0
LANGUAGE: lldb
CODE:
```
# ~/.lldbinit
settings set stop-line-count-before 15
settings set stop-line-count-after 15
settings set target.load-cwd-lldbinit true
```
----------------------------------------
TITLE: List all versions of a Lance dataset
DESCRIPTION: Retrieves and displays the version history of the Lance dataset, showing all previous and current states of the data.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_9
LANGUAGE: Python
CODE:
```
dataset.versions()
```
----------------------------------------
TITLE: Load Lance Dataset
DESCRIPTION: Initializes a Lance dataset object from a specified URI, preparing it for subsequent operations like nearest neighbor searches.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_20
LANGUAGE: python
CODE:
```
sift1m = lance.dataset(uri)
```
----------------------------------------
TITLE: Complete Lance Dataset Write and Read Example in Rust
DESCRIPTION: This Rust `main` function provides a complete example demonstrating the usage of `write_dataset` and `read_dataset` functions. It sets up the necessary `arrow` and `lance` imports, defines a temporary data path, and orchestrates the writing and subsequent reading of a Lance dataset.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/rust/write_read_dataset.md#_snippet_2
LANGUAGE: Rust
CODE:
```
use arrow::array::UInt32Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::{RecordBatch, RecordBatchIterator};
use futures::StreamExt;
use lance::dataset::{WriteMode, WriteParams};
use lance::Dataset;
use std::sync::Arc;

#[tokio::main]
async fn main() {
    let data_path: &str = "./temp_data.lance";

    write_dataset(data_path).await;
    read_dataset(data_path).await;
}
```
----------------------------------------
TITLE: Rust: Main Workflow for WikiText to LanceDB Ingestion
DESCRIPTION: This comprehensive example demonstrates the full data ingestion pipeline in Rust. It initializes a Tokio runtime, loads a tokenizer, sets up the Hugging Face API to download WikiText Parquet files, processes them into a `WikiTextBatchReader`, and finally writes the data to a Lance dataset. It also includes verification of the created dataset.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/rust/llm_dataset_creation.md#_snippet_2
LANGUAGE: Rust
CODE:
```
fn main() -> Result<(), Box<dyn Error + Send + Sync>> {
    let rt = tokio::runtime::Runtime::new()?;

    rt.block_on(async {
        // Load tokenizer
        let tokenizer = load_tokenizer("gpt2")?;

        // Set up Hugging Face API
        // Download from https://huggingface.co/datasets/Salesforce/wikitext/tree/main/wikitext-103-raw-v1
        let api = Api::new()?;
        let repo = api.repo(Repo::with_revision(
            "Salesforce/wikitext".into(),
            RepoType::Dataset,
            "main".into(),
        ));

        // Define the parquet files we want to download
        let train_files = vec![
            "wikitext-103-raw-v1/train-00000-of-00002.parquet",
            "wikitext-103-raw-v1/train-00001-of-00002.parquet",
        ];

        let mut parquet_readers = Vec::new();
        for file in &train_files {
            println!("Downloading file: {}", file);
            let file_path = repo.get(file)?;
            let data = std::fs::read(file_path)?;

            // Create a temporary file in the system temp directory and write the downloaded data to it
            let mut temp_file = NamedTempFile::new()?;
            temp_file.write_all(&data)?;

            // Create the parquet reader builder with a larger batch size
            let builder = ParquetRecordBatchReaderBuilder::try_new(temp_file.into_file())?
                .with_batch_size(8192); // Increase batch size for better performance
            parquet_readers.push(builder);
        }

        if parquet_readers.is_empty() {
            println!("No parquet files found to process.");
            return Ok(());
        }

        // Create batch reader
        let num_samples: u64 = 500_000;
        let batch_reader = WikiTextBatchReader::new(parquet_readers, tokenizer, Some(num_samples))?;

        // Save as Lance dataset
        println!("Writing to Lance dataset...");
        let lance_dataset_path = "rust_wikitext_lance_dataset.lance";
        let write_params = WriteParams::default();
        lance::Dataset::write(batch_reader, lance_dataset_path, Some(write_params)).await?;

        // Verify the dataset
        let ds = lance::Dataset::open(lance_dataset_path).await?;
        let scanner = ds.scan();
        let mut stream = scanner.try_into_stream().await?;

        let mut total_rows = 0;
        while let Some(batch_result) = stream.next().await {
            let batch = batch_result?;
            total_rows += batch.num_rows();
        }

        println!(
            "Lance dataset created successfully with {} rows",
            total_rows
        );
        println!("Dataset location: {}", lance_dataset_path);

        Ok(())
    })
}
```
----------------------------------------
TITLE: Build and Test Pylance Python Package
DESCRIPTION: These commands set up a Python virtual environment, install maturin for Rust-Python binding, build the Pylance package in debug mode, and then run its associated tests.
SOURCE: https://github.com/lancedb/lance/blob/main/__wiki__/How-to-Build.md#_snippet_3
LANGUAGE: bash
CODE:
```
cd python
python3 -m venv venv
source venv/bin/activate
pip install maturin
# Build debug build
maturin develop --extras tests
# Run pytest
pytest python/tests/
```
----------------------------------------
TITLE: Install Lance using Cargo
DESCRIPTION: Installs the Lance Rust library as a command-line tool using the Cargo package manager.
SOURCE: https://github.com/lancedb/lance/blob/main/rust/lance/README.md#_snippet_0
LANGUAGE: shell
CODE:
```
cargo install lance
```
----------------------------------------
TITLE: Append Data to Lance Dataset
DESCRIPTION: This example illustrates how to append new data to an existing Lance dataset. It creates a new Pandas DataFrame, converts it to a PyArrow Table, and then uses `lance.write_dataset` with `mode="append"` to add the new rows, creating a new version of the dataset.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/quickstart.md#_snippet_5
LANGUAGE: python
CODE:
```
df = pd.DataFrame({"a": [10]})
tbl = pa.Table.from_pandas(df)
dataset = lance.write_dataset(tbl, "/tmp/test.lance", mode="append")
dataset.to_table().to_pandas()
```
----------------------------------------
TITLE: Access Lance Dataset by Tag
DESCRIPTION: This code demonstrates how to load a Lance dataset using a previously defined tag instead of a numerical version. This allows for more intuitive access to specific, meaningful versions of the data, improving readability and maintainability.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/quickstart.md#_snippet_10
LANGUAGE: python
CODE:
```
lance.dataset('/tmp/test.lance', version="stable").to_table().to_pandas()
```
----------------------------------------
TITLE: Build pylance in release mode for benchmarks
DESCRIPTION: Builds the `pylance` module in release mode with debug symbols, enabling benchmark execution and profiling. It includes benchmark-specific extras and features for data generation.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_10
LANGUAGE: shell
CODE:
```
maturin develop --profile release-with-debug --extras benchmarks --features datagen
```
----------------------------------------
TITLE: Query Lance Dataset with Simple SQL in Rust DataFusion
DESCRIPTION: This Rust example demonstrates how to register a Lance dataset as a table in DataFusion using `LanceTableProvider` and execute a simple SQL `SELECT` query to retrieve the first 10 rows. It shows the basic setup for integrating Lance with DataFusion.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/integrations/datafusion.md#_snippet_0
LANGUAGE: rust
CODE:
```
use std::sync::Arc;

use datafusion::prelude::SessionContext;
use crate::datafusion::LanceTableProvider;

let ctx = SessionContext::new();
ctx.register_table(
    "dataset",
    Arc::new(LanceTableProvider::new(
        Arc::new(dataset.clone()),
        /* with_row_id */ false,
        /* with_row_addr */ false,
    )),
)?;

let df = ctx.sql("SELECT * FROM dataset LIMIT 10").await?;
let result = df.collect().await?;
```
----------------------------------------
TITLE: Install Lance Preview Release
DESCRIPTION: Installs a preview release of the `pylance` library, which includes the latest features and bug fixes. Preview releases are published more frequently and offer early access to new developments.
SOURCE: https://github.com/lancedb/lance/blob/main/README.md#_snippet_1
LANGUAGE: shell
CODE:
```
pip install --pre --extra-index-url https://pypi.fury.io/lancedb/ pylance
```
----------------------------------------
TITLE: Install LanceDB and Python Dependencies
DESCRIPTION: Installs specific versions of LanceDB, pandas, and duckdb required for running the benchmarks. This ensures compatibility and reproducibility of the benchmark results.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/sift/README.md#_snippet_0
LANGUAGE: sh
CODE:
```
pip install lancedb==0.3.6
pip install pandas~=2.1.0
pip install duckdb~=0.9.0
```
----------------------------------------
TITLE: Prepare HD-Vila Dataset with Python venv
DESCRIPTION: This snippet outlines the steps to set up a Python virtual environment, activate it, and install necessary dependencies from `requirements.txt` for the HD-Vila dataset. It ensures a clean and isolated environment for project dependencies.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/hd-vila/README.md#_snippet_0
LANGUAGE: python
CODE:
```
python3 -m venv venv
. ./venv/bin/activate
pip install -r requirements.txt
```
----------------------------------------
TITLE: Run Python unit and integration tests
DESCRIPTION: These commands execute the unit tests and integration tests for the Python components of the Lance project. Running these tests is crucial to ensure code changes do not introduce regressions.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/community/contributing.md#_snippet_2
LANGUAGE: bash
CODE:
```
make test
make integtest
```
----------------------------------------
TITLE: Import necessary libraries for LanceDB operations
DESCRIPTION: This snippet imports `shutil`, `lance`, `numpy`, `pandas`, and `pyarrow` for file system operations, LanceDB interactions, numerical computing, data manipulation, and Arrow table handling, respectively.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_0
LANGUAGE: Python
CODE:
```
import shutil
import lance
import numpy as np
import pandas as pd
import pyarrow as pa
```
----------------------------------------
TITLE: Create a Pandas DataFrame for LanceDB
DESCRIPTION: Initializes a simple Pandas DataFrame with a single column 'a' and a value of 5. This DataFrame will be used as input for creating a Lance dataset.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_1
LANGUAGE: Python
CODE:
```
df = pd.DataFrame({"a": [5]})
df
```
----------------------------------------
TITLE: Sample Query Vectors from Lance Dataset
DESCRIPTION: This code demonstrates how to sample a subset of vectors from the loaded Lance dataset to be used as query vectors for nearest neighbor search. It leverages DuckDB for efficient sampling of the vector column.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/quickstart.md#_snippet_14
LANGUAGE: python
CODE:
```
import duckdb
# Make sure DuckDB v0.7+ is installed
samples = duckdb.query("SELECT vector FROM sift1m USING SAMPLE 100").to_df().vector
```
----------------------------------------
TITLE: Execute Tunable Nearest Neighbor Search
DESCRIPTION: Demonstrates how to perform a nearest neighbor search with tunable parameters like 'nprobes' and 'refine_factor' to balance latency and recall. The result is converted to a Pandas DataFrame.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_22
LANGUAGE: python
CODE:
```
%%time
sift1m.to_table(
    nearest={
        "column": "vector",
        "q": samples[0],
        "k": 10,
        "nprobes": 10,
        "refine_factor": 5,
    }
).to_pandas()
```
----------------------------------------
TITLE: Load SIFT vector dataset from Lance file
DESCRIPTION: Defines the URI for the Lance vector dataset and then loads it using `lance.dataset()`, making the SIFT 1M vector data accessible for further operations.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_16
LANGUAGE: Python
CODE:
```
uri = "vec_data.lance"
sift1m = lance.dataset(uri)
```
----------------------------------------
TITLE: Import LanceDB Libraries
DESCRIPTION: This snippet imports the necessary Python libraries for working with LanceDB, including `shutil` for file operations, `lance` for core LanceDB functionalities, `numpy` for numerical operations, `pandas` for data manipulation, and `pyarrow` for data interchange.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/quickstart.md#_snippet_0
LANGUAGE: python
CODE:
```
import shutil
import lance
import numpy as np
import pandas as pd
import pyarrow as pa
```
----------------------------------------
TITLE: Run all LanceDB benchmarks (including slow tests)
DESCRIPTION: Executes all performance benchmarks, including those marked as 'slow', which may take a longer time to complete.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_12
LANGUAGE: shell
CODE:
```
pytest python/benchmarks
```
----------------------------------------
TITLE: Prepare Python Virtual Environment for Benchmarks
DESCRIPTION: Creates and activates a Python virtual environment, then installs required packages from `requirements.txt`. This isolates project dependencies and ensures a clean execution environment for the benchmark scripts.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/sift/README.md#_snippet_2
LANGUAGE: sh
CODE:
```
python3 -m venv venv
. ./venv/bin/activate
pip install -r requirements.txt
```
----------------------------------------
TITLE: Create Tags for Lance Dataset Versions
DESCRIPTION: This snippet illustrates how to create human-readable tags for specific versions of a Lance dataset. Tags provide a convenient way to mark and reference important dataset states, such as 'stable' or 'nightly' builds, simplifying version management.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/quickstart.md#_snippet_9
LANGUAGE: python
CODE:
```
dataset.tags.create("stable", 2)
dataset.tags.create("nightly", 3)
dataset.tags.list()
```
----------------------------------------
TITLE: Run LanceDB code formatters
DESCRIPTION: Applies code formatting rules to the entire project. Specific commands like `make format-python` or `cargo fmt` can be used for language-specific formatting.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_4
LANGUAGE: shell
CODE:
```
make format
```
----------------------------------------
TITLE: Build and Search HNSW Index for Vector Similarity in Rust
DESCRIPTION: This Rust code provides a complete example for vector similarity search. It defines a `ground_truth` function for L2 distance calculation, `create_test_vector_dataset` to generate synthetic fixed-size list vectors, and a `main` function that orchestrates the process. The `main` function generates or loads a dataset, builds an HNSW index using `lance_index::vector::hnsw`, and then performs vector searches, measuring construction and search times, and calculating recall against ground truth.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/rust/hnsw.md#_snippet_0
LANGUAGE: Rust
CODE:
```
use std::collections::HashSet;
use std::sync::Arc;
use arrow::array::{types::Float32Type, Array, FixedSizeListArray};
use arrow::array::{AsArray, FixedSizeListBuilder, Float32Builder};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use arrow::record_batch::RecordBatchIterator;
use arrow_select::concat::concat;
use futures::stream::StreamExt;
use lance::Dataset;
use lance_index::vector::v3::subindex::IvfSubIndex;
use lance_index::vector::{
    flat::storage::FlatFloatStorage,
    hnsw::{builder::HnswBuildParams, HNSW},
};
use lance_linalg::distance::DistanceType;

fn ground_truth(fsl: &FixedSizeListArray, query: &[f32], k: usize) -> HashSet<u32> {
    let mut dists = vec![];
    for i in 0..fsl.len() {
        let dist = lance_linalg::distance::l2_distance(
            query,
            fsl.value(i).as_primitive::<Float32Type>().values(),
        );
        dists.push((dist, i as u32));
    }
    dists.sort_by(|a, b| a.0.partial_cmp(&b.0).unwrap());
    dists.truncate(k);
    dists.into_iter().map(|(_, i)| i).collect()
}

pub async fn create_test_vector_dataset(output: &str, num_rows: usize, dim: i32) {
    let schema = Arc::new(Schema::new(vec![Field::new(
        "vector",
        DataType::FixedSizeList(Arc::new(Field::new("item", DataType::Float32, true)), dim),
        false,
    )]));
    let mut batches = Vec::new();
    // Create a few batches
    for _ in 0..2 {
        let v_builder = Float32Builder::new();
        let mut list_builder = FixedSizeListBuilder::new(v_builder, dim);
        for _ in 0..num_rows {
            for _ in 0..dim {
                list_builder.values().append_value(rand::random::<f32>());
            }
            list_builder.append(true);
        }
        let array = Arc::new(list_builder.finish());
        let batch = RecordBatch::try_new(schema.clone(), vec![array]).unwrap();
        batches.push(batch);
    }
    let batch_reader = RecordBatchIterator::new(batches.into_iter().map(Ok), schema.clone());
    println!("Writing dataset to {}", output);
    Dataset::write(batch_reader, output, None).await.unwrap();
}

#[tokio::main]
async fn main() {
    let uri: Option<String> = None; // None means generate test data
    let column = "vector";
    let ef = 100;
    let max_edges = 30;
    let max_level = 7;

    // 1. Generate a synthetic test data of specified dimensions
    let dataset = if uri.is_none() {
        println!("No uri is provided, generating test dataset...");
        let output = "test_vectors.lance";
        create_test_vector_dataset(output, 1000, 64).await;
        Dataset::open(output).await.expect("Failed to open dataset")
    } else {
        Dataset::open(uri.as_ref().unwrap())
            .await
            .expect("Failed to open dataset")
    };
    println!("Dataset schema: {:#?}", dataset.schema());

    let batches = dataset
        .scan()
        .project(&[column])
        .unwrap()
        .try_into_stream()
        .await
        .unwrap()
        .then(|batch| async move { batch.unwrap().column_by_name(column).unwrap().clone() })
        .collect::<Vec<_>>()
        .await;
    let arrs = batches.iter().map(|b| b.as_ref()).collect::<Vec<_>>();
    let fsl = concat(&arrs).unwrap().as_fixed_size_list().clone();
    println!("Loaded {:?} batches", fsl.len());

    let vector_store = Arc::new(FlatFloatStorage::new(fsl.clone(), DistanceType::L2));

    let q = fsl.value(0);
    let k = 10;
    let gt = ground_truth(&fsl, q.as_primitive::<Float32Type>().values(), k);

    for ef_construction in [15, 30, 50] {
        let now = std::time::Instant::now();
        // 2. Build a hierarchical graph structure for efficient vector search using Lance API
        let hnsw = HNSW::index_vectors(
            vector_store.as_ref(),
            HnswBuildParams::default()
                .max_level(max_level)
                .num_edges(max_edges)
                .ef_construction(ef_construction),
        )
        .unwrap();
        let construct_time = now.elapsed().as_secs_f32();
        let now = std::time::Instant::now();
        // 3. Perform vector search with different parameters and compute the ground truth using L2 distance search
        let results: HashSet<u32> = hnsw
            .search_basic(q.clone(), k, ef, None, vector_store.as_ref())
            .unwrap()
            .iter()
            .map(|node| node.id)
            .collect();
        let search_time = now.elapsed().as_micros();
        println!(
            "level={}, ef_construct={}, ef={} recall={}: construct={:.3}s search={:.3} us",
            max_level,
            ef_construction,
            ef,
            results.intersection(&gt).count() as f32 / k as f32,
            construct_time,
            search_time
        );
    }
}
```
----------------------------------------
TITLE: LanceDB Nearest Neighbor Search Parameters
DESCRIPTION: This section details the parameters available for tuning nearest neighbor searches in LanceDB, including 'q', 'k', 'nprobes', and 'refine_factor'.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/quickstart.md#_snippet_19
LANGUAGE: APIDOC
CODE:
```
"nearest": {
"column": "string", // Name of the vector column
"q": "vector", // The query vector for nearest neighbor search
"k": "integer", // The number of nearest neighbors to return
"nprobes": "integer", // How many IVF partitions to search
"refine_factor": "integer" // Controls re-ranking: if k=10 and refine_factor=5, retrieves 50 nearest neighbors by ANN and re-sorts using actual distances, then returns top 10. Improves recall without sacrificing performance too much.
}
```
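A usage sketch showing how these parameters map onto a `to_table` call; `sift1m` and `samples` follow the quickstart snippets elsewhere in this document.
```
# `sift1m` and `samples` come from the SIFT quickstart snippets above
result = sift1m.to_table(
    nearest={
        "column": "vector",     # name of the vector column
        "q": samples[0],        # query vector
        "k": 10,                # number of neighbors to return
        "nprobes": 10,          # IVF partitions to probe
        "refine_factor": 5,     # re-rank 50 ANN candidates by exact distance, keep top 10
    }
)
print(result.to_pandas())
```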
----------------------------------------
TITLE: Install Lance Python Library
DESCRIPTION: Installs the stable release of the `pylance` library using pip, providing access to Lance's functionalities in Python.
SOURCE: https://github.com/lancedb/lance/blob/main/README.md#_snippet_0
LANGUAGE: shell
CODE:
```
pip install pylance
```
----------------------------------------
TITLE: Convert Parquet Dataset to Lance
DESCRIPTION: This snippet demonstrates the straightforward conversion of an existing PyArrow Parquet dataset into a Lance dataset. It uses `lance.write_dataset` to perform the conversion and then verifies the content of the newly created Lance dataset.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/quickstart.md#_snippet_4
LANGUAGE: python
CODE:
```
dataset = lance.write_dataset(parquet, "/tmp/test.lance")
# Make sure it's the same
dataset.to_table().to_pandas()
```
----------------------------------------
TITLE: Convert Parquet dataset to Lance dataset
DESCRIPTION: Converts an existing PyArrow Parquet dataset directly into a Lance dataset in a single line of code, demonstrating seamless integration.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_4
LANGUAGE: Python
CODE:
```
dataset = lance.write_dataset(parquet, "/tmp/test.lance")
```
----------------------------------------
TITLE: Compare LanceDB benchmarks against previous version
DESCRIPTION: Provides a sequence of commands to compare the performance of the current version against the `main` branch. This involves saving a baseline from `main` and then comparing the current branch's performance against it.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_15
LANGUAGE: shell
CODE:
```
CURRENT_BRANCH=$(git branch --show-current)
```
LANGUAGE: shell
CODE:
```
git checkout main
```
LANGUAGE: shell
CODE:
```
maturin develop --profile release-with-debug --features datagen
```
LANGUAGE: shell
CODE:
```
pytest --benchmark-save=baseline python/benchmarks -m "not slow"
```
LANGUAGE: shell
CODE:
```
COMPARE_ID=$(ls .benchmarks/*/ | tail -1 | cut -c1-4)
```
LANGUAGE: shell
CODE:
```
git checkout $CURRENT_BRANCH
```
LANGUAGE: shell
CODE:
```
maturin develop --profile release-with-debug --features datagen
```
LANGUAGE: shell
CODE:
```
pytest --benchmark-compare=$COMPARE_ID python/benchmarks -m "not slow"
```
----------------------------------------
TITLE: Build Rust core format (debug)
DESCRIPTION: This command compiles the Rust core format in debug mode. The debug build includes debugging information and is suitable for development and testing, though it is not optimized for performance.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/community/contributing.md#_snippet_4
LANGUAGE: bash
CODE:
```
cargo build
```
----------------------------------------
TITLE: Download and extract SIFT 1M dataset for vector operations
DESCRIPTION: Removes any existing SIFT files and then downloads the `sift.tar.gz` archive from the specified FTP server. Finally, it extracts the contents of the tarball, preparing the SIFT 1M dataset for vector processing.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_14
LANGUAGE: Bash
CODE:
```
!rm -rf sift* vec_data.lance
!wget ftp://ftp.irisa.fr/local/texmex/corpus/sift.tar.gz
!tar -xzf sift.tar.gz
```
----------------------------------------
TITLE: Format and lint Rust code
DESCRIPTION: These commands are used to automatically format Rust code according to community standards (`cargo fmt`) and to perform static analysis for potential issues (`cargo clippy`). This ensures code quality and consistency.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/community/contributing.md#_snippet_3
LANGUAGE: bash
CODE:
```
cargo fmt --all
cargo clippy --all-features --tests --benches
```
----------------------------------------
TITLE: Run a specific LanceDB benchmark by name
DESCRIPTION: Filters and runs a particular benchmark using pytest's `-k` flag, allowing substring matching for the benchmark name.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_13
LANGUAGE: shell
CODE:
```
pytest python/benchmarks -k test_ivf_pq_index_search
```
----------------------------------------
TITLE: Run LanceDB code linters
DESCRIPTION: Executes code linters to check for style violations and potential issues. Language-specific linting can be performed with `make lint-python` or `make lint-rust`.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_5
LANGUAGE: shell
CODE:
```
make lint
```
----------------------------------------
TITLE: Verify converted Lance dataset content
DESCRIPTION: Reads the newly created Lance dataset and converts it back to a Pandas DataFrame to confirm that the data was correctly written and matches the original content.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_5
LANGUAGE: Python
CODE:
```
# make sure it's the same
dataset.to_table().to_pandas()
```
----------------------------------------
TITLE: Prepare Dbpedia-entities-openai Dataset
DESCRIPTION: This snippet provides shell commands to set up a Python virtual environment, install necessary dependencies from 'requirements.txt', and then generate the Dbpedia-entities-openai dataset in Lance format using 'datagen.py'. It requires Python 3.10 or newer.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/dbpedia-openai/README.md#_snippet_0
LANGUAGE: sh
CODE:
```
# Python 3.10+
python3 -m venv venv
. ./venv/bin/activate
# install dependencies
pip install -r requirements.txt
# Generate dataset in lance format.
./datagen.py
```
----------------------------------------
TITLE: Clean LanceDB build artifacts
DESCRIPTION: Removes all generated build artifacts and temporary files from the project directory, useful for a clean rebuild.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_9
LANGUAGE: shell
CODE:
```
make clean
```
----------------------------------------
TITLE: Query Nearest Neighbors with Specific Features
DESCRIPTION: Performs a nearest neighbor search while simultaneously retrieving specific feature columns ('revenue') alongside the vector results. This demonstrates fetching combined data in a single call.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_25
LANGUAGE: python
CODE:
```
sift1m.to_table(columns=["revenue"], nearest={"column": "vector", "q": samples[0], "k": 10}).to_pandas()
```
----------------------------------------
TITLE: Create named tags for Lance dataset versions
DESCRIPTION: Assigns human-readable tags ('stable', 'nightly') to specific versions (2 and 3) of the Lance dataset. Then, it lists all defined tags, providing aliases for version numbers.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_12
LANGUAGE: Python
CODE:
```
dataset.tags.create("stable", 2)
dataset.tags.create("nightly", 3)
dataset.tags.list()
```
----------------------------------------
TITLE: Access Lance dataset using a named tag
DESCRIPTION: Loads the Lance dataset by referencing a previously created tag ('stable') instead of a version number, and converts it to a Pandas DataFrame, showcasing tag-based version access.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_13
LANGUAGE: Python
CODE:
```
lance.dataset('/tmp/test.lance', version="stable").to_table().to_pandas()
```
----------------------------------------
TITLE: Run LanceDB benchmarks (excluding slow tests)
DESCRIPTION: Executes the performance benchmarks located in `python/benchmarks`, skipping tests explicitly marked as 'slow'. These benchmarks are designed for quick iteration and regression catching.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_11
LANGUAGE: shell
CODE:
```
pytest python/benchmarks -m "not slow"
```
----------------------------------------
TITLE: Verify overwritten Lance dataset content
DESCRIPTION: Reads the current state of the Lance dataset and converts it to a Pandas DataFrame to confirm that the overwrite operation was successful and the dataset now contains the new data.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_8
LANGUAGE: Python
CODE:
```
dataset.to_table().to_pandas()
```
----------------------------------------
TITLE: Rust: Load Tokenizer from Hugging Face Hub
DESCRIPTION: This function provides a utility to load a tokenizer from the Hugging Face Hub. It takes a model name, creates an API client, retrieves the tokenizer file from the specified repository, and constructs a `Tokenizer` object from it. This is a common pattern for integrating Hugging Face models.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/rust/llm_dataset_creation.md#_snippet_3
LANGUAGE: Rust
CODE:
```
fn load_tokenizer(model_name: &str) -> Result<Tokenizer, Box<dyn Error + Send + Sync>> {
    let api = Api::new()?;
    let repo = api.repo(Repo::with_revision(
        model_name.into(),
        RepoType::Model,
        "main".into(),
    ));

    let tokenizer_path = repo.get("tokenizer.json")?;
    let tokenizer = Tokenizer::from_file(tokenizer_path)?;

    Ok(tokenizer)
}
```
----------------------------------------
TITLE: Sample query vectors from Lance dataset using DuckDB
DESCRIPTION: Imports `duckdb` and queries the `sift1m` Lance dataset to sample 100 vectors from the 'vector' column. The sampled vectors are converted to a Pandas DataFrame column, to be used as query inputs for KNN search.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_17
LANGUAGE: Python
CODE:
```
import duckdb
# if this segfaults make sure duckdb v0.7+ is installed
samples = duckdb.query("SELECT vector FROM sift1m USING SAMPLE 100").to_df().vector
samples
```
----------------------------------------
TITLE: Prepare Parquet file for conversion to Lance
DESCRIPTION: Cleans up previous test files. Converts the Pandas DataFrame `df` to a PyArrow Table, then writes it to a Parquet file. Finally, it reads the Parquet file back into a PyArrow dataset and converts it to a Pandas DataFrame for display.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_3
LANGUAGE: Python
CODE:
```
shutil.rmtree("/tmp/test.parquet", ignore_errors=True)
shutil.rmtree("/tmp/test.lance", ignore_errors=True)
tbl = pa.Table.from_pandas(df)
pa.dataset.write_dataset(tbl, "/tmp/test.parquet", format='parquet')
parquet = pa.dataset.dataset("/tmp/test.parquet")
parquet.to_table().to_pandas()
```
----------------------------------------
TITLE: Access a specific historical version of Lance dataset (Version 2)
DESCRIPTION: Loads another specific historical version (version 2) of the Lance dataset and converts it to a Pandas DataFrame, further illustrating the versioning capabilities.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_11
LANGUAGE: Python
CODE:
```
lance.dataset('/tmp/test.lance', version=2).to_table().to_pandas()
```
----------------------------------------
TITLE: Lance I/O Trace Events
DESCRIPTION: Describes events emitted during significant I/O operations, particularly those related to indices, useful for debugging cache utilization.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/performance.md#_snippet_1
LANGUAGE: APIDOC
CODE:
```
Event: lance::io_events
Parameter: type
Description: The type of I/O operation (open_scalar_index, open_vector_index, load_vector_part, load_scalar_part)
```
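These events are emitted through Rust's `tracing` framework. One way to capture them from Python is pylance's chrome-trace helper; the helper name and keyword arguments below are assumptions taken from the Lance development docs, so treat this as a sketch rather than a confirmed API.
```
from lance.tracing import trace_to_chrome  # assumed pylance helper

# Write a Chrome-format trace (viewable in chrome://tracing or Perfetto)
# that should include lance::io_events spans emitted at debug level.
trace_to_chrome(level="debug", file="lance_trace.json")

# ... run the index build / query workload to be inspected ...
```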
----------------------------------------
TITLE: Import libraries and define dataset paths for Flickr8k
DESCRIPTION: This snippet imports essential Python libraries such as `os`, `cv2`, `lance`, `pyarrow`, `matplotlib`, and `tqdm`. It also defines the file paths for the Flickr8k captions file and the image dataset folder, which are crucial for subsequent data processing. It assumes the dataset and required libraries like pyarrow, pylance, opencv, and tqdm are already installed and present.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/flickr8k_dataset_creation.md#_snippet_0
LANGUAGE: python
CODE:
```
import os
import cv2
import random
import lance
import pyarrow as pa
import matplotlib.pyplot as plt
from tqdm.auto import tqdm
captions = "Flickr8k.token.txt"
image_folder = "Flicker8k_Dataset/"
```
----------------------------------------
TITLE: Build IVF_PQ index on Lance vector dataset
DESCRIPTION: Builds an IVF_PQ (Inverted File Index with Product Quantization) index on the 'vector' column of the `sift1m` dataset. It configures the index with 256 partitions and 16 sub-vectors for efficient approximate nearest neighbor search, significantly speeding up vector queries.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_19
LANGUAGE: Python
CODE:
```
%%time
sift1m.create_index(
    "vector",
    index_type="IVF_PQ",  # IVF_PQ, IVF_HNSW_PQ and IVF_HNSW_SQ are supported
    num_partitions=256,   # IVF
    num_sub_vectors=16,   # PQ
)
```
----------------------------------------
TITLE: Python Environment Setup for LanceDB Testing
DESCRIPTION: Sets up the Python environment by ensuring the project's root directory is added to sys.path and preventing bytecode generation. This is crucial for module imports within the project structure.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/full_report/report.ipynb#_snippet_0
LANGUAGE: python
CODE:
```
import sys
sys.dont_write_bytecode = True

import os
module_path = os.path.abspath(os.path.join('.'))
if module_path not in sys.path:
    sys.path.append(module_path)
```
----------------------------------------
TITLE: Add Metadata Columns to Lance Table
DESCRIPTION: Appends new feature columns, 'item_id' and 'revenue', to an existing Lance table. This illustrates how to enrich dataset entries with additional metadata before writing them back.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_23
LANGUAGE: python
CODE:
```
tbl = sift1m.to_table()
tbl = tbl.append_column("item_id", pa.array(range(len(tbl))))
tbl = tbl.append_column("revenue", pa.array((np.random.randn(len(tbl))+5)*1000))
tbl.to_pandas()
```
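In the notebook this enriched table is then written back so the feature-aware queries below can project "item_id" and "revenue" alongside the vectors. A minimal sketch of that write-back; the overwrite mode and variable names follow the surrounding snippets, but the exact call is an assumption since the original write-back snippet is not included here.
```
# Overwrite the dataset with the enriched table (uri = "vec_data.lance" above)
sift1m = lance.write_dataset(tbl, uri, mode="overwrite")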
----------------------------------------
TITLE: Build MacOS x86_64 Wheels
DESCRIPTION: This command builds release-mode wheels specifically for x86_64 MacOS. It uses `maturin` to compile the project for the `x86_64-apple-darwin` target, storing the resulting wheels in the 'wheels' directory.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_26
LANGUAGE: Shell
CODE:
```
maturin build --release \
--target x86_64-apple-darwin \
--out wheels
```
----------------------------------------
TITLE: Overwrite Lance dataset to create new version
DESCRIPTION: Creates a new Pandas DataFrame with different data. Converts it to a PyArrow Table and overwrites the existing Lance dataset at `/tmp/test.lance` using `mode="overwrite"`, effectively creating a new version of the dataset.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_7
LANGUAGE: Python
CODE:
```
df = pd.DataFrame({"a": [50, 100]})
tbl = pa.Table.from_pandas(df)
dataset = lance.write_dataset(tbl, "/tmp/test.lance", mode="overwrite")
```
----------------------------------------
TITLE: Run Dbpedia-entities-openai Benchmark
DESCRIPTION: This command executes the 'benchmarks.py' script to run top-k vector queries. The script tests various combinations of IVF and PQ values, as well as 'refine_factor', to evaluate performance. The example specifies a top-k value of 20.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/dbpedia-openai/README.md#_snippet_1
LANGUAGE: sh
CODE:
```
./benchmarks.py -k 20
```
----------------------------------------
TITLE: Build and Test Lance Rust Package
DESCRIPTION: These commands clone the Lance repository, navigate to the Rust directory, and then build, test, and benchmark the core Rust components of Lance.
SOURCE: https://github.com/lancedb/lance/blob/main/__wiki__/How-to-Build.md#_snippet_2
LANGUAGE: bash
CODE:
```
git clone https://github.com/lancedb/lance.git
# Build rust package
cd rust
cargo build
# Run test
cargo test
# Run benchmarks
cargo bench
```
----------------------------------------
TITLE: Query Lance Dataset with Simple SQL in Python DataFusion
DESCRIPTION: This Python example shows how to integrate Lance datasets with DataFusion using `FFILanceTableProvider` from `pylance`. It demonstrates registering a Lance dataset as a table and executing a basic SQL `SELECT` query to fetch the first 10 rows, highlighting the Python FFI integration.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/integrations/datafusion.md#_snippet_2
LANGUAGE: python
CODE:
```
from datafusion import SessionContext  # pip install datafusion
from lance import FFILanceTableProvider

ctx = SessionContext()

table1 = FFILanceTableProvider(
    my_lance_dataset, with_row_id=True, with_row_addr=True
)
ctx.register_table_provider("table1", table1)
ctx.table("table1")
ctx.sql("SELECT * FROM table1 LIMIT 10")
```
----------------------------------------
TITLE: Open a LanceDB Dataset
DESCRIPTION: Provides a basic example of how to open an existing Lance dataset using the `lance.dataset` function. This function can be used to access datasets stored locally or in cloud storage like S3.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/read_and_write.md#_snippet_11
LANGUAGE: python
CODE:
```
import lance
ds = lance.dataset("s3://bucket/path/imagenet.lance")
```
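When credentials or the region are not picked up from the environment, they can be passed explicitly. The `storage_options` parameter and the key names below are assumptions based on pylance's object-store configuration, not part of this snippet's source.
```
import lance

ds = lance.dataset(
    "s3://bucket/path/imagenet.lance",
    storage_options={          # assumed parameter and key names
        "region": "us-east-1",
        "access_key_id": "...",
        "secret_access_key": "...",
    },
)
```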
----------------------------------------
TITLE: Build LanceDB in development mode
DESCRIPTION: Builds the Rust native module in place using `maturin`. This command needs to be re-run whenever Rust code changes, but is not required for Python code modifications.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_0
LANGUAGE: shell
CODE:
```
maturin develop
```
----------------------------------------
TITLE: Lance File Audit Trace Events
DESCRIPTION: Details the events emitted when significant files are created or deleted in Lance, including the mode of I/O operation and the type of file affected.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/performance.md#_snippet_0
LANGUAGE: APIDOC
CODE:
```
Event: lance::file_audit
Parameter: mode
Description: The mode of I/O operation (create, delete, delete_unverified)
Parameter: type
Description: The type of file affected (manifest, data file, index file, deletion file)
```
----------------------------------------
TITLE: Download Lindera Language Model
DESCRIPTION: Command-line instruction to download a specific Lindera language model (e.g., ipadic, ko-dic, unidic) for LanceDB. Note that `lindera-cli` must be installed beforehand as Lindera models require compilation.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/tokenizer.md#_snippet_4
LANGUAGE: bash
CODE:
```
python -m lance.download lindera -l [ipadic|ko-dic|unidic]
```
----------------------------------------
TITLE: Access a specific historical version of Lance dataset (Version 1)
DESCRIPTION: Loads a specific historical version (version 1) of the Lance dataset and converts it to a Pandas DataFrame, demonstrating the ability to revert to or inspect past states of the data.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_10
LANGUAGE: Python
CODE:
```
lance.dataset('/tmp/test.lance', version=1).to_table().to_pandas()
```
----------------------------------------
TITLE: Decorate Rust Unit Test for Tracing
DESCRIPTION: To enable tracing for a Rust unit test, decorate it with the `#[lance_test_macros::test]` attribute. This macro wraps any existing test attributes, allowing tracing information to be collected during test execution.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_16
LANGUAGE: Rust
CODE:
```
#[lance_test_macros::test(tokio::test)]
async fn test() {
...
}
```
----------------------------------------
TITLE: Add Rust Toolchain Targets for Cross-Compilation
DESCRIPTION: To build manylinux wheels for different Linux architectures, you must first add the corresponding Rust toolchain targets. These commands add the x86_64 and aarch64 GNU targets, enabling cross-compilation.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_22
LANGUAGE: Shell
CODE:
```
rustup target add x86_64-unknown-linux-gnu
rustup target add aarch64-unknown-linux-gnu
```
----------------------------------------
TITLE: Query Vectors and Metadata Together in LanceDB
DESCRIPTION: This code demonstrates how to perform a nearest neighbor search in LanceDB while simultaneously retrieving specified metadata columns. It allows users to fetch both vector embeddings and associated feature data ('item_id', 'revenue') in a single query, streamlining data retrieval for applications requiring both.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/quickstart.md#_snippet_21
LANGUAGE: python
CODE:
```
result = sift1m.to_table(
    columns=["item_id", "revenue"],
    nearest={"column": "vector", "q": samples[0], "k": 10},
)
print(result.to_pandas())
```
----------------------------------------
TITLE: Build MacOS ARM64 Wheels
DESCRIPTION: This command builds release-mode wheels specifically for ARM64 (aarch64) MacOS. It uses `maturin` to compile the project for the `aarch64-apple-darwin` target, storing the resulting wheels in the 'wheels' directory.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_25
LANGUAGE: Shell
CODE:
```
maturin build --release \
--target aarch64-apple-darwin \
--out wheels
```
----------------------------------------
TITLE: Rust: WikiTextBatchReader Next Batch Logic
DESCRIPTION: This snippet shows the core logic for the `next` method of the `WikiTextBatchReader`. It attempts to build and retrieve the next Parquet reader from a list of available readers. If a reader is successfully built, it's used; otherwise, it handles errors or indicates that no more readers are available.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/rust/llm_dataset_creation.md#_snippet_1
LANGUAGE: Rust
CODE:
```
                if let Some(builder) = self.parquet_readers[self.current_reader_idx].take() {
                    match builder.build() {
                        Ok(reader) => {
                            self.current_reader = Some(Box::new(reader));
                            self.current_reader_idx += 1;
                            continue;
                        }
                        Err(e) => {
                            return Some(Err(arrow::error::ArrowError::ExternalError(Box::new(e))))
                        }
                    }
                }
            }
            // No more readers available
            return None;
        }
```
----------------------------------------
TITLE: Download and Extract SIFT1M Dataset
DESCRIPTION: Downloads the SIFT1M dataset, a common benchmark for vector search, and extracts its contents. This is a prerequisite step for running the subsequent vector search examples.
SOURCE: https://github.com/lancedb/lance/blob/main/README.md#_snippet_6
LANGUAGE: shell
CODE:
```
wget ftp://ftp.irisa.fr/local/texmex/corpus/sift.tar.gz
tar -xzf sift.tar.gz
```
----------------------------------------
TITLE: Measure Nearest Neighbor Query Performance
DESCRIPTION: Performs multiple nearest neighbor queries on the Lance dataset using a list of sample vectors and measures the average query time. It also prints the resulting table for the last query.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_21
LANGUAGE: python
CODE:
```
import time
tot = 0
for q in samples:
    start = time.time()
    tbl = sift1m.to_table(nearest={"column": "vector", "q": q, "k": 10})
    end = time.time()
    tot += (end - start)
print(f"Avg(sec): {tot / len(samples)}")
print(tbl.to_pandas())
```
----------------------------------------
TITLE: Run Rust Unit Test with Tracing Verbosity
DESCRIPTION: Execute a Rust unit test with tracing enabled by setting the `LANCE_TESTING` environment variable to a desired verbosity level (e.g., 'debug', 'info'). This command will generate a JSON trace file in your working directory, which can be viewed in Chrome or Perfetto.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_17
LANGUAGE: Bash
CODE:
```
LANCE_TESTING=debug cargo test dataset::tests::test_create_dataset
```
----------------------------------------
TITLE: Build Linux x86_64 Manylinux Wheels
DESCRIPTION: This command builds release-mode manylinux wheels for x86_64 Linux. It utilizes `maturin` with `zig` for cross-compilation, targeting `manylinux2014` compatibility, and outputs the generated wheels to the 'wheels' directory.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_23
LANGUAGE: Shell
CODE:
```
maturin build --release --zig \
--target x86_64-unknown-linux-gnu \
--compatibility manylinux2014 \
--out wheels
```
----------------------------------------
TITLE: Append new rows to an existing Lance dataset
DESCRIPTION: Creates a new Pandas DataFrame with a single row. Converts it to a PyArrow Table and appends it to the existing Lance dataset at `/tmp/test.lance` using `mode="append"`.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_6
LANGUAGE: Python
CODE:
```
df = pd.DataFrame({"a": [10]})
tbl = pa.Table.from_pandas(df)
dataset = lance.write_dataset(tbl, "/tmp/test.lance", mode="append")
dataset.to_table().to_pandas()
```
----------------------------------------
TITLE: Overwrite Lance Dataset
DESCRIPTION: This snippet demonstrates how to completely overwrite the data in a Lance dataset, effectively creating a new version. A new Pandas DataFrame is prepared and written to the dataset using `mode="overwrite"`, replacing the previous content while preserving the old version for historical access.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/quickstart.md#_snippet_6
LANGUAGE: python
CODE:
```
df = pd.DataFrame({"a": [50, 100]})
tbl = pa.Table.from_pandas(df)
dataset = lance.write_dataset(tbl, "/tmp/test.lance", mode="overwrite")
dataset.to_table().to_pandas()
```
----------------------------------------
TITLE: Lance Execution Trace Events
DESCRIPTION: Outlines events emitted when an execution plan is run, providing insights into query performance, including output rows, I/O operations, bytes read, and index statistics.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/performance.md#_snippet_2
LANGUAGE: APIDOC
CODE:
```
Event: lance::execution
Parameter: type
Description: The type of execution event (plan_run is the only type today)
Parameter: output_rows
Description: The number of rows in the output of the plan
Parameter: iops
Description: The number of I/O operations performed by the plan
Parameter: bytes_read
Description: The number of bytes read by the plan
Parameter: indices_loaded
Description: The number of indices loaded by the plan
Parameter: parts_loaded
Description: The number of index partitions loaded by the plan
Parameter: index_comparisons
Description: The number of comparisons performed inside the various indices
```
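As a hedged illustration, the same `trace_to_chrome` helper can capture these plan-level statistics for a scan or vector query; the dataset path and filter below are assumptions, and the level required to record execution events may differ.

```python
import lance
from lance.tracing import trace_to_chrome

trace_to_chrome(level="debug")

ds = lance.dataset("/tmp/test.lance")
# Running a scan executes a plan; a lance::execution event of type plan_run
# with output_rows, iops and bytes_read should appear in the trace file.
ds.to_table(filter="a > 5")
```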
----------------------------------------
TITLE: Example Console Output of CLIP Model Training Progress
DESCRIPTION: This snippet shows a typical console output during the training of the CLIP model. It displays the epoch number, the progress bar indicating batch processing, and the reported loss value for each epoch, demonstrating the training's progression and convergence.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/clip_training.md#_snippet_10
LANGUAGE: console
CODE:
```
==================== Epoch: 1 / 2 ====================
loss: 2.0799: 100%|██████████| 253/253 [02:14<00:00, 1.88it/s]
==================== Epoch: 2 / 2 ====================
loss: 1.3064: 100%|██████████| 253/253 [02:10<00:00, 1.94it/s]
```
----------------------------------------
TITLE: Convert SIFT Data to Lance Vector Dataset
DESCRIPTION: This code demonstrates how to convert the raw SIFT 1M dataset, stored in a binary format, into a Lance vector dataset. It involves reading the binary data, reshaping it into a NumPy array, and then using `vec_to_table` and `lance.write_dataset` to store it efficiently for vector search.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/quickstart.md#_snippet_12
LANGUAGE: python
CODE:
```
from lance.vector import vec_to_table
import struct
uri = "vec_data.lance"
with open("sift/sift_base.fvecs", mode="rb") as fobj:
    buf = fobj.read()
data = np.array(struct.unpack("<128000000f", buf[4 : 4 + 4 * 1000000 * 128])).reshape((1000000, 128))
dd = dict(zip(range(1000000), data))
table = vec_to_table(dd)
lance.write_dataset(table, uri, max_rows_per_group=8192, max_rows_per_file=1024*1024)
```
----------------------------------------
TITLE: Perform KNN Search on Lance Dataset (No Index)
DESCRIPTION: This snippet illustrates how to perform a K-Nearest Neighbors (KNN) search on a Lance dataset without utilizing an index. It measures the execution time to highlight the performance implications of a full dataset scan, demonstrating the need for ANN indexes in real-time scenarios.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/quickstart.md#_snippet_15
LANGUAGE: python
CODE:
```
import time
start = time.time()
tbl = sift1m.to_table(columns=["id"], nearest={"column": "vector", "q": samples[0], "k": 10})
end = time.time()
print(f"Time(sec): {end-start}")
print(tbl.to_pandas())
```
----------------------------------------
TITLE: Build Linux ARM64 Manylinux Wheels
DESCRIPTION: This command builds release-mode manylinux wheels for ARM64 (aarch64) Linux. It uses `maturin` with `zig` for cross-compilation, targeting `manylinux2014` compatibility, and places the output wheels in the 'wheels' directory.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_24
LANGUAGE: Shell
CODE:
```
maturin build --release --zig \
--target aarch64-unknown-linux-gnu \
--compatibility manylinux2014 \
--out wheels
```
----------------------------------------
TITLE: Overwrite Lance Dataset with New Features
DESCRIPTION: Writes the modified table, including newly added feature columns, back to the Lance dataset URI, overwriting the existing dataset. This updates the dataset with enriched data.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_24
LANGUAGE: python
CODE:
```
sift1m = lance.write_dataset(tbl, uri, mode="overwrite")
```
----------------------------------------
TITLE: Append Metadata Columns to LanceDB Dataset
DESCRIPTION: This Python snippet illustrates how to append additional metadata columns, such as 'item_id' and 'revenue', to an existing LanceDB dataset. This allows for storing and managing feature data alongside vector embeddings within the same dataset, simplifying data management.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/quickstart.md#_snippet_20
LANGUAGE: python
CODE:
```
tbl = sift1m.to_table()
tbl = tbl.append_column("item_id", pa.array(range(len(tbl))))
tbl = tbl.append_column("revenue", pa.array((np.random.randn(len(tbl))+5)*1000))
```
----------------------------------------
TITLE: Create Vector Index in LanceDB (IVF_PQ)
DESCRIPTION: This code demonstrates how to create a vector index on a LanceDB dataset. It specifies the vector column, index type (IVF_PQ, IVF_HNSW_PQ, IVF_HNSW_SQ are supported), number of partitions for IVF, and number of sub-vectors for PQ. This improves the efficiency of Approximate Nearest Neighbor (ANN) searches.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/quickstart.md#_snippet_16
LANGUAGE: python
CODE:
```
sift1m.create_index(
"vector",
index_type="IVF_PQ", # IVF_PQ, IVF_HNSW_PQ and IVF_HNSW_SQ are supported
num_partitions=256, # IVF
num_sub_vectors=16, # PQ
)
```
----------------------------------------
TITLE: Convert SIFT FVECS data to Lance vector dataset
DESCRIPTION: Imports `vec_to_table` from `lance.vector` and `struct`. Reads the SIFT base vectors from `sift_base.fvecs`, unpacks the binary data into a NumPy array, and converts it into a PyArrow Table using `vec_to_table`. Finally, it writes this table to a Lance dataset named `vec_data.lance`, optimizing for vector storage.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_15
LANGUAGE: Python
CODE:
```
from lance.vector import vec_to_table
import struct
uri = "vec_data.lance"
with open("sift/sift_base.fvecs", mode="rb") as fobj:
    buf = fobj.read()
data = np.array(struct.unpack("<128000000f", buf[4 : 4 + 4 * 1000000 * 128])).reshape((1000000, 128))
dd = dict(zip(range(1000000), data))
table = vec_to_table(dd)
lance.write_dataset(table, uri, max_rows_per_group=8192, max_rows_per_file=1024*1024)
```
----------------------------------------
TITLE: Read Lance Dataset in Java
DESCRIPTION: This Java snippet demonstrates how to open and access an existing Lance dataset. It uses `Dataset.open` with the dataset's path and a `BufferAllocator` to load the dataset. Once opened, it shows how to retrieve basic information such as row count, schema, and version details, providing a starting point for data querying and manipulation.
SOURCE: https://github.com/lancedb/lance/blob/main/java/README.md#_snippet_3
LANGUAGE: Java
CODE:
```
void readDataset() {
    String datasetPath = ""; // specify a path point to a dataset
    try (BufferAllocator allocator = new RootAllocator()) {
        try (Dataset dataset = Dataset.open(datasetPath, allocator)) {
            dataset.countRows();
            dataset.getSchema();
            dataset.version();
            dataset.latestVersion();
            // access more information
        }
    }
}
```
----------------------------------------
TITLE: Execute Python S3 Integration Tests
DESCRIPTION: Once local S3 services are running, this command executes the Python S3 integration tests using `pytest`. The `--run-integration` flag ensures that tests requiring external services are included, specifically targeting the `test_s3_ddb.py` file.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_21
LANGUAGE: Shell
CODE:
```
pytest --run-integration python/tests/test_s3_ddb.py
```
----------------------------------------
TITLE: Perform Random Access on Lance Dataset in Java
DESCRIPTION: This Java example demonstrates how to perform random access queries on a Lance dataset, retrieving specific rows and columns. It opens an existing dataset, specifies a list of row indices and desired column names, and then uses `dataset.take` to fetch the corresponding data. The results are processed using an `ArrowReader` to iterate through batches and access individual field values.
SOURCE: https://github.com/lancedb/lance/blob/main/java/README.md#_snippet_5
LANGUAGE: Java
CODE:
```
void randomAccess() {
    String datasetPath = ""; // specify a path point to a dataset
    try (BufferAllocator allocator = new RootAllocator()) {
        try (Dataset dataset = Dataset.open(datasetPath, allocator)) {
            List<Long> indices = Arrays.asList(1L, 4L);
            List<String> columns = Arrays.asList("id", "name");
            try (ArrowReader reader = dataset.take(indices, columns)) {
                while (reader.loadNextBatch()) {
                    VectorSchemaRoot result = reader.getVectorSchemaRoot();
                    result.getRowCount();
                    for (int i = 0; i < indices.size(); i++) {
                        result.getVector("id").getObject(i);
                        result.getVector("name").getObject(i);
                    }
                }
            }
        }
    }
}
```
----------------------------------------
TITLE: Load Subset of Lance Dataset with Projection and Predicates
DESCRIPTION: This Python example illustrates how to efficiently load a subset of a Lance dataset into memory. It utilizes column projection (`columns`), filter push-down (`filter`), and pagination (`limit`, `offset`) to optimize data retrieval for large datasets by reducing I/O.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/read_and_write.md#_snippet_14
LANGUAGE: python
CODE:
```
table = ds.to_table(
columns=["image", "label"],
filter="label = 2 AND text IS NOT NULL",
limit=1000,
offset=3000)
```
----------------------------------------
TITLE: Create PyTorch DataLoader from LanceDataset (Unsafe)
DESCRIPTION: This example shows how to load a Lance dataset into a PyTorch `IterableDataset` using `lance.torch.data.LanceDataset` and then create a standard PyTorch `DataLoader`. It highlights an inference loop, but notes that this approach is not fork-safe for multiprocessing.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/integrations/pytorch.md#_snippet_1
LANGUAGE: python
CODE:
```
import torch
import lance.torch.data
# Load lance dataset into a PyTorch IterableDataset.
# with only columns "image" and "prompt".
dataset = lance.torch.data.LanceDataset(
"diffusiondb_train.lance",
columns=["image", "prompt"],
batch_size=128,
batch_readahead=8, # Control multi-threading reads.
)
# Create a PyTorch DataLoader
dataloader = torch.utils.data.DataLoader(dataset)
# Inference loop
for batch in dataloader:
    inputs, targets = batch["prompt"], batch["image"]
    outputs = model(inputs)
    ...
```
----------------------------------------
TITLE: Manage LanceDB Dataset Tags (Create, Update, Delete, List)
DESCRIPTION: This Python example demonstrates how to interact with `LanceDataset.tags` to manage dataset versions. It covers creating a tag for a specific version, updating its associated version, listing all tags, and finally deleting a tag. It also shows how `list_ordered()` can be used to retrieve tags in the order they were created or last updated.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/tags.md#_snippet_0
LANGUAGE: python
CODE:
```
import lance
ds = lance.dataset("./tags.lance")
print(len(ds.versions()))
# 2
print(ds.tags.list())
# {}
ds.tags.create("v1-prod", 1)
print(ds.tags.list())
# {'v1-prod': {'version': 1, 'manifest_size': ...}}
ds.tags.update("v1-prod", 2)
print(ds.tags.list())
# {'v1-prod': {'version': 2, 'manifest_size': ...}}
ds.tags.delete("v1-prod")
print(ds.tags.list())
# {}
print(ds.tags.list_ordered())
# []
ds.tags.create("v1-prod", 1)
print(ds.tags.list_ordered())
# [('v1-prod', {'version': 1, 'manifest_size': ...})]
ds.tags.update("v1-prod", 2)
print(ds.tags.list_ordered())
# [('v1-prod', {'version': 2, 'manifest_size': ...})]
ds.tags.delete("v1-prod")
print(ds.tags.list_ordered())
# []
```
----------------------------------------
TITLE: Write Pandas DataFrame to Lance Dataset
DESCRIPTION: Removes any existing Lance dataset at `/tmp/test.lance` to ensure a clean write. Then, it writes the Pandas DataFrame `df` to a new Lance dataset and converts the resulting dataset back to a Pandas DataFrame for verification.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_2
LANGUAGE: Python
CODE:
```
shutil.rmtree("/tmp/test.lance", ignore_errors=True)
dataset = lance.write_dataset(df, "/tmp/test.lance")
dataset.to_table().to_pandas()
```
----------------------------------------
TITLE: Perform K-Nearest Neighbors search without an index
DESCRIPTION: Measures the time taken to perform a K-Nearest Neighbors (KNN) search on the `sift1m` dataset. It queries for the 10 nearest neighbors to the first sampled vector (`samples[0]`) based on the 'vector' column, demonstrating a full scan approach.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_18
LANGUAGE: Python
CODE:
```
import time
start = time.time()
tbl = sift1m.to_table(columns=["id"], nearest={"column": "vector", "q": samples[0], "k": 10})
end = time.time()
print(f"Time(sec): {end-start}")
print(tbl.to_pandas())
```
----------------------------------------
TITLE: Write Pandas DataFrame to Lance Dataset
DESCRIPTION: This snippet shows how to persist a Pandas DataFrame into a Lance dataset. It first ensures a clean state by removing any existing file and then uses `lance.write_dataset` to save the DataFrame, followed by reading it back to confirm the write operation.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/quickstart.md#_snippet_2
LANGUAGE: python
CODE:
```
shutil.rmtree("/tmp/test.lance", ignore_errors=True)
dataset = lance.write_dataset(df, "/tmp/test.lance")
dataset.to_table().to_pandas()
```
----------------------------------------
TITLE: Join Multiple Lance Datasets with SQL in Rust DataFusion
DESCRIPTION: This Rust example illustrates how to register multiple Lance datasets (e.g., 'orders' and 'customers') as separate tables in DataFusion. It then performs a SQL `JOIN` operation between these tables to combine data based on a common key, demonstrating more complex query capabilities.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/integrations/datafusion.md#_snippet_1
LANGUAGE: rust
CODE:
```
use datafusion::prelude::SessionContext;
use crate::datafusion::LanceTableProvider;
let ctx = SessionContext::new();
ctx.register_table("orders",
Arc::new(LanceTableProvider::new(
Arc::new(orders_dataset.clone()),
/* with_row_id */ false,
/* with_row_addr */ false,
)))?;
ctx.register_table("customers",
Arc::new(LanceTableProvider::new(
Arc::new(customers_dataset.clone()),
/* with_row_id */ false,
/* with_row_addr */ false,
)))?;
let df = ctx.sql("
SELECT o.order_id, o.amount, c.customer_name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
LIMIT 10
").await?;
let result = df.collect().await?;
```
----------------------------------------
TITLE: Read ImageURIs into Lance EncodedImageArray
DESCRIPTION: This example shows how to use `ImageURIArray.read_uris()` to load images referenced by URIs into memory. The method returns an `EncodedImageArray` containing the binary data of the images, enabling direct processing of image content.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/arrays.md#_snippet_4
LANGUAGE: python
CODE:
```
from lance.arrow import ImageURIArray
relative_path = "images/1.png"
uris = [os.path.join(os.path.dirname(__file__), relative_path)]
ImageURIArray.from_uris(uris).read_uris()
# <lance.arrow.EncodedImageArray object at 0x...>
# [b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x00...']
```
----------------------------------------
TITLE: Create and Write Lance Dataset from Arrow Stream in Java
DESCRIPTION: This Java example illustrates how to create a Lance dataset and populate it with data from an existing Arrow file. It reads bytes from a source path, converts them into an `ArrowArrayStream`, and then uses `Dataset.create` with `WriteParams` to configure writing options like `maxRowsPerFile`, `maxRowsPerGroup`, and `WriteMode`. This method is suitable for ingesting data from Arrow-formatted sources.
SOURCE: https://github.com/lancedb/lance/blob/main/java/README.md#_snippet_2
LANGUAGE: Java
CODE:
```
void createAndWriteDataset() throws IOException, URISyntaxException {
    Path path = Path.of(""); // the original source path
    String datasetPath = ""; // specify a path point to a dataset
    try (BufferAllocator allocator = new RootAllocator();
        ArrowFileReader reader =
            new ArrowFileReader(
                new SeekableReadChannel(
                    new ByteArrayReadableSeekableByteChannel(Files.readAllBytes(path))), allocator);
        ArrowArrayStream arrowStream = ArrowArrayStream.allocateNew(allocator)) {
        Data.exportArrayStream(allocator, reader, arrowStream);
        try (Dataset dataset =
            Dataset.create(
                allocator,
                arrowStream,
                datasetPath,
                new WriteParams.Builder()
                    .withMaxRowsPerFile(10)
                    .withMaxRowsPerGroup(20)
                    .withMode(WriteParams.WriteMode.CREATE)
                    .withStorageOptions(new HashMap<>())
                    .build())) {
            // access dataset
        }
    }
}
```
----------------------------------------
TITLE: Generate Flame Graph from Process ID
DESCRIPTION: Generates a flame graph for a running process using its Process ID (PID). This command is used to capture and visualize CPU profiles, helping to identify performance bottlenecks in an application.
SOURCE: https://github.com/lancedb/lance/blob/main/__wiki__/Debug.md#_snippet_5
LANGUAGE: sh
CODE:
```
flamegraph -p <PID>
```
----------------------------------------
TITLE: Create Lance BFloat16 Arrow Array
DESCRIPTION: This example illustrates how to construct a `BFloat16Array` directly using the `lance.arrow.bfloat16_array` function. It takes a list of floating-point numbers and converts them into an Arrow array with BFloat16 precision, suitable for Arrow-based data processing.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/arrays.md#_snippet_1
LANGUAGE: python
CODE:
```
from lance.arrow import bfloat16_array
bfloat16_array([1.1, 2.1, 3.4])
# <lance.arrow.BFloat16Array object at 0x000000016feb94e0>
# [
# 1.1015625,
# 2.09375,
# 3.40625
# ]
```
----------------------------------------
TITLE: Clone LanceDB GitHub Repository
DESCRIPTION: Instructions to clone the LanceDB project repository from GitHub to a local machine. This is the first step for setting up the development environment.
SOURCE: https://github.com/lancedb/lance/blob/main/java/README.md#_snippet_11
LANGUAGE: shell
CODE:
```
git clone https://github.com/lancedb/lance.git
```
----------------------------------------
TITLE: Rust Implementation of WikiTextBatchReader
DESCRIPTION: This Rust code defines `WikiTextBatchReader`, a custom implementation of `arrow::record_batch::RecordBatchReader`. It's designed to read text data from Parquet files, tokenize it using a `Tokenizer` from the `tokenizers` crate, and transform it into Arrow `RecordBatch`es. The `process_batch` method handles tokenization, limits the number of samples, and shuffles the tokenized IDs before creating the final `RecordBatch`.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/rust/llm_dataset_creation.md#_snippet_0
LANGUAGE: rust
CODE:
```
use arrow::array::{Array, Int64Builder, ListBuilder, UInt32Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use arrow::record_batch::RecordBatchReader;
use futures::StreamExt;
use hf_hub::{api::sync::Api, Repo, RepoType};
use lance::dataset::WriteParams;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use rand::seq::SliceRandom;
use rand::SeedableRng;
use std::error::Error;
use std::fs::File;
use std::io::Write;
use std::sync::Arc;
use tempfile::NamedTempFile;
use tokenizers::Tokenizer;
// Implement a custom stream batch reader
struct WikiTextBatchReader {
schema: Arc<Schema>,
parquet_readers: Vec<Option<ParquetRecordBatchReaderBuilder<File>>>,
current_reader_idx: usize,
current_reader: Option<Box<dyn RecordBatchReader + Send>>,
tokenizer: Tokenizer,
num_samples: u64,
cur_samples_cnt: u64,
}
impl WikiTextBatchReader {
fn new(
parquet_readers: Vec<ParquetRecordBatchReaderBuilder<File>>,
tokenizer: Tokenizer,
num_samples: Option<u64>,
) -> Result<Self, Box<dyn Error + Send + Sync>> {
let schema = Arc::new(Schema::new(vec![Field::new(
"input_ids",
DataType::List(Arc::new(Field::new("item", DataType::Int64, true))),
false,
)]));
Ok(Self {
schema,
parquet_readers: parquet_readers.into_iter().map(Some).collect(),
current_reader_idx: 0,
current_reader: None,
tokenizer,
num_samples: num_samples.unwrap_or(100_000),
cur_samples_cnt: 0,
})
}
fn process_batch(
&mut self,
input_batch: &RecordBatch,
) -> Result<RecordBatch, arrow::error::ArrowError> {
let num_rows = input_batch.num_rows();
let mut token_builder = ListBuilder::new(Int64Builder::with_capacity(num_rows * 1024)); // Pre-allocate space
let mut should_break = false;
let column = input_batch.column_by_name("text").unwrap();
let string_array = column
.as_any()
.downcast_ref::<arrow::array::StringArray>()
.unwrap();
for i in 0..num_rows {
if self.cur_samples_cnt >= self.num_samples {
should_break = true;
break;
}
if !Array::is_null(string_array, i) {
let text = string_array.value(i);
// Split paragraph into lines
for line in text.split('\n') {
if let Ok(encoding) = self.tokenizer.encode(line, true) {
let tb_values = token_builder.values();
for &id in encoding.get_ids() {
tb_values.append_value(id as i64);
}
token_builder.append(true);
self.cur_samples_cnt += 1;
if self.cur_samples_cnt % 5000 == 0 {
println!("Processed {} rows", self.cur_samples_cnt);
}
if self.cur_samples_cnt >= self.num_samples {
should_break = true;
break;
}
}
}
}
}
// Create array and shuffle it
let input_ids_array = token_builder.finish();
// Create shuffled array by randomly sampling indices
let mut rng = rand::rngs::StdRng::seed_from_u64(1337);
let len = input_ids_array.len();
let mut indices: Vec<u32> = (0..len as u32).collect();
indices.shuffle(&mut rng);
// Take values in shuffled order
let indices_array = UInt32Array::from(indices);
let shuffled = arrow::compute::take(&input_ids_array, &indices_array, None)?;
let batch = RecordBatch::try_new(self.schema.clone(), vec![Arc::new(shuffled)]);
if should_break {
println!("Stop at {} rows", self.cur_samples_cnt);
self.parquet_readers.clear();
self.current_reader = None;
}
batch
}
}
impl RecordBatchReader for WikiTextBatchReader {
fn schema(&self) -> Arc<Schema> {
self.schema.clone()
}
}
impl Iterator for WikiTextBatchReader {
type Item = Result<RecordBatch, arrow::error::ArrowError>;
fn next(&mut self) -> Option<Self::Item> {
loop {
// If we have a current reader, try to get next batch
if let Some(reader) = &mut self.current_reader {
if let Some(batch_result) = reader.next() {
return Some(batch_result.and_then(|batch| self.process_batch(&batch)));
}
}
// If no current reader or current reader is exhausted, try to get next reader
if self.current_reader_idx < self.parquet_readers.len() {
```
----------------------------------------
TITLE: Inefficient Row Update by Iteration
DESCRIPTION: Provides an example of an inefficient way to update multiple individual rows by iterating through a table and calling `update` for each row. It notes that a merge insert operation is generally more efficient for bulk updates.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/read_and_write.md#_snippet_6
LANGUAGE: python
CODE:
```
import lance
import pyarrow as pa
# Change the ages of both Alice and Bob
new_table = pa.Table.from_pylist([{"name": "Alice", "age": 30},
                                  {"name": "Bob", "age": 20}])
# This works, but is inefficient, see below for a better approach
dataset = lance.dataset("./alice_and_bob.lance")
for idx in range(new_table.num_rows):
    name = new_table[0][idx].as_py()
    new_age = new_table[1][idx].as_py()
    dataset.update({"age": new_age}, where=f"name='{name}'")
```
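For comparison, a sketch of the bulk approach the comment above alludes to, using the dataset's merge-insert builder (treat the exact builder methods as an assumption if your pylance version differs):

```python
import lance
import pyarrow as pa

new_table = pa.Table.from_pylist([{"name": "Alice", "age": 30},
                                  {"name": "Bob", "age": 20}])
dataset = lance.dataset("./alice_and_bob.lance")
# Match rows on "name" and update all matched rows in one operation,
# instead of issuing a separate update() call per row.
dataset.merge_insert("name") \
    .when_matched_update_all() \
    .execute(new_table)
```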
----------------------------------------
TITLE: Generate and Merge Columns in Parallel with Ray and Lance
DESCRIPTION: This example illustrates how to generate new columns in parallel using Ray and Lance. It defines an Arrow schema, creates an initial dataset with 'id', 'height', and 'weight' columns, and then uses a custom Python function (`generate_labels`) to add a new 'size_labels' column based on existing 'height' data, demonstrating Lance's `add_columns` functionality for parallel processing.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/integrations/ray.md#_snippet_1
LANGUAGE: python
CODE:
```
import ray
import pyarrow as pa
from pathlib import Path
import lance

# Destination for the demo dataset (defined elsewhere in the original example;
# this value is only illustrative)
output_path = Path("ray_lance_demo.lance")

# Define schema
schema = pa.schema([
    pa.field("id", pa.int64()),
    pa.field("height", pa.int64()),
    pa.field("weight", pa.int64()),
])

# Generate initial dataset
ds = (
    ray.data.range(10)  # Create 0-9 IDs
    .map(lambda x: {
        "id": x["id"],
        "height": x["id"] + 5,  # height = id + 5
        "weight": x["id"] * 2   # weight = id * 2
    })
    .write_lance(str(output_path), schema=schema)
)

# Define label generation logic
def generate_labels(batch: pa.RecordBatch) -> pa.RecordBatch:
    heights = batch.column("height").to_pylist()
    size_labels = ["tall" if h > 8 else "medium" if h > 6 else "short" for h in heights]
    return pa.RecordBatch.from_arrays([
        pa.array(size_labels)
    ], names=["size_labels"])

# Add new columns in parallel
# (add_columns is the parallel column-add helper from Lance's Ray integration;
# its import is elided in the original example)
lance_ds = lance.dataset(output_path)
add_columns(
    lance_ds,
    generate_labels,
    source_columns=["height"],  # Input columns needed
)

# Display final results
final_df = lance_ds.to_table().to_pandas()
print("\nEnhanced dataset with size labels:\n")
print(final_df.sort_values("id").to_string(index=False))
```
----------------------------------------
TITLE: Configure Python Benchmark for Single Iteration Tracing
DESCRIPTION: When tracing Python benchmarks, it's often useful to force them to run only once for sensible results. This snippet demonstrates how to use the `pedantic` API to limit a benchmark to a single iteration and round, ensuring a focused trace.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_19
LANGUAGE: Python
CODE:
```
def run():
"Put code to benchmark here"
...
benchmark.pedantic(run, iterations=1, rounds=1)
```
----------------------------------------
TITLE: Enable Tracing for Python Script
DESCRIPTION: To trace a Python script, import the `trace_to_chrome` function from `lance.tracing` and call it at the beginning of your script, specifying the desired tracing level. A single JSON trace file will be generated upon the script's exit, suitable for Chrome's trace viewer.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_18
LANGUAGE: Python
CODE:
```
from lance.tracing import trace_to_chrome
trace_to_chrome(level="debug")
# rest of script
```
----------------------------------------
TITLE: LanceDB Encoding Metadata Key Specifications
DESCRIPTION: This section provides a detailed specification of the metadata keys used in LanceDB for column-level encoding. It describes each key's type, purpose, example values, and how it's used in Python to configure data storage and optimization.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/format/format.md#_snippet_8
LANGUAGE: APIDOC
CODE:
```
Metadata Key Specifications:
- lance-encoding:compression
Type: Compression
Description: Specifies compression algorithm
Example Values: zstd
Example Usage (Python): metadata={"lance-encoding:compression": "zstd"}
- lance-encoding:compression-level
Type: Compression
Description: Zstd compression level (1-22)
Example Values: 3
Example Usage (Python): metadata={"lance-encoding:compression-level": "3"}
- lance-encoding:blob
Type: Storage
Description: Marks binary data (>4MB) for chunked storage
Example Values: true/false
Example Usage (Python): metadata={"lance-encoding:blob": "true"}
- lance-encoding:packed
Type: Optimization
Description: Struct memory layout optimization
Example Values: true/false
Example Usage (Python): metadata={"lance-encoding:packed": "true"}
- lance-encoding:structural-encoding
Type: Nested Data
Description: Encoding strategy for nested structures
Example Values: miniblock/fullzip
Example Usage (Python): metadata={"lance-encoding:structural-encoding": "miniblock"}
```
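A brief illustration of how these keys are typically supplied from Python: they are attached as field-level metadata on the Arrow schema before writing the dataset. The column names and output path below are only examples.

```python
import lance
import pyarrow as pa

# Attach encoding hints as field-level metadata.
schema = pa.schema([
    pa.field("id", pa.int64()),
    pa.field(
        "text",
        pa.string(),
        metadata={
            "lance-encoding:compression": "zstd",
            "lance-encoding:compression-level": "3",
        },
    ),
])
table = pa.table({"id": [1, 2], "text": ["hello", "world"]}, schema=schema)
lance.write_dataset(table, "/tmp/encoded_demo.lance", mode="overwrite")
```

The other keys (`lance-encoding:blob`, `lance-encoding:packed`, `lance-encoding:structural-encoding`) are supplied the same way, on the field they should apply to.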
----------------------------------------
TITLE: Initialize Tokenizer and Load Wikitext Dataset (Python)
DESCRIPTION: This snippet initializes a Hugging Face tokenizer (gpt2) and loads the wikitext-103-raw-v1 dataset in streaming mode. The 'streaming=True' argument is crucial for processing large datasets without downloading the entire dataset upfront, allowing samples to be downloaded as needed.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/llm_dataset_creation.md#_snippet_0
LANGUAGE: python
CODE:
```
import lance
import pyarrow as pa
from datasets import load_dataset
from transformers import AutoTokenizer
from tqdm.auto import tqdm # optional for progress tracking
tokenizer = AutoTokenizer.from_pretrained('gpt2')
dataset = load_dataset('wikitext', 'wikitext-103-raw-v1', streaming=True)['train']
dataset = dataset.shuffle(seed=1337)
```
----------------------------------------
TITLE: Example of Hierarchical Schema Definition
DESCRIPTION: This snippet demonstrates a sample schema definition within the LanceDB data format, showcasing primitive types, nested structs, and lists. It illustrates how complex data structures are defined before being flattened into a field list for metadata representation.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/format/format.md#_snippet_6
LANGUAGE: APIDOC
CODE:
```
a: i32
b: struct {
c: list<i32>
d: i32
}
```
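For reference, a sketch of the same hierarchy expressed as a PyArrow schema, which Lance then flattens into the field list used in the metadata:

```python
import pyarrow as pa

schema = pa.schema([
    pa.field("a", pa.int32()),
    pa.field("b", pa.struct([
        pa.field("c", pa.list_(pa.int32())),
        pa.field("d", pa.int32()),
    ])),
])
print(schema)
```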
----------------------------------------
TITLE: Define Custom PyTorch Dataset for Lance Data
DESCRIPTION: The `LanceDataset` class extends PyTorch's `Dataset` to provide an interface for loading data from a Lance dataset. It initializes by loading the specified Lance dataset and setting a `block_size` for token windows. The `__len__` method calculates the total number of possible starting indices, while `__getitem__` generates a window of indices and uses the `from_indices` utility to load and return corresponding 'input_ids' and 'labels' as PyTorch tensors, forming a causal sample for LLM training.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/llm_training.md#_snippet_2
LANGUAGE: python
CODE:
```
class LanceDataset(Dataset):
    def __init__(
        self,
        dataset_path,
        block_size,
    ):
        # Load the lance dataset from the saved path
        self.ds = lance.dataset(dataset_path)
        self.block_size = block_size
        # Doing this so the sampler never asks for an index at the end of text
        self.length = self.ds.count_rows() - block_size

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        """
        Generate a window of indices starting from the current idx to idx+block_size
        and return the tokens at those indices
        """
        window = np.arange(idx, idx + self.block_size)
        sample = from_indices(self.ds, window)
        return {"input_ids": torch.tensor(sample), "labels": torch.tensor(sample)}
```
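The `from_indices` helper referenced above is defined elsewhere in the example. A minimal sketch of what it might look like, assuming each row of the Lance dataset stores a single token in an `input_ids` column:

```python
import lance

def from_indices(dataset: lance.LanceDataset, indices) -> list:
    # Hypothetical helper: fetch the tokens stored at the given row indices.
    batch = dataset.take(list(indices), columns=["input_ids"])
    return batch["input_ids"].to_pylist()
```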
----------------------------------------
TITLE: Complex SQL Filter Expression for Lance Dataset
DESCRIPTION: This SQL snippet provides an example of a complex filter expression that can be pushed down to the Lance storage system. It demonstrates the use of `IN`, `AND`, `OR`, `NOT`, and nested field access for filtering data efficiently at the storage layer.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/read_and_write.md#_snippet_16
LANGUAGE: sql
CODE:
```
((label IN [10, 20]) AND (note['email'] IS NOT NULL))
OR NOT note['created']
```
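A hedged example of how such an expression might be applied: it is passed as the `filter` string when scanning the dataset, and Lance pushes it down to the storage layer (the dataset path and column names are assumptions here).

```python
import lance

ds = lance.dataset("/tmp/labels.lance")
table = ds.to_table(
    filter=(
        "((label IN [10, 20]) AND (note['email'] IS NOT NULL)) "
        "OR NOT note['created']"
    )
)
```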
----------------------------------------
TITLE: Tune ANN Search Parameters in LanceDB (nprobes, refine_factor)
DESCRIPTION: This code demonstrates how to tune the performance of an Approximate Nearest Neighbor (ANN) search in LanceDB by adjusting 'nprobes' and 'refine_factor'. 'nprobes' controls the number of IVF partitions to search, while 'refine_factor' determines how many vectors are retrieved for re-ranking, balancing latency and recall.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/quickstart.md#_snippet_18
LANGUAGE: python
CODE:
```
%%time
sift1m.to_table(
nearest={
"column": "vector",
"q": samples[0],
"k": 10,
"nprobes": 10,
"refine_factor": 5,
}
).to_pandas()
```
----------------------------------------
TITLE: Querying Lance Datasets with DuckDB in Python
DESCRIPTION: This snippet demonstrates how to perform SQL queries on a Lance dataset using DuckDB in Python. It shows examples of selecting all data and calculating the mean of a column, illustrating DuckDB's direct access to Lance datasets via Arrow compatibility.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/integrations/duckdb.md#_snippet_0
LANGUAGE: Python
CODE:
```
import duckdb # pip install duckdb
duckdb.query("SELECT * FROM my_lance_dataset")
# ┌─────────────┬─────────┬────────┐
# │ vector │ item │ price │
# │ float[] │ varchar │ double │
# ├─────────────┼─────────┼────────┤
# │ [3.1, 4.1] │ foo │ 10.0 │
# │ [5.9, 26.5] │ bar │ 20.0 │
# └─────────────┴─────────┴────────┘
duckdb.query("SELECT mean(price) FROM my_lance_dataset")
# ┌─────────────┐
# │ mean(price) │
# │ double │
# ├─────────────┤
# │ 15.0 │
# └─────────────┘
```
----------------------------------------
TITLE: Use Sharded Sampler with LanceDataset for Distributed Training
DESCRIPTION: This example illustrates how to integrate `lance.sampler.ShardedFragmentSampler` with `LanceDataset` to control the data sampling strategy for distributed training environments. It shows how to configure the sampler with the current process's rank and the total number of processes (world size) for sharded data access.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/integrations/pytorch.md#_snippet_3
LANGUAGE: python
CODE:
```
from lance.sampler import ShardedFragmentSampler
from lance.torch.data import LanceDataset
# Load lance dataset into a PyTorch IterableDataset.
# with only columns "image" and "prompt".
dataset = LanceDataset(
"diffusiondb_train.lance",
columns=["image", "prompt"],
batch_size=128,
batch_readahead=8, # Control multi-threading reads.
sampler=ShardedFragmentSampler(
rank=1, # Rank of the current process
world_size=8, # Total number of processes
),
)
```
----------------------------------------
TITLE: Filter and Select Columns from Lance Dataset in TensorFlow
DESCRIPTION: This example illustrates efficient data loading from Lance into TensorFlow by specifying desired columns and applying filter conditions. It leverages Lance's columnar format for optimized data retrieval, reducing memory and processing overhead.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/integrations/tensorflow.md#_snippet_1
LANGUAGE: python
CODE:
```
ds = lance.tf.data.from_lance(
"s3://my-bucket/my-dataset",
columns=["image", "label"],
filter="split = 'train' AND collected_time > timestamp '2020-01-01'",
batch_size=256)
```
----------------------------------------
TITLE: Python: Decode EncodedImageArray to FixedShapeImageTensorArray
DESCRIPTION: This Python example demonstrates how to load images from URIs into an `ImageURIArray`, read them into an `EncodedImageArray`, and then decode them into a `FixedShapeImageTensorArray`. It also illustrates how to provide a custom TensorFlow-based decoder function for the `to_tensor` method, allowing for flexible image processing.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/arrays.md#_snippet_5
LANGUAGE: python
CODE:
```
from lance.arrow import ImageURIArray
uris = [os.path.join(os.path.dirname(__file__), "images/1.png")]
encoded_images = ImageURIArray.from_uris(uris).read_uris()
print(encoded_images.to_tensor())
def tensorflow_decoder(images):
    import tensorflow as tf
    import numpy as np
    return np.stack([tf.io.decode_png(img.as_py(), channels=3) for img in images.storage])
print(encoded_images.to_tensor(tensorflow_decoder))
```
----------------------------------------
TITLE: Add and Populate Columns with Python UDF in Lance
DESCRIPTION: Shows how to add and populate new columns in a Lance dataset using a custom Python function (UDF). The UDF processes data in batches, and the example includes using `lance.batch_udf` with checkpointing for robust, expensive computations.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/data_evolution.md#_snippet_2
LANGUAGE: python
CODE:
```
import lance
import pandas as pd
import pyarrow as pa
import numpy as np
table = pa.table({"id": pa.array([1, 2, 3])})
dataset = lance.write_dataset(table, "ids")
@lance.batch_udf(checkpoint_file="embedding_checkpoint.sqlite")
def add_random_vector(batch):
    embeddings = np.random.rand(batch.num_rows, 128).astype("float32")
    return pd.DataFrame({"embedding": embeddings})
dataset.add_columns(add_random_vector)
```
----------------------------------------
TITLE: Construct OpenAI prompt with context
DESCRIPTION: Defines a function `create_prompt` that takes a query and contextual information to build a structured prompt for a large language model. It dynamically appends context, ensuring the total prompt length stays within a specified token limit for the LLM.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/youtube_transcript_search.ipynb#_snippet_10
LANGUAGE: python
CODE:
```
def create_prompt(query, context):
    limit = 3750
    prompt_start = (
        "Answer the question based on the context below.\n\n"+
        "Context:\n"
    )
    prompt_end = (
        f"\n\nQuestion: {query}\nAnswer:"
    )
    # append contexts until hitting limit
    for i in range(1, len(context)):
        if len("\n\n---\n\n".join(context.text[:i])) >= limit:
            prompt = (
                prompt_start +
                "\n\n---\n\n".join(context.text[:i-1]) +
                prompt_end
            )
            break
        elif i == len(context)-1:
            prompt = (
                prompt_start +
                "\n\n---\n\n".join(context.text) +
                prompt_end
            )
    return prompt
```
----------------------------------------
TITLE: Set DYLD_LIBRARY_PATH for Lance Python Debugging in LLDB
DESCRIPTION: Configures the `DYLD_LIBRARY_PATH` environment variable specifically for debugging Lance Python projects within LLDB. This ensures that the dynamic linker can find necessary shared libraries located in the third-party distribution directory.
SOURCE: https://github.com/lancedb/lance/blob/main/__wiki__/Debug.md#_snippet_1
LANGUAGE: lldb
CODE:
```
# /path/to/lance/python/.lldbinit
env DYLD_LIBRARY_PATH=/path/to/thirdparty/dist/lib:${DYLD_LIBRARY_PATH}
```
----------------------------------------
TITLE: Rename Top-Level Columns in LanceDB Dataset
DESCRIPTION: This snippet illustrates how to rename top-level columns in a LanceDB dataset using the `lance.LanceDataset.alter_columns` method. It shows a simple example of changing a column name and verifying the change by printing the dataset as a Pandas DataFrame.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/data_evolution.md#_snippet_5
LANGUAGE: python
CODE:
```
table = pa.table({"id": pa.array([1, 2, 3])})
dataset = lance.write_dataset(table, "ids")
dataset.alter_columns({"path": "id", "name": "new_id"})
print(dataset.to_table().to_pandas())
# new_id
# 0 1
# 1 2
# 2 3
```
----------------------------------------
TITLE: Python: Encode FixedShapeImageTensorArray to EncodedImageArray
DESCRIPTION: This Python example shows how to convert a `FixedShapeImageTensorArray` back into an `EncodedImageArray`. It first obtains a tensor array by decoding an `EncodedImageArray` (which was read from URIs) and then calls the `to_encoded()` method. This process is useful for saving processed images back into a compressed format.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/arrays.md#_snippet_6
LANGUAGE: python
CODE:
```
from lance.arrow import ImageURIArray
uris = [image_uri]
tensor_images = ImageURIArray.from_uris(uris).read_uris().to_tensor()
tensor_images.to_encoded()
```
----------------------------------------
TITLE: Initialize LLM Training Environment with GPT2 and Lance
DESCRIPTION: This snippet imports essential libraries for LLM training, including Lance, PyTorch, and Hugging Face Transformers. It initializes the GPT2 tokenizer and model from pre-trained weights. Key hyperparameters such as learning rate, epochs, block size, batch size, device, and the Lance dataset path are defined, preparing the environment for subsequent data loading and model training.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/llm_training.md#_snippet_0
LANGUAGE: python
CODE:
```
import numpy as np
import lance
import torch
from torch.utils.data import Dataset, DataLoader, Sampler
from transformers import AutoTokenizer, AutoModelForCausalLM
from tqdm.auto import tqdm
# We'll be training the pre-trained GPT2 model in this example
model_name = 'gpt2'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
# Also define some hyperparameters
lr = 3e-4
nb_epochs = 10
block_size = 1024
batch_size = 8
device = 'cuda:0'
dataset_path = 'wikitext_500K.lance'
```
----------------------------------------
TITLE: Define context window and stride parameters
DESCRIPTION: Initializes `window` and `stride` variables for creating rolling contextual windows from text data. These parameters define the size of each context (number of sentences) and the step size for generating subsequent contexts, respectively.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/youtube_transcript_search.ipynb#_snippet_3
LANGUAGE: python
CODE:
```
import numpy as np
import pandas as pd
window = 20
stride = 4
```
----------------------------------------
TITLE: Append New Fragments to an Existing Lance Dataset
DESCRIPTION: This example illustrates how to append new data to an existing Lance dataset. It retrieves the current dataset version, uses `lance.LanceOperation.Append` with the collected fragments, and commits them, ensuring the `read_version` is correctly set to maintain data consistency during the append operation.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/distributed_write.md#_snippet_2
LANGUAGE: python
CODE:
```
import lance
ds = lance.dataset(data_uri)
read_version = ds.version # record the read version
op = lance.LanceOperation.Append(schema, all_fragments)
lance.LanceDataset.commit(
data_uri,
op,
read_version=read_version,
)
```
----------------------------------------
TITLE: Extract Video Frames from Lance Blob Data in Python
DESCRIPTION: This Python example illustrates how to fetch and process large binary video data stored as blobs in a Lance dataset. It uses `lance.dataset.LanceDataset.take_blobs` to retrieve a `BlobFile` object, then leverages the `av` library to open the video and extract frames within a specified time range without loading the entire video into memory.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/blob.md#_snippet_1
LANGUAGE: python
CODE:
```
import av # pip install av
import lance
from IPython.display import display, clear_output

ds = lance.dataset("./youtube.lance")
start_time, end_time = 500, 1000
blobs = ds.take_blobs([5], "video")
with av.open(blobs[0]) as container:
    stream = container.streams.video[0]
    stream.codec_context.skip_frame = "NONKEY"
    start_time = start_time / stream.time_base
    start_time = start_time.as_integer_ratio()[0]
    end_time = end_time / stream.time_base
    container.seek(start_time, stream=stream)
    for frame in container.decode(stream):
        if frame.time > end_time:
            break
        display(frame.to_image())
        clear_output(wait=True)
```
----------------------------------------
TITLE: Perform Approximate Nearest Neighbor (ANN) Search in LanceDB
DESCRIPTION: This Python snippet shows how to perform an Approximate Nearest Neighbor (ANN) search on a LanceDB dataset with an existing index. It queries a specified vector column for the 'k' nearest neighbors to a given query vector 'q', measuring the average query time. The result is converted to a Pandas DataFrame for display.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/quickstart.md#_snippet_17
LANGUAGE: python
CODE:
```
sift1m = lance.dataset(uri)
import time
tot = 0
for q in samples:
    start = time.time()
    tbl = sift1m.to_table(nearest={"column": "vector", "q": q, "k": 10})
    end = time.time()
    tot += (end - start)
print(f"Avg(sec): {tot / len(samples)}")
print(tbl.to_pandas())
```
----------------------------------------
TITLE: Cast Column Data Types in LanceDB Dataset
DESCRIPTION: This snippet explains how to change the data type of a column in a LanceDB dataset using `lance.LanceDataset.alter_columns`. It notes that this operation rewrites only the affected column's data files and that any existing index on the column will be dropped. An example is provided for converting a float32 embedding column to float16 to save disk space.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/data_evolution.md#_snippet_7
LANGUAGE: python
CODE:
```
table = pa.table({
"id": pa.array([1, 2, 3]),
"embedding": pa.FixedShapeTensorArray.from_numpy_ndarray(
np.random.rand(3, 128).astype("float32"))
})
dataset = lance.write_dataset(table, "embeddings")
dataset.alter_columns({"path": "embedding",
"data_type": pa.list_(pa.float16(), 128)})
print(dataset.schema)
# id: int64
# embedding: fixed_size_list<item: halffloat>[128]
# child 0, item: halffloat
```
----------------------------------------
TITLE: Call OpenAI Completion API for text generation
DESCRIPTION: Defines the `complete` function to interact with OpenAI's `text-davinci-003` model. It sends a given prompt and retrieves the generated text completion, configuring parameters like temperature, max tokens, and presence/frequency penalties for desired output characteristics.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/youtube_transcript_search.ipynb#_snippet_11
LANGUAGE: python
CODE:
```
import openai

def complete(prompt):
    # query text-davinci-003
    res = openai.Completion.create(
        engine='text-davinci-003',
        prompt=prompt,
        temperature=0,
        max_tokens=400,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
        stop=None
    )
    return res['choices'][0]['text'].strip()
# check that it works
query = "who was the 12th person on the moon and when did they land?"
complete(query)
```
----------------------------------------
TITLE: Build LanceDB Java Project with Maven
DESCRIPTION: Provides the Maven command to clean and package the entire LanceDB Java project, including its dependencies and sub-modules. This command compiles the Java code and prepares it for deployment.
SOURCE: https://github.com/lancedb/lance/blob/main/java/README.md#_snippet_9
LANGUAGE: shell
CODE:
```
mvn clean package
```
----------------------------------------
TITLE: Import IPython.display for multimedia output
DESCRIPTION: Imports the `YouTubeVideo` class from `IPython.display`. This class is essential for embedding and displaying YouTube videos directly within an IPython or Jupyter environment, allowing for rich multimedia output.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/youtube_transcript_search.ipynb#_snippet_13
LANGUAGE: python
CODE:
```
from IPython.display import YouTubeVideo
```
----------------------------------------
TITLE: Initialize CLIP Model Instances, Tokenizer, and PyTorch Optimizer
DESCRIPTION: This snippet initializes instances of the `ImageEncoder`, `TextEncoder`, and `Head` modules, along with a Hugging Face `AutoTokenizer`. It then sets up a PyTorch `Adam` optimizer, explicitly defining separate learning rates for the image encoder, text encoder, and the combined head modules, preparing the model for training.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/clip_training.md#_snippet_7
LANGUAGE: python
CODE:
```
# Define image encoder, image head, text encoder, text head and a tokenizer for tokenizing the caption
img_encoder = ImageEncoder(model_name=Config.img_encoder_model).to('cuda')
img_head = Head(Config.img_embed_dim, Config.projection_dim).to('cuda')
tokenizer = AutoTokenizer.from_pretrained(Config.text_encoder_model)
text_encoder = TextEncoder(model_name=Config.text_encoder_model).to('cuda')
text_head = Head(Config.text_embed_dim, Config.projection_dim).to('cuda')
# Since we are optimizing two different models together, we will define parameters manually
parameters = [
    {"params": img_encoder.parameters(), "lr": Config.img_enc_lr},
    {"params": text_encoder.parameters(), "lr": Config.text_enc_lr},
    {
        "params": itertools.chain(
            img_head.parameters(),
            text_head.parameters(),
        ),
        "lr": Config.head_lr,
    },
]
optimizer = torch.optim.Adam(parameters)
```
----------------------------------------
TITLE: Build vector index for LanceDB dataset
DESCRIPTION: Creates an IVF_PQ (Inverted File Index with Product Quantization) index on the 'vector' column of the LanceDB dataset. This indexing significantly speeds up similarity search queries, making the retrieval of relevant contexts much faster.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/youtube_transcript_search.ipynb#_snippet_9
LANGUAGE: python
CODE:
```
ds = ds.create_index("vector",
index_type="IVF_PQ",
num_partitions=64, # IVF
num_sub_vectors=96) # PQ
```
----------------------------------------
TITLE: Import necessary modules for Lance and PyTorch deep learning artifact management
DESCRIPTION: This snippet imports essential Python libraries required for deep learning artifact management using Lance. It includes `os` and `shutil` for file system operations, `lance` for data storage, `pyarrow` for schema definition, `torch` for PyTorch model handling, and `collections.OrderedDict` for managing model state dictionaries.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/artifact_management.md#_snippet_0
LANGUAGE: python
CODE:
```
import os
import shutil
import lance
import pyarrow as pa
import torch
from collections import OrderedDict
```
----------------------------------------
TITLE: Download and extract MeCab Ipadic model
DESCRIPTION: This snippet downloads the gzipped tarball of the MeCab Ipadic model from GitHub and then extracts its contents using tar. This is the first step in preparing the dictionary for building.
SOURCE: https://github.com/lancedb/lance/blob/main/python/python/tests/models/lindera/README.md#_snippet_0
LANGUAGE: bash
CODE:
```
curl -L -o mecab-ipadic-2.7.0-20070801.tar.gz "https://github.com/lindera-morphology/mecab-ipadic/archive/refs/tags/2.7.0-20070801.tar.gz"
tar xvf mecab-ipadic-2.7.0-20070801.tar.gz
```
----------------------------------------
TITLE: Process Image Captions and Images for Lance Dataset in Python
DESCRIPTION: This Python function `process` takes a list of image captions, reads corresponding image files, converts them to binary, and yields PyArrow RecordBatches. Each batch contains `image_id`, binary `image` data, and a list of `captions`, preparing data for a Lance dataset. It handles `FileNotFoundError` for missing images and uses `tqdm` for progress indication.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/flickr8k_dataset_creation.md#_snippet_3
LANGUAGE: python
CODE:
```
def process(captions):
    for img_id, img_captions in tqdm(captions):
        try:
            with open(os.path.join(image_folder, img_id), 'rb') as im:
                binary_im = im.read()
        except FileNotFoundError:
            print(f"img_id '{img_id}' not found in the folder, skipping.")
            continue
        img_id = pa.array([img_id], type=pa.string())
        img = pa.array([binary_im], type=pa.binary())
        capt = pa.array([img_captions], pa.list_(pa.string(), -1))
        yield pa.RecordBatch.from_arrays(
            [img_id, img, capt],
            ["image_id", "image", "captions"]
        )
```
----------------------------------------
TITLE: Create Empty Lance Dataset in Java
DESCRIPTION: This Java code demonstrates how to create a new, empty Lance dataset at a specified path. It defines the dataset's schema with 'id' (Int32) and 'name' (Utf8) fields, initializes a `BufferAllocator`, and uses `Dataset.create` to persist the schema. The snippet also shows how to access dataset version information immediately after creation.
SOURCE: https://github.com/lancedb/lance/blob/main/java/README.md#_snippet_1
LANGUAGE: Java
CODE:
```
void createDataset() throws IOException, URISyntaxException {
    String datasetPath = tempDir.resolve("write_stream").toString();
    Schema schema =
        new Schema(
            Arrays.asList(
                Field.nullable("id", new ArrowType.Int(32, true)),
                Field.nullable("name", new ArrowType.Utf8())),
            null);
    try (BufferAllocator allocator = new RootAllocator();) {
        Dataset.create(allocator, datasetPath, schema, new WriteParams.Builder().build());
        try (Dataset dataset = Dataset.create(allocator, datasetPath, schema, new WriteParams.Builder().build());) {
            dataset.version();
            dataset.latestVersion();
        }
    }
}
```
----------------------------------------
TITLE: Generate contextual text windows from video transcripts
DESCRIPTION: Defines the `contextualize` function to create overlapping text contexts from video transcripts. It processes each video, combining sentences into windows based on `window` and `stride` parameters, and returns a new DataFrame with these generated contexts.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/youtube_transcript_search.ipynb#_snippet_4
LANGUAGE: python
CODE:
```
def contextualize(raw_df, window, stride):
def process_video(vid):
# For each video, create the text rolling window
text = vid.text.values
time_end = vid["end"].values
contexts = vid.iloc[:-window:stride, :].copy()
contexts["text"] = [' '.join(text[start_i:start_i+window])
for start_i in range(0, len(vid)-window, stride)]
contexts["end"] = [time_end[start_i+window-1]
for start_i in range(0, len(vid)-window, stride)]
return contexts
# concat result from all videos
return pd.concat([process_video(vid) for _, vid in raw_df.groupby("title")])
df = contextualize(data.to_pandas(), 20, 4)
```
----------------------------------------
TITLE: Display answer and relevant YouTube video segment
DESCRIPTION: Executes the full Q&A pipeline: poses a query, retrieves the answer and relevant context, prints the generated answer, and then displays the most relevant YouTube video segment using `YouTubeVideo` at the precise timestamp where the context was found.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/youtube_transcript_search.ipynb#_snippet_14
LANGUAGE: python
CODE:
```
query = ("Which training method should I use for sentence transformers "
"when I only have pairs of related sentences?")
completion, context = answer(query)
print(completion)
top_match = context.iloc[0]
YouTubeVideo(top_match["url"].split("/")[-1], start=top_match["start"])
```
----------------------------------------
TITLE: Create LanceDB dataset from embeddings and contexts
DESCRIPTION: Converts the generated embeddings into a LanceDB vector table and combines it with the original contextualized DataFrame. This process creates a new LanceDB dataset named 'chatbot.lance' on disk, ready for efficient vector search operations.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/youtube_transcript_search.ipynb#_snippet_8
LANGUAGE: python
CODE:
```
import lance
import pyarrow as pa
from lance.vector import vec_to_table
table = vec_to_table(np.array(embeds))
combined = pa.Table.from_pandas(df).append_column("vector", table["vector"])
ds = lance.write_dataset(combined, "chatbot.lance")
```
----------------------------------------
TITLE: Create LanceDB Index for GIST-1M Dataset
DESCRIPTION: Builds an index on the GIST-1M Lance dataset using `index.py`. The specified parameters for IVF partitions (-i) and PQ subvectors (-p) are crucial for optimizing query performance.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/sift/README.md#_snippet_11
LANGUAGE: sh
CODE:
```
./index.py ./.lancedb/gist1m.lance -i 256 -p 120
```
----------------------------------------
TITLE: Generate Lance Dataset
DESCRIPTION: This command executes the `datagen.py` script to create the Lance dataset required for the Cohere wiki text embedding benchmark.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/wiki/README.md#_snippet_0
LANGUAGE: bash
CODE:
```
python datagen.py
```
----------------------------------------
TITLE: Generate answer using vector search and LLM
DESCRIPTION: Combines embedding generation, LanceDB vector search, and prompt creation to answer a question. It first embeds the query, then finds the most relevant contexts using vector similarity in LanceDB, and finally uses an LLM to formulate an answer based on those retrieved contexts.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/youtube_transcript_search.ipynb#_snippet_12
LANGUAGE: python
CODE:
```
def answer(question):
    emb = embed_func(question)[0]
context = ds.to_table(
nearest={
"column": "vector",
"k": 3,
"q": emb,
"nprobes": 20,
"refine_factor": 100
}).to_pandas()
prompt = create_prompt(question, context)
return complete(prompt), context.reset_index()
```
----------------------------------------
TITLE: Create LanceDB Index for SIFT-1M Dataset
DESCRIPTION: Builds an index on the SIFT-1M Lance dataset using `index.py`. The specified parameters for IVF partitions (-i) and PQ subvectors (-p) are crucial for optimizing query performance.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/sift/README.md#_snippet_6
LANGUAGE: sh
CODE:
```
./index.py ./.lancedb/sift1m.lance -i 256 -p 16
```
----------------------------------------
TITLE: LanceDB Manifest Naming Schemes
DESCRIPTION: Describes the V1 (legacy) and V2 (new) naming conventions for manifest files in LanceDB, emphasizing the V2 scheme's zero-padded, descending-sortable versioning for efficient latest manifest retrieval.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/format/format.md#_snippet_10
LANGUAGE: APIDOC
CODE:
```
Manifest Naming Schemes:
V1: _versions/{version}.manifest
V2: _versions/{u64::MAX - version:020}.manifest
```
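As an illustration of the V2 scheme, a minimal Python sketch (the `version` value is just a placeholder) shows how the zero-padded, descending name can be derived, so that a plain lexicographic listing returns the latest manifest first:
```
U64_MAX = 2**64 - 1

def v2_manifest_name(version: int) -> str:
    # u64::MAX - version, zero-padded to 20 digits, sorts newest-first lexicographically
    return f"_versions/{U64_MAX - version:020d}.manifest"

print(v2_manifest_name(1))  # _versions/18446744073709551614.manifest
print(v2_manifest_name(2))  # _versions/18446744073709551613.manifest (sorts before version 1)
```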
----------------------------------------
TITLE: Initialize LanceDB Dataset and PyTorch DataLoader
DESCRIPTION: This snippet demonstrates how to initialize a CLIPLanceDataset using a LanceDB file (flickr8k.lance) and then wrap it with a PyTorch DataLoader. It configures the dataset with tokenization and augmentations, and the dataloader for efficient batch processing during training.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/clip_training.md#_snippet_8
LANGUAGE: python
CODE:
```
dataset = CLIPLanceDataset(
lance_path="flickr8k.lance",
max_len=Config.max_len,
tokenizer=tokenizer,
transforms=train_augments
)
dataloader = DataLoader(
dataset,
shuffle=False,
batch_size=Config.bs,
pin_memory=True
)
```
----------------------------------------
TITLE: Run GIST-1M Benchmark and Store Results
DESCRIPTION: Executes the benchmark for GIST-1M using `metrics.py`, querying the indexed dataset with specified parameters like number of results to fetch (-k) and query vectors (-q). The results, including mean query time and recall@1, are saved to a CSV file.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/sift/README.md#_snippet_12
LANGUAGE: sh
CODE:
```
./metrics.py ./.lancedb/gist1m.lance results-gist.csv -i 256 -p 120 -q ./.lancedb/gist_query.lance -k 1
```
----------------------------------------
TITLE: Run SIFT-1M Benchmark and Store Results
DESCRIPTION: Executes the benchmark for SIFT-1M using `metrics.py`, querying the indexed dataset with specified parameters like number of results to fetch (-k) and query vectors (-q). The results, including mean query time and recall@1, are saved to a CSV file.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/sift/README.md#_snippet_7
LANGUAGE: sh
CODE:
```
./metrics.py ./.lancedb/sift1m.lance results-sift.csv -i 256 -p 16 -q ./.lancedb/sift_query.lance -k 1
```
----------------------------------------
TITLE: Object Store General Configuration Options
DESCRIPTION: Details configuration parameters applicable to all object stores, including network, security, and retry settings. These options control connection behavior, certificate validation, timeouts, user agents, proxy usage, and client-side retry logic.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/object_store.md#_snippet_2
LANGUAGE: APIDOC
CODE:
```
Key: allow_http
Description: Allow non-TLS, i.e. non-HTTPS connections. Default, False.
Key: download_retry_count
Description: Number of times to retry a download. Default, 3. This limit is applied when the HTTP request succeeds but the response is not fully downloaded, typically due to a violation of request_timeout.
Key: allow_invalid_certificates
Description: Skip certificate validation on https connections. Default, False. Warning: This is insecure and should only be used for testing.
Key: connect_timeout
Description: Timeout for only the connect phase of a Client. Default, 5s.
Key: request_timeout
Description: Timeout for the entire request, from connection until the response body has finished. Default, 30s.
Key: user_agent
Description: User agent string to use in requests.
Key: proxy_url
Description: URL of a proxy server to use for requests. Default, None.
Key: proxy_ca_certificate
Description: PEM-formatted CA certificate for proxy connections
Key: proxy_excludes
Description: List of hosts that bypass proxy. This is a comma separated list of domains and IP masks. Any subdomain of the provided domain will be bypassed. For example, example.com, 192.168.1.0/24 would bypass https://api.example.com, https://www.example.com, and any IP in the range 192.168.1.0/24.
Key: client_max_retries
Description: Number of times for a s3 client to retry the request. Default, 10.
Key: client_retry_timeout
Description: Timeout for a s3 client to retry the request in seconds. Default, 180.
```
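A hedged example of passing several of these keys together through `storage_options` (the bucket path and values are placeholders; each key can equally be set as an environment variable):
```
import lance

ds = lance.dataset(
    "s3://bucket/path",  # placeholder URI
    storage_options={
        "allow_http": "true",        # only for non-TLS endpoints, e.g. local testing
        "connect_timeout": "10s",
        "request_timeout": "60s",
        "client_max_retries": "5",
    },
)
```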
----------------------------------------
TITLE: Import necessary libraries for CLIP model training
DESCRIPTION: This snippet imports essential Python libraries like cv2, lance, numpy, torch, timm, and transformers, which are required for building and training a multi-modal CLIP model. It also includes utility libraries such as itertools and tqdm, and a warning filter.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/clip_training.md#_snippet_0
LANGUAGE: python
CODE:
```
import cv2
import lance
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
import timm
from transformers import AutoModel, AutoTokenizer
import itertools
from tqdm import tqdm
import warnings
warnings.simplefilter('ignore')
```
----------------------------------------
TITLE: Build user dictionary with Lindera
DESCRIPTION: This command demonstrates how to build a custom user dictionary using 'lindera build'. It takes a CSV file as input and creates a new user dictionary, which can be used to extend the base language model.
SOURCE: https://github.com/lancedb/lance/blob/main/python/python/tests/models/lindera/README.md#_snippet_2
LANGUAGE: bash
CODE:
```
lindera build --build-user-dictionary --dictionary-kind=ipadic user_dict/userdict.csv user_dict2
```
----------------------------------------
TITLE: Google Cloud Storage Configuration Keys
DESCRIPTION: Reference for configuration keys available for Google Cloud Storage when used with LanceDB. These keys can be set as environment variables or within the `storage_options` parameter.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/object_store.md#_snippet_8
LANGUAGE: APIDOC
CODE:
```
Key / Environment Variable | Description
--------------------------|------------
google_service_account / service_account | Path to the service account JSON file.
google_service_account_key / service_account_key | The serialized service account key.
google_application_credentials / application_credentials | Path to the application credentials.
```
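For example, the application-credentials variant could be passed like this (the path is a placeholder; a service-account example appears in a later snippet):
```
import lance

ds = lance.dataset(
    "gs://my-bucket/my-dataset",
    storage_options={
        "application_credentials": "path/to/application-credentials.json",  # placeholder path
    },
)
```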
----------------------------------------
TITLE: Load YouTube transcription dataset
DESCRIPTION: Downloads and loads the 'jamescalam/youtube-transcriptions' dataset from Hugging Face datasets. The 'train' split is specified to retrieve the main training portion of the data.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/youtube_transcript_search.ipynb#_snippet_1
LANGUAGE: python
CODE:
```
from datasets import load_dataset
data = load_dataset('jamescalam/youtube-transcriptions', split='train')
data
```
----------------------------------------
TITLE: Index Lance Data for Benchmarking
DESCRIPTION: This command runs the `index.py` script to build an index on the generated Lance dataset. It configures the index with an L2 metric, 2048 partitions, and 96 sub-vectors for optimized benchmarking.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/wiki/README.md#_snippet_1
LANGUAGE: bash
CODE:
```
python index.py --metric L2 --num-partitions 2048 --num-sub-vectors 96
```
----------------------------------------
TITLE: Jieba User Dictionary Configuration File (config.json)
DESCRIPTION: JSON configuration for Jieba user dictionaries. This file, named `config.json`, specifies an optional 'main' dictionary and an array of paths to additional 'users' dictionary files. It should be placed in the model's root directory.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/tokenizer.md#_snippet_3
LANGUAGE: json
CODE:
```
{
"main": "dict.txt",
"users": ["path/to/user/dict.txt"]
}
```
----------------------------------------
TITLE: Batch and generate embeddings using OpenAI API
DESCRIPTION: Configures the OpenAI API key and defines a `to_batches` helper function for processing data in chunks. It then uses this to generate embeddings for the contextualized text in batches, improving efficiency and adhering to API best practices by reducing individual API calls.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/youtube_transcript_search.ipynb#_snippet_7
LANGUAGE: python
CODE:
```
from tqdm.auto import tqdm
import math
openai.api_key = "sk-..."
# We request in batches rather than 1 embedding at a time
def to_batches(arr, batch_size):
length = len(arr)
def _chunker(arr):
        for start_i in range(0, len(arr), batch_size):
yield arr[start_i:start_i+batch_size]
# add progress meter
yield from tqdm(_chunker(arr), total=math.ceil(length / batch_size))
batch_size = 1000
batches = to_batches(df.text.values.tolist(), batch_size)
embeds = [emb for c in batches for emb in rate_limited(c)]
```
----------------------------------------
TITLE: Download Jieba Language Model
DESCRIPTION: Command-line instruction to download the Jieba language model for use with LanceDB. The model will be automatically stored in the default Jieba model directory within the configured language model home.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/tokenizer.md#_snippet_1
LANGUAGE: bash
CODE:
```
python -m lance.download jieba
```
----------------------------------------
TITLE: Read and Inspect Lance Dataset in Rust
DESCRIPTION: This Rust function `read_dataset` shows how to open an existing Lance dataset from a given path. It uses a `scanner` to create a `batch_stream` and then iterates through each `RecordBatch`, printing its number of rows, columns, schema, and the entire batch content.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/rust/write_read_dataset.md#_snippet_1
LANGUAGE: Rust
CODE:
```
// Reads dataset from the given path and prints batch size, schema for all record batches. Also extracts and prints a slice from the first batch
async fn read_dataset(data_path: &str) {
let dataset = Dataset::open(data_path).await.unwrap();
let scanner = dataset.scan();
let mut batch_stream = scanner.try_into_stream().await.unwrap().map(|b| b.unwrap());
while let Some(batch) = batch_stream.next().await {
println!("Batch size: {}, {}", batch.num_rows(), batch.num_columns()); // print size of batch
println!("Schema: {:?}", batch.schema()); // print schema of recordbatch
println!("Batch: {:?}", batch); // print the entire recordbatch (schema and data)
}
} // End read dataset
```
----------------------------------------
TITLE: Define configuration class for CLIP model hyperparameters
DESCRIPTION: This Python class, `Config`, centralizes all hyperparameters necessary for training the CLIP model. It includes image and text dimensions, learning rates for different components, batch size, maximum sequence length, projection dimensions, temperature, number of epochs, and the names of the image and text encoder models.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/clip_training.md#_snippet_1
LANGUAGE: python
CODE:
```
class Config:
img_size = (128, 128)
bs = 32
head_lr = 1e-3
img_enc_lr = 1e-4
text_enc_lr = 1e-5
max_len = 18
img_embed_dim = 2048
text_embed_dim = 768
projection_dim = 256
temperature = 1.0
num_epochs = 2
img_encoder_model = 'resnet50'
text_encoder_model = 'bert-base-cased'
```
----------------------------------------
TITLE: LanceDB External Manifest Store Reader Operations
DESCRIPTION: Explains the reader's load process when an external manifest store is in use, including retrieving the manifest path, reattempting synchronization if needed, and ensuring the dataset remains portable.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/format/format.md#_snippet_13
LANGUAGE: APIDOC
CODE:
```
External Store Reader Load Process:
1. GET_EXTERNAL_STORE base_uri, version, path
- Action: Retrieve manifest path from external store.
- Condition: If path does not end in UUID, return path.
2. COPY_OBJECT_STORE mydataset.lance/_versions/{version}.manifest-{uuid} mydataset.lance/_versions/{version}.manifest
- Action: Reattempt synchronization (copy staged to final).
3. PUT_EXTERNAL_STORE base_uri, version, mydataset.lance/_versions/{version}.manifest
- Action: Update external store to point to final manifest.
4. RETURN mydataset.lance/_versions/{version}.manifest
- Action: Always return the finalized path.
- Error: Return error if synchronization fails.
```
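The same load process, written out as a rough Python sketch; `external_store` and `object_store` are hypothetical interfaces standing in for whatever key-value store and object store a deployment actually uses:
```
def load_manifest_path(external_store, object_store, base_uri: str, version: int) -> str:
    # 1. Ask the external store which manifest this version points at.
    path = external_store.get(base_uri, version)
    final = f"{base_uri}/_versions/{version}.manifest"
    if path == final:
        # Already finalized (no trailing UUID): nothing to repair.
        return path
    # 2. Re-attempt synchronization: copy the staged manifest-{uuid} to its final name.
    object_store.copy(path, final)
    # 3. Point the external store at the finalized manifest.
    external_store.put(base_uri, version, final)
    # 4. Always return the finalized path (or raise if the copy/put above failed).
    return final
```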
----------------------------------------
TITLE: Generate Text-to-Image 10M Dataset in Lance Format
DESCRIPTION: This snippet demonstrates how to create the 'text2image-10m' dataset in Lance format using scripts from the 'big-ann-benchmarks' repository. Upon execution, it generates two Lance datasets: a base dataset and a corresponding queries/ground truth dataset, essential for benchmarking.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/bigann/README.md#_snippet_1
LANGUAGE: bash
CODE:
```
python ./big-ann-benchmarks/create_dataset.py --dataset yfcc-10M
./dataset.py -t text2image-10m data/text2image1B
```
----------------------------------------
TITLE: Run Flat Index Search Benchmark
DESCRIPTION: Executes the benchmark script to measure performance of flat index search. This command generates `benchmark.csv` for raw data and `benchmark.html` for latency plots.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/flat/README.md#_snippet_0
LANGUAGE: Shell
CODE:
```
./benchmark.py
```
----------------------------------------
TITLE: PyTorch Model Training Loop with LanceDB DataLoader
DESCRIPTION: This snippet illustrates a complete PyTorch training loop. It begins by defining a `LanceDataset` and `LanceSampler` to efficiently load data, then sets up a `DataLoader`. The code proceeds to initialize a PyTorch model and an AdamW optimizer. The core of the snippet is the epoch-based training loop, which includes iterating through batches, performing forward and backward passes, calculating loss, updating model parameters, and reporting training perplexity.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/llm_training.md#_snippet_4
LANGUAGE: python
CODE:
```
dataset = LanceDataset(dataset_path, block_size)
sampler = LanceSampler(dataset, block_size)
dataloader = DataLoader(
dataset,
shuffle=False,
batch_size=batch_size,
sampler=sampler,
pin_memory=True
)
# Define the optimizer, training loop and train the model!
model = model.to(device)
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
for epoch in range(nb_epochs):
print(f"========= Epoch: {epoch+1} / {nb_epochs} ========")
epoch_loss = []
prog_bar = tqdm(dataloader, total=len(dataloader))
for batch in prog_bar:
optimizer.zero_grad(set_to_none=True)
# Put both input_ids and labels to the device
for k, v in batch.items():
batch[k] = v.to(device)
# Perform one forward pass and get the loss
outputs = model(**batch)
loss = outputs.loss
# Perform backward pass
loss.backward()
optimizer.step()
prog_bar.set_description(f"loss: {loss.item():.4f}")
epoch_loss.append(loss.item())
# Calculate training perplexity for this epoch
try:
perplexity = np.exp(np.mean(epoch_loss))
except OverflowError:
perplexity = float("-inf")
print(f"train_perplexity: {perplexity}")
```
----------------------------------------
TITLE: Create PyArrow RecordBatchReader from Processed Samples (Python)
DESCRIPTION: This code creates a PyArrow RecordBatchReader, which acts as an iterator over the data generated by the 'process_samples' function. It uses the defined schema to ensure data consistency and prepares the stream of record batches for writing to a Lance dataset.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/llm_dataset_creation.md#_snippet_4
LANGUAGE: python
CODE:
```
reader = pa.RecordBatchReader.from_batches(
schema,
process_samples(dataset, num_samples=500_000, field='text') # For 500K samples
)
```
----------------------------------------
TITLE: Download and Extract GIST-1M Dataset
DESCRIPTION: Downloads the GIST-1M dataset archive from the specified FTP server and extracts its contents. This is a prerequisite for generating Lance datasets for GIST-1M.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/sift/README.md#_snippet_8
LANGUAGE: sh
CODE:
```
wget ftp://ftp.irisa.fr/local/texmex/corpus/gist.tar.gz
tar -xzf gist.tar.gz
```
----------------------------------------
TITLE: Create a Lance Dataset from Arrow RecordBatches in Rust
DESCRIPTION: Demonstrates how to write a collection of Arrow RecordBatches and an Arrow Schema into a new Lance Dataset. It uses default write parameters and an iterator for the batches.
SOURCE: https://github.com/lancedb/lance/blob/main/rust/lance/README.md#_snippet_1
LANGUAGE: rust
CODE:
```
use lance::{dataset::WriteParams, Dataset};
let write_params = WriteParams::default();
let mut reader = RecordBatchIterator::new(
batches.into_iter().map(Ok),
schema
);
Dataset::write(reader, &uri, Some(write_params)).await.unwrap();
```
----------------------------------------
TITLE: Create TensorFlow Dataset from Lance URI
DESCRIPTION: This snippet demonstrates how to initialize a `tf.data.Dataset` directly from a Lance dataset URI using `lance.tf.data.from_lance`. It also shows how to chain standard TensorFlow dataset operations like shuffling and mapping for data preprocessing.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/integrations/tensorflow.md#_snippet_0
LANGUAGE: python
CODE:
```
import tensorflow as tf
import lance
# Create tf dataset
ds = lance.tf.data.from_lance("s3://my-bucket/my-dataset")
# Chain tf dataset with other tf primitives
for batch in ds.shuffling(32).map(lambda x: tf.io.decode_png(x["image"])):
print(batch)
```
----------------------------------------
TITLE: Write PyArrow Record Batches to Lance Dataset in Python
DESCRIPTION: This Python code demonstrates how to write PyArrow Record Batches to a Lance dataset. It creates a `RecordBatchReader` from the defined schema and the output of the `process` function, then uses `lance.write_dataset` to efficiently persist the data to a file named 'flickr8k.lance' on disk.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/flickr8k_dataset_creation.md#_snippet_5
LANGUAGE: python
CODE:
```
reader = pa.RecordBatchReader.from_batches(schema, process(captions))
lance.write_dataset(reader, "flickr8k.lance", schema)
```
----------------------------------------
TITLE: Implement PyTorch CLIP Model Training Loop
DESCRIPTION: This code defines the core training loop for a CLIP model. It sets all model components to training mode, iterates through epochs and batches from the DataLoader, performs forward and backward passes, calculates loss, and updates model weights using an optimizer. A progress bar provides real-time feedback on the training process.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/clip_training.md#_snippet_9
LANGUAGE: python
CODE:
```
img_encoder.train()
img_head.train()
text_encoder.train()
text_head.train()
for epoch in range(Config.num_epochs):
print(f"{'='*20} Epoch: {epoch+1} / {Config.num_epochs} {'='*20}")
prog_bar = tqdm(dataloader)
for img, caption in prog_bar:
        optimizer.zero_grad(set_to_none=True)
img_embed, text_embed = forward(img, caption)
loss = loss_fn(img_embed, text_embed, temperature=Config.temperature).mean()
loss.backward()
optimizer.step()
prog_bar.set_description(f"loss: {loss.item():.4f}")
print()
```
----------------------------------------
TITLE: Build Ipadic language model with Lindera
DESCRIPTION: This command uses the 'lindera build' tool to compile the Ipadic dictionary. It specifies the dictionary kind as 'ipadic' and points to the extracted model directory to create the main dictionary.
SOURCE: https://github.com/lancedb/lance/blob/main/python/python/tests/models/lindera/README.md#_snippet_1
LANGUAGE: bash
CODE:
```
lindera build --dictionary-kind=ipadic mecab-ipadic-2.7.0-20070801 main
```
----------------------------------------
TITLE: Write Lance Dataset in Rust
DESCRIPTION: This Rust function `write_dataset` demonstrates how to create and write a Lance dataset to a specified path. It defines a schema with `UInt32` fields, creates a `RecordBatch` with sample data, and uses `WriteParams` to set the write mode to `Overwrite` before writing the dataset to disk.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/rust/write_read_dataset.md#_snippet_0
LANGUAGE: Rust
CODE:
```
// Writes sample dataset to the given path
async fn write_dataset(data_path: &str) {
// Define new schema
let schema = Arc::new(Schema::new(vec![
Field::new("key", DataType::UInt32, false),
Field::new("value", DataType::UInt32, false),
]));
// Create new record batches
let batch = RecordBatch::try_new(
schema.clone(),
vec![
Arc::new(UInt32Array::from(vec![1, 2, 3, 4, 5, 6])),
Arc::new(UInt32Array::from(vec![6, 7, 8, 9, 10, 11])),
],
)
.unwrap();
let batches = RecordBatchIterator::new([Ok(batch)], schema.clone());
// Define write parameters (e.g. overwrite dataset)
let write_params = WriteParams {
mode: WriteMode::Overwrite,
..Default::default()
};
Dataset::write(batches, data_path, Some(write_params))
.await
.unwrap();
} // End write dataset
```
----------------------------------------
TITLE: Download and Extract SIFT-1M Dataset
DESCRIPTION: Downloads the SIFT-1M dataset archive from the specified FTP server and extracts its contents. This is a prerequisite for generating Lance datasets for SIFT-1M.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/sift/README.md#_snippet_1
LANGUAGE: sh
CODE:
```
wget ftp://ftp.irisa.fr/local/texmex/corpus/sift.tar.gz
tar -xzf sift.tar.gz
```
----------------------------------------
TITLE: Query Lance Dataset with DuckDB
DESCRIPTION: Demonstrates querying a Lance dataset directly using DuckDB. It highlights the integration with DuckDB for SQL-based data exploration and retrieval, enabling powerful analytical queries.
SOURCE: https://github.com/lancedb/lance/blob/main/README.md#_snippet_5
LANGUAGE: python
CODE:
```
import duckdb
# If this segfaults, make sure you have duckdb v0.7+ installed
duckdb.query("SELECT * FROM dataset LIMIT 10").to_df()
```
----------------------------------------
TITLE: Build LanceDB Rust JNI Module
DESCRIPTION: Specifies the command to build only the Rust-based JNI (Java Native Interface) module of LanceDB. This is useful for developers focusing on the native components without rebuilding the entire Java project.
SOURCE: https://github.com/lancedb/lance/blob/main/java/README.md#_snippet_10
LANGUAGE: shell
CODE:
```
cargo build
```
----------------------------------------
TITLE: Initialize Lance Dataset from Local Path
DESCRIPTION: This Python snippet demonstrates how to initialize a Lance dataset object from a local file path. It sets up the dataset for subsequent read operations, enabling access to the data stored in the specified Lance file.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/read_and_write.md#_snippet_12
LANGUAGE: python
CODE:
```
ds = lance.dataset("./imagenet.lance")
```
----------------------------------------
TITLE: Implement custom PyTorch Dataset for Lance-based CLIP training
DESCRIPTION: This `CLIPLanceDataset` class extends PyTorch's `Dataset` to handle Lance datasets for CLIP model training. It initializes with a Lance dataset path, an optional tokenizer, and image transformations, providing methods to retrieve pre-processed images and tokenized captions for use in a DataLoader.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/clip_training.md#_snippet_3
LANGUAGE: python
CODE:
```
class CLIPLanceDataset(Dataset):
"""Custom Dataset to load images and their corresponding captions"""
def __init__(self, lance_path, max_len=18, tokenizer=None, transforms=None):
self.ds = lance.dataset(lance_path)
self.max_len = max_len
# Init a new tokenizer if not specified already
self.tokenizer = AutoTokenizer.from_pretrained('bert-base-cased') if not tokenizer else tokenizer
self.transforms = transforms
def __len__(self):
return self.ds.count_rows()
def __getitem__(self, idx):
# Load the image and caption
img = load_image(self.ds, idx)
caption = load_caption(self.ds, idx)
# Apply transformations to the images
if self.transforms:
img = self.transforms(img)
# Tokenize the caption
caption = self.tokenizer(
caption,
truncation=True,
padding='max_length',
max_length=self.max_len,
return_tensors='pt'
)
# Flatten each component of tokenized caption otherwise they will cause size mismatch errors during training
caption = {k: v.flatten() for k, v in caption.items()}
return img, caption
```
----------------------------------------
TITLE: Azure Blob Storage Configuration Keys
DESCRIPTION: Reference for configuration keys available for Azure Blob Storage when used with LanceDB. These keys can be set as environment variables or within the `storage_options` parameter.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/object_store.md#_snippet_10
LANGUAGE: APIDOC
CODE:
```
Key / Environment Variable | Description
--------------------------|------------
azure_storage_account_name / account_name | The name of the azure storage account.
azure_storage_account_key / account_key | The serialized service account key.
azure_client_id / client_id | Service principal client id for authorizing requests.
azure_client_secret / client_secret | Service principal client secret for authorizing requests.
azure_tenant_id / tenant_id | Tenant id used in oauth flows.
azure_storage_sas_key / azure_storage_sas_token / sas_key / sas_token | Shared access signature. The signature is expected to be percent-encoded, much like they are provided in the azure storage explorer or azure portal.
azure_storage_token / bearer_token / token | Bearer token.
azure_storage_use_emulator / object_store_use_emulator / use_emulator | Use object store with azurite storage emulator.
azure_endpoint / endpoint | Override the endpoint used to communicate with blob storage.
azure_use_fabric_endpoint / use_fabric_endpoint | Use object store with url scheme account.dfs.fabric.microsoft.com.
azure_msi_endpoint / azure_identity_endpoint / identity_endpoint / msi_endpoint | Endpoint to request a imds managed identity token.
azure_object_id / object_id | Object id for use with managed identity authentication.
azure_msi_resource_id / msi_resource_id | Msi resource id for use with managed identity authentication.
azure_federated_token_file / federated_token_file | File containing token for Azure AD workload identity federation.
azure_use_azure_cli / use_azure_cli | Use azure cli for acquiring access token.
azure_disable_tagging / disable_tagging | Disables tagging objects. This can be desirable if not supported by the backing store.
```
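A hedged example passing the account-name/account-key pair through `storage_options` (the container URI and values are placeholders; the same keys can be exported as environment variables instead):
```
import lance

ds = lance.dataset(
    "az://my-container/my-dataset",  # placeholder URI
    storage_options={
        "account_name": "some-account",
        "account_key": "<storage-account-key>",
    },
)
```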
----------------------------------------
TITLE: Define Function to Process and Tokenize Samples for Lance (Python)
DESCRIPTION: This function iterates over a dataset, tokenizes individual samples using the 'tokenize' function, and yields PyArrow RecordBatches. It processes a specified number of samples, skipping empty ones, and is designed to efficiently prepare data for writing to a Lance dataset.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/llm_dataset_creation.md#_snippet_2
LANGUAGE: python
CODE:
```
def process_samples(dataset, num_samples=100_000, field='text'):
current_sample = 0
for sample in tqdm(dataset, total=num_samples):
# If we have added all 5M samples, stop
if current_sample == num_samples:
break
if not sample[field]:
continue
# Tokenize the current sample
tokenized_sample = tokenize(sample, field)
# Increment the counter
current_sample += 1
# Yield a PyArrow RecordBatch
yield pa.RecordBatch.from_arrays(
[tokenized_sample],
names=["input_ids"]
)
```
----------------------------------------
TITLE: Read a Lance Dataset and Collect RecordBatches in Rust
DESCRIPTION: Opens an existing Lance Dataset from a specified path, scans its content, and collects all resulting RecordBatches into a vector. Error handling is included.
SOURCE: https://github.com/lancedb/lance/blob/main/rust/lance/README.md#_snippet_2
LANGUAGE: rust
CODE:
```
let dataset = Dataset::open(path).await.unwrap();
let mut scanner = dataset.scan();
let batches: Vec<RecordBatch> = scanner
.try_into_stream()
.await
.unwrap()
.map(|b| b.unwrap())
.collect::<Vec<RecordBatch>>()
.await;
```
----------------------------------------
TITLE: Visualize Latency vs. NProbes with IVF and PQ
DESCRIPTION: This snippet generates a scatter plot using seaborn to visualize the relationship between 'nprobes' and '50%' (median response time). It uses 'ivf' for color encoding and 'pq' for marker style, allowing for a multi-dimensional analysis of performance.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/sift/Results.ipynb#_snippet_7
LANGUAGE: python
CODE:
```
sns.scatterplot(data=df, x="nprobes", y="50%", hue="ivf", style="pq")
```
----------------------------------------
TITLE: Write HuggingFace Dataset to Lance Format
DESCRIPTION: This Python code snippet demonstrates how to load a HuggingFace dataset and write it into the Lance format. It uses the `datasets` library to load a specific split of a dataset and then `lance.write_dataset` to save it as a Lance file. Dependencies include `datasets` and `lance`.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/integrations/huggingface.md#_snippet_0
LANGUAGE: python
CODE:
```
import datasets # pip install datasets
import lance
lance.write_dataset(datasets.load_dataset(
"poloclub/diffusiondb", split="train[:10]"
), "diffusiondb_train.lance")
```
----------------------------------------
TITLE: Describe Median Latency by PQ Configuration
DESCRIPTION: This snippet groups the DataFrame by the 'pq' column and calculates descriptive statistics for the '50%' (median response time) column. This provides insights into latency performance based on different Product Quantization (PQ) configurations.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/sift/Results.ipynb#_snippet_4
LANGUAGE: python
CODE:
```
df.groupby("pq")["50%"].describe()
```
----------------------------------------
TITLE: Check number of generated contexts
DESCRIPTION: Prints the total number of contextualized entries created after processing the dataset. This helps verify the output of the `contextualize` function and understand the volume of data prepared for embedding.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/youtube_transcript_search.ipynb#_snippet_5
LANGUAGE: python
CODE:
```
len(df)
```
----------------------------------------
TITLE: Convert HuggingFace Dataset to LanceDB
DESCRIPTION: This snippet demonstrates how to load a dataset from HuggingFace and convert it into a Lance dataset using `lance.write_dataset`. This is a foundational step for preparing data to be used with LanceDB's PyTorch integration.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/integrations/pytorch.md#_snippet_0
LANGUAGE: python
CODE:
```
import datasets # pip install datasets
import lance
hf_ds = datasets.load_dataset(
"poloclub/diffusiondb",
split="train",
# name="2m_first_1k", # for a smaller subset of the dataset
)
lance.write_dataset(hf_ds, "diffusiondb_train.lance")
```
----------------------------------------
TITLE: Build IVF_PQ Vector Index on Lance Dataset
DESCRIPTION: Creates an IVF_PQ (Inverted File Index with Product Quantization) index on the 'vector' column of the Lance dataset. This index significantly speeds up nearest neighbor searches by efficiently partitioning and quantizing the vector space.
SOURCE: https://github.com/lancedb/lance/blob/main/README.md#_snippet_8
LANGUAGE: python
CODE:
```
sift1m.create_index("vector",
index_type="IVF_PQ",
num_partitions=256, # IVF
num_sub_vectors=16) # PQ
```
----------------------------------------
TITLE: LanceDB S3 Storage Options Reference
DESCRIPTION: Reference for available keys in the `storage_options` parameter for S3 and S3-compatible storage configurations in LanceDB. These options can be set via environment variables or directly in the `storage_options` dictionary, controlling aspects like region, endpoint, and encryption.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/object_store.md#_snippet_4
LANGUAGE: APIDOC
CODE:
```
S3 Storage Options:
- aws_region / region: The AWS region the bucket is in. This can be automatically detected when using AWS S3, but must be specified for S3-compatible stores.
- aws_access_key_id / access_key_id: The AWS access key ID to use.
- aws_secret_access_key / secret_access_key: The AWS secret access key to use.
- aws_session_token / session_token: The AWS session token to use.
- aws_endpoint / endpoint: The endpoint to use for S3-compatible stores.
- aws_virtual_hosted_style_request / virtual_hosted_style_request: Whether to use virtual hosted-style requests, where bucket name is part of the endpoint. Meant to be used with `aws_endpoint`. Default, `False`.
- aws_s3_express / s3_express: Whether to use S3 Express One Zone endpoints. Default, `False`. See more details below.
- aws_server_side_encryption: The server-side encryption algorithm to use. Must be one of `"AES256"`, `"aws:kms"`, or `"aws:kms:dsse"`. Default, `None`.
- aws_sse_kms_key_id: The KMS key ID to use for server-side encryption. If set, `aws_server_side_encryption` must be `"aws:kms"` or `"aws:kms:dsse"`.
- aws_sse_bucket_key_enabled: Whether to use bucket keys for server-side encryption.
```
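For instance, the server-side-encryption keys could be combined like this (a sketch; the bucket path and KMS key id are placeholders):
```
import lance

ds = lance.dataset(
    "s3://bucket/path",
    storage_options={
        "aws_region": "us-east-1",
        "aws_server_side_encryption": "aws:kms",
        "aws_sse_kms_key_id": "<kms-key-id>",
    },
)
```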
----------------------------------------
TITLE: Define OpenAI embedding function with rate limiting and retry
DESCRIPTION: Sets up an embedding function using OpenAI's `text-embedding-ada-002` model. It incorporates `ratelimiter` to respect API rate limits and `retry` for robust API calls, ensuring successful embedding generation even with transient network issues.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/youtube_transcript_search.ipynb#_snippet_6
LANGUAGE: python
CODE:
```
import functools
import openai
import ratelimiter
from retry import retry
embed_model = "text-embedding-ada-002"
# API limit at 60/min == 1/sec
limiter = ratelimiter.RateLimiter(max_calls=0.9, period=1.0)
# Get the embedding with retry
@retry(tries=10, delay=1, max_delay=30, backoff=3, jitter=1)
def embed_func(c):
rs = openai.Embedding.create(input=c, engine=embed_model)
return [record["embedding"] for record in rs["data"]]
rate_limited = limiter(embed_func)
```
----------------------------------------
TITLE: Add Lance SDK Java Maven Dependency
DESCRIPTION: This snippet provides the Maven XML configuration required to include the LanceDB Java SDK as a dependency in your project. It specifies the `groupId`, `artifactId`, and `version` for the `lance-core` library, enabling access to LanceDB functionalities.
SOURCE: https://github.com/lancedb/lance/blob/main/java/README.md#_snippet_0
LANGUAGE: XML
CODE:
```
<dependency>
<groupId>com.lancedb</groupId>
<artifactId>lance-core</artifactId>
<version>0.18.0</version>
</dependency>
```
----------------------------------------
TITLE: Define PyArrow Schema for Lance Dataset (Python)
DESCRIPTION: This snippet defines a PyArrow schema required by Lance to understand the structure of the data being written. It specifies that the dataset will contain a single field named 'input_ids', which will store tokenized data as 64-bit integers.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/llm_dataset_creation.md#_snippet_3
LANGUAGE: python
CODE:
```
schema = pa.schema([
pa.field("input_ids", pa.int64())
])
```
----------------------------------------
TITLE: Add Columns to LanceDB Dataset in Java
DESCRIPTION: Demonstrates how to add new columns to a LanceDB dataset. This can be done either by providing SQL expressions to derive new column values or by defining a new Arrow Schema for the dataset, allowing for flexible schema evolution.
SOURCE: https://github.com/lancedb/lance/blob/main/java/README.md#_snippet_6
LANGUAGE: java
CODE:
```
void addColumnsByExpressions() {
String datasetPath = ""; // specify a path point to a dataset
try (BufferAllocator allocator = new RootAllocator()) {
try (Dataset dataset = Dataset.open(datasetPath, allocator)) {
SqlExpressions sqlExpressions = new SqlExpressions.Builder().withExpression("double_id", "id * 2").build();
dataset.addColumns(sqlExpressions, Optional.empty());
}
}
}
```
LANGUAGE: java
CODE:
```
void addColumnsBySchema() {
String datasetPath = ""; // specify a path point to a dataset
try (BufferAllocator allocator = new RootAllocator()) {
try (Dataset dataset = Dataset.open(datasetPath, allocator)) {
dataset.addColumns(new Schema(
Arrays.asList(
Field.nullable("id", new ArrowType.Int(32, true)),
Field.nullable("name", new ArrowType.Utf8()),
Field.nullable("age", new ArrowType.Int(32, true)))), Optional.empty());
}
}
}
```
----------------------------------------
TITLE: Write Processed Data to Lance Dataset (Python)
DESCRIPTION: This final step uses the 'lance.write_dataset' function to persist the processed and tokenized data to disk as a Lance dataset. It takes the RecordBatchReader, the desired output file path, and the defined schema as arguments, completing the dataset creation process.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/llm_dataset_creation.md#_snippet_5
LANGUAGE: python
CODE:
```
# Write the dataset to disk
lance.write_dataset(
reader,
"wikitext_500K.lance",
schema
)
```
----------------------------------------
TITLE: Create a Vector Index on a Lance Dataset in Rust
DESCRIPTION: Demonstrates how to create a vector index on a specified column (e.g., 'embeddings') within a Lance Dataset. It configures vector index parameters like the number of partitions and sub-vectors, noting potential alignment requirements.
SOURCE: https://github.com/lancedb/lance/blob/main/rust/lance/README.md#_snippet_4
LANGUAGE: rust
CODE:
```
use ::lance::index::vector::VectorIndexParams;
let mut params = VectorIndexParams::default();
params.num_partitions = 256;
params.num_sub_vectors = 16;
// this will Err if list_size(embeddings) / num_sub_vectors does not meet simd alignment
dataset.create_index(&["embeddings"], IndexType::Vector, None, &params, true).await;
```
----------------------------------------
TITLE: Load Query Data with Pandas
DESCRIPTION: This snippet imports the pandas library and loads query performance data from a CSV file named 'query.csv' into a DataFrame. This DataFrame will be used for subsequent analysis and visualization.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/sift/Results.ipynb#_snippet_0
LANGUAGE: python
CODE:
```
import pandas as pd
df = pd.read_csv("query.csv")
```
----------------------------------------
TITLE: Query Lance Dataset with Pandas
DESCRIPTION: Illustrates how to convert a Lance dataset to a PyArrow Table and then to a Pandas DataFrame for easy data manipulation and analysis using familiar Pandas operations.
SOURCE: https://github.com/lancedb/lance/blob/main/README.md#_snippet_4
LANGUAGE: python
CODE:
```
df = dataset.to_table().to_pandas()
df
```
----------------------------------------
TITLE: Lance Manifest Protobuf Message Reference
DESCRIPTION: References the Protobuf message definition for the Manifest file, which encapsulates the metadata for a specific version of a Lance dataset.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/format/format.md#_snippet_1
LANGUAGE: APIDOC
CODE:
```
proto.message.Manifest
```
----------------------------------------
TITLE: Define Tokenization Function (Python)
DESCRIPTION: This function takes a single sample from a Hugging Face dataset and a specified field name (e.g., 'text'). It uses the pre-initialized tokenizer to convert the text content of that field into 'input_ids', which are numerical representations of tokens.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/llm_dataset_creation.md#_snippet_1
LANGUAGE: python
CODE:
```
def tokenize(sample, field='text'):
return tokenizer(sample[field])['input_ids']
```
----------------------------------------
TITLE: Implement CLIP Loss Function and Forward Pass Utilities
DESCRIPTION: This snippet provides utility functions for training a CLIP model. The `loss_fn` calculates the contrastive loss between image and text embeddings based on the CLIP paper, using logits, image similarity, and text similarity. The `forward` function performs a single forward pass, moving inputs to the GPU, and obtaining image and text embeddings using the defined encoder and head modules.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/clip_training.md#_snippet_6
LANGUAGE: python
CODE:
```
def loss_fn(img_embed, text_embed, temperature=0.2):
"""
https://arxiv.org/abs/2103.00020/
"""
# Calculate logits, image similarity and text similarity
logits = (text_embed @ img_embed.T) / temperature
img_sim = img_embed @ img_embed.T
text_sim = text_embed @ text_embed.T
# Calculate targets by taking the softmax of the similarities
targets = F.softmax(
(img_sim + text_sim) / 2 * temperature, dim=-1
)
img_loss = (-targets.T * nn.LogSoftmax(dim=-1)(logits.T)).sum(1)
text_loss = (-targets * nn.LogSoftmax(dim=-1)(logits)).sum(1)
return (img_loss + text_loss) / 2.0
def forward(img, caption):
# Transfer to device
img = img.to('cuda')
for k, v in caption.items():
caption[k] = v.to('cuda')
# Get embeddings for both img and caption
img_embed = img_head(img_encoder(img))
text_embed = text_head(text_encoder(caption))
return img_embed, text_embed
```
----------------------------------------
TITLE: Read Data from Lance Dataset
DESCRIPTION: Shows how to open and read a Lance dataset from a specified URI. It asserts that the returned object is a PyArrow Dataset, confirming seamless integration with the Apache Arrow ecosystem.
SOURCE: https://github.com/lancedb/lance/blob/main/README.md#_snippet_3
LANGUAGE: python
CODE:
```
dataset = lance.dataset("/tmp/test.lance")
assert isinstance(dataset, pa.dataset.Dataset)
```
----------------------------------------
TITLE: Globally Set Object Store Timeout (Bash)
DESCRIPTION: Demonstrates how to set a global timeout for object store operations using an environment variable. This configuration applies to all subsequent Lance operations that interact with object storage.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/object_store.md#_snippet_0
LANGUAGE: bash
CODE:
```
export TIMEOUT=60s
```
----------------------------------------
TITLE: Lance File Format Version Details
DESCRIPTION: This table provides a comprehensive overview of the Lance file format versions, including their compatibility, features, and stability status. It details the breaking changes, new functionalities introduced in each version, and aliases for common use cases.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/format/format.md#_snippet_5
LANGUAGE: APIDOC
CODE:
```
Version: 0.1
Minimal Lance Version: Any
Maximum Lance Version: Any
Description: This is the initial Lance format.
Version: 2.0
Minimal Lance Version: 0.16.0
Maximum Lance Version: Any
Description: Rework of the Lance file format that removed row groups and introduced null support for lists, fixed size lists, and primitives
Version: 2.1 (unstable)
Minimal Lance Version: None
Maximum Lance Version: Any
Description: Enhances integer and string compression, adds support for nulls in struct fields, and improves random access performance with nested fields.
Version: legacy
Minimal Lance Version: N/A
Maximum Lance Version: N/A
Description: Alias for 0.1
Version: stable
Minimal Lance Version: N/A
Maximum Lance Version: N/A
Description: Alias for the latest stable version (currently 2.0)
Version: next
Minimal Lance Version: N/A
Maximum Lance Version: N/A
Description: Alias for the latest unstable version (currently 2.1)
```
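If a specific format version is needed at write time, recent pylance releases expose it as a keyword on `lance.write_dataset`; a minimal sketch, assuming the `data_storage_version` parameter and using the aliases from the table above:
```
import lance
import pyarrow as pa

table = pa.table({"id": [1, 2, 3]})
# "stable" and "legacy" are the aliases listed in the version table above;
# this assumes the data_storage_version keyword available in recent pylance releases.
lance.write_dataset(table, "versioned_demo.lance", data_storage_version="stable")
```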
----------------------------------------
TITLE: Connect LanceDB to S3-Compatible Stores (e.g., MinIO)
DESCRIPTION: Illustrates how to configure LanceDB to connect to S3-compatible storage solutions like MinIO. This requires specifying both the `region` and `endpoint` within the `storage_options` parameter to direct LanceDB to the custom S3 endpoint, enabling connectivity beyond AWS S3.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/object_store.md#_snippet_5
LANGUAGE: python
CODE:
```
import lance
ds = lance.dataset(
"s3://bucket/path",
storage_options={
"region": "us-east-1",
"endpoint": "http://minio:9000",
}
)
```
----------------------------------------
TITLE: Load and parse Flickr8k token file annotations
DESCRIPTION: This code reads the 'Flickr8k.token.txt' file, which contains image annotations. It then processes each line to extract the image file name, a unique caption number, and the caption text itself, storing them as structured tuples for further processing.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/flickr8k_dataset_creation.md#_snippet_1
LANGUAGE: python
CODE:
```
with open(captions, "r") as fl:
annotations = fl.readlines()
# Converts the annotations where each element of this list is a tuple consisting of image file name, caption number and caption itself
annotations = list(map(lambda x: tuple([*x.split('\t')[0].split('#'), x.split('\t')[1]]), annotations))
```
----------------------------------------
TITLE: Lance File Footer and Overall Layout Specification
DESCRIPTION: Provides a detailed byte-level specification of the .lance file format, including the arrangement of data pages, column metadata, offset tables, and the final footer. It outlines alignment requirements and the structure of various fields.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/format/format.md#_snippet_4
LANGUAGE: APIDOC
CODE:
```
// Note: the number of buffers (BN) is independent of the number of columns (CN)
// and pages.
//
// Buffers often need to be aligned. 64-byte alignment is common when
// working with SIMD operations. 4096-byte alignment is common when
// working with direct I/O. In order to ensure these buffers are aligned
// writers may need to insert padding before the buffers.
//
// If direct I/O is required then most (but not all) fields described
// below must be sector aligned. We have marked these fields with an
// asterisk for clarity. Readers should assume there will be optional
// padding inserted before these fields.
//
// All footer fields are unsigned integers written with little endian
// byte order.
//
// (byte-level layout diagram of data pages, column metadata, offset tables, and the footer omitted)
```
----------------------------------------
TITLE: LanceDB Conflict Resolution Process
DESCRIPTION: Outlines the commit process in LanceDB, detailing how writers handle concurrent modifications, create transaction files for conflict detection, and retry commits after checking for compatibility with successful writes.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/format/format.md#_snippet_11
LANGUAGE: APIDOC
CODE:
```
Commit Process:
1. Writer finishes writing all data files.
2. Writer creates a transaction file in _transactions directory.
- Purpose: detect conflicts, re-build manifest during retries.
3. Check for new commits since writer started.
- If conflicts detected (via transaction files), abort commit.
4. Build manifest and attempt to commit to next version.
- If commit fails due to concurrent write, go back to step 3.
Conflict Detection:
- Conservative approach: assume conflict if transaction file is missing or has unknown operation.
```
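The commit loop can be paraphrased as a rough Python sketch; `write_transaction_file`, `detect_conflicts`, `try_commit_manifest`, and `checkout_latest_version` are hypothetical helpers standing in for the real implementation:
```
def commit(dataset, transaction, max_retries: int = 10):
    # Steps 1-2: data files are already written; record the operation for conflict detection.
    txn_path = write_transaction_file(dataset, transaction)          # hypothetical helper
    for _ in range(max_retries):
        # Step 3: look at commits that landed since we started; abort on a real conflict.
        if detect_conflicts(dataset, transaction):                   # hypothetical helper
            raise RuntimeError(f"conflicting commit detected ({txn_path})")
        # Step 4: build the manifest and try to commit it as the next version.
        if try_commit_manifest(dataset, transaction):                # hypothetical helper
            return
        # Another writer won the race for this version number: refresh and retry.
        dataset = checkout_latest_version(dataset)                   # hypothetical helper
    raise RuntimeError("too many concurrent commits, giving up")
```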
----------------------------------------
TITLE: Lance Dataset Directory Structure
DESCRIPTION: Illustrates the typical organization of a Lance dataset within a directory, detailing the location of data files, version manifests, secondary indices, and deletion files.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/format/format.md#_snippet_0
LANGUAGE: plaintext
CODE:
```
/path/to/dataset:
data/*.lance -- Data directory
_versions/*.manifest -- Manifest file for each dataset version.
_indices/{UUID-*}/index.idx -- Secondary index, each index per directory.
_deletions/*.{arrow,bin} -- Deletion files, which contain ids of rows
that have been deleted.
```
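A small sketch that walks such a directory with the standard library to show which component each file belongs to (the root path is a placeholder):
```
import pathlib

root = pathlib.Path("/path/to/dataset")
for sub in ("data", "_versions", "_indices", "_deletions"):
    for p in sorted((root / sub).rglob("*")):
        if p.is_file():
            print(f"{sub:12s} {p.relative_to(root)}")
```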
----------------------------------------
TITLE: Define PyArrow Schema for Lance Dataset in Python
DESCRIPTION: This Python code defines a PyArrow schema for the Lance dataset. It specifies the data types for `image_id` (string), `image` (binary), and `captions` (list of strings), ensuring proper data structure and type consistency for the dataset when written to Lance.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/flickr8k_dataset_creation.md#_snippet_4
LANGUAGE: python
CODE:
```
schema = pa.schema([
pa.field("image_id", pa.string()),
pa.field("image", pa.binary()),
pa.field("captions", pa.list_(pa.string(), -1)),
])
```
----------------------------------------
TITLE: Define image augmentations for CLIP model training
DESCRIPTION: This snippet defines a `torchvision.transforms.Compose` object for basic image augmentations applied during CLIP model training. It includes converting images to tensors, resizing them to a consistent shape, and normalizing pixel values to stabilize the training process.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/clip_training.md#_snippet_4
LANGUAGE: python
CODE:
```
train_augments = transforms.Compose(
[
transforms.ToTensor(),
transforms.Resize(Config.img_size),
transforms.Normalize([0.5], [0.5]),
]
)
```
----------------------------------------
TITLE: Generate GIST-1M Database Vectors Lance Dataset
DESCRIPTION: Uses the `datagen.py` script to convert GIST-1M base vectors into a Lance dataset. This dataset will serve as the primary data source for indexing and querying in the GIST-1M benchmark.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/sift/README.md#_snippet_9
LANGUAGE: sh
CODE:
```
./datagen.py ./gist/gist_base.fvecs ./.lancedb/gist1m.lance -g 1024 -m 50000 -d 960
```
----------------------------------------
TITLE: Set Object Store Timeout for a Single Dataset (Python)
DESCRIPTION: Shows how to specify storage options, such as a timeout, for a specific Lance dataset using the `storage_options` parameter in `lance.dataset`. This allows fine-grained control over individual dataset configurations without affecting global settings.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/object_store.md#_snippet_1
LANGUAGE: python
CODE:
```
import lance
ds = lance.dataset("s3://path", storage_options={"timeout": "60s"})
```
----------------------------------------
TITLE: Connect LanceDB to Google Cloud Storage
DESCRIPTION: This Python snippet demonstrates how to connect a LanceDB dataset to Google Cloud Storage using `storage_options` to specify service account credentials. It provides an alternative to setting the `GOOGLE_SERVICE_ACCOUNT` environment variable.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/object_store.md#_snippet_7
LANGUAGE: python
CODE:
```
import lance
ds = lance.dataset(
"gs://my-bucket/my-dataset",
storage_options={
"service_account": "path/to/service-account.json",
}
)
```
----------------------------------------
TITLE: Read and Write Lance Data with Ray and Pandas
DESCRIPTION: This snippet demonstrates how to write data to the Lance format using Ray's data sink (`ray.data.Dataset.write_lance`) and subsequently read it back using both the native Lance API (`lance.dataset`) and Ray's data source (`ray.data.read_lance`). It includes assertions to verify data integrity after read/write operations.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/integrations/ray.md#_snippet_0
LANGUAGE: python
CODE:
```
import lance
import ray
import pandas as pd

ray.init()

data = [
    {"id": 1, "name": "alice"},
    {"id": 2, "name": "bob"},
    {"id": 3, "name": "charlie"}
]
ray.data.from_items(data).write_lance("./alice_bob_and_charlie.lance")

# It can be read via lance directly
df = (
    lance.dataset("./alice_bob_and_charlie.lance")
    .to_table()
    .to_pandas()
    .sort_values(by=["id"])
    .reset_index(drop=True)
)
assert df.equals(pd.DataFrame(data)), "{} != {}".format(
    df, pd.DataFrame(data)
)

# Or via Ray.data.read_lance
ray_df = (
    ray.data.read_lance("./alice_bob_and_charlie.lance")
    .to_pandas()
    .sort_values(by=["id"])
    .reset_index(drop=True)
)
assert df.equals(ray_df)
```
----------------------------------------
TITLE: Load PyTorch Model State Dictionary from Lance Dataset (Python)
DESCRIPTION: This function reads all model weights from a specified Lance dataset file and constructs an OrderedDict suitable as a PyTorch model state dictionary. It iterates through each weight, converting it using _load_weight, and places it on the specified device. This function assumes all weights can fit into memory; large models may cause memory errors.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/artifact_management.md#_snippet_5
LANGUAGE: python
CODE:
```
def _load_state_dict(file_name: str, version: int = 1, map_location=None) -> OrderedDict:
    """Reads the model weights from lance file and returns a model state dict

    If the model weights are too large, this function will fail with a memory error.

    Args:
        file_name (str): Lance model name
        version (int): Version of the model to load
        map_location (str): Device to load the model on

    Returns:
        OrderedDict: Model state dict
    """
    ds = lance.dataset(file_name, version=version)
    weights = ds.take([x for x in range(ds.count_rows())]).to_pylist()

    state_dict = OrderedDict()
    for weight in weights:
        state_dict[weight["name"]] = _load_weight(weight).to(map_location)
    return state_dict
```
----------------------------------------
TITLE: Load Data Chunk from Lance Dataset by Indices
DESCRIPTION: This utility function, `from_indices`, efficiently loads specific elements from a Lance dataset based on a list of provided indices. It takes a Lance dataset object and a list of integer indices, then retrieves the corresponding rows. The function processes these rows to extract only the 'input_ids' from each, returning them as a list of token IDs, which is crucial for preparing data chunks.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/llm_training.md#_snippet_1
LANGUAGE: python
CODE:
```
def from_indices(dataset, indices):
    """Load the elements on given indices from the dataset"""
    chunk = dataset.take(indices).to_pylist()
    chunk = list(map(lambda x: x['input_ids'], chunk))
    return chunk
```
----------------------------------------
TITLE: Run LanceDB Vector Search Recall Test
DESCRIPTION: Defines run_test, a comprehensive function for evaluating LanceDB's vector search recall. It generates ground truth, writes data to a temporary LanceDB dataset, creates an IVF_PQ index, and performs nearest neighbor queries with varying nprobes and refine_factor to calculate recall for both in-sample and out-of-sample queries.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/full_report/report.ipynb#_snippet_2
LANGUAGE: python
CODE:
```
def run_test(
    data,
    query,
    metric,
    num_partitions=256,
    num_sub_vectors=8,
    nprobes_list=[1, 2, 5, 10, 16],
    refine_factor_list=[1, 2, 5, 10, 20],
):
    in_sample = data[random.sample(range(data.shape[0]), 1000), :]

    # ground truth
    print("generating gt")
    gt = knn(query, data, metric, 10)
    gt_in_sample = knn(in_sample, data, metric, 10)
    print("generated gt")

    with tempfile.TemporaryDirectory() as d:
        write_lance(d, data)
        ds = lance.dataset(d)

        # check that the brute force (un-indexed) search is exact
        for q, target in zip(tqdm(in_sample, desc="checking brute force"), gt_in_sample):
            res = ds.to_table(nearest={
                "column": "vec",
                "q": q,
                "k": 10,
                "metric": metric,
            }, columns=["id"])
            assert len(np.intersect1d(res["id"].to_numpy(), target)) == 10

        ds = ds.create_index("vec", "IVF_PQ", metric=metric, num_partitions=num_partitions, num_sub_vectors=num_sub_vectors)

        recall_data = []
        for nprobes in nprobes_list:
            for refine_factor in refine_factor_list:
                hits = 0
                for q, target in zip(tqdm(query, desc=f"out of sample, nprobes={nprobes}, refine={refine_factor}"), gt):
                    res = ds.to_table(nearest={
                        "column": "vec",
                        "q": q,
                        "k": 10,
                        "nprobes": nprobes,
                        "refine_factor": refine_factor,
                    }, columns=["id"])["id"].to_numpy()
                    hits += len(np.intersect1d(res, target))
                recall_data.append([
                    "out_of_sample",
                    nprobes,
                    refine_factor,
                    hits / 10 / len(gt),
                ])

                hits = 0  # reset the counter before the in-sample pass
                for q, target in zip(tqdm(in_sample, desc=f"in sample nprobes={nprobes}, refine={refine_factor}"), gt_in_sample):
                    res = ds.to_table(nearest={
                        "column": "vec",
                        "q": q,
                        "k": 10,
                        "nprobes": nprobes,
                        "refine_factor": refine_factor,
                    }, columns=["id"])["id"].to_numpy()
                    hits += len(np.intersect1d(res, target))
                recall_data.append([
                    "in_sample",
                    nprobes,
                    refine_factor,
                    hits / 10 / len(gt_in_sample),
                ])
    return recall_data
```
----------------------------------------
TITLE: Stream PyArrow RecordBatches to Lance Dataset
DESCRIPTION: Shows how to write a Lance dataset from an iterator of `pyarrow.RecordBatch` objects. This method is ideal for large datasets that cannot be fully loaded into memory, requiring a `pyarrow.Schema` to be provided.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/read_and_write.md#_snippet_1
LANGUAGE: python
CODE:
```
from typing import Iterator

import lance
import pyarrow as pa

def producer() -> Iterator[pa.RecordBatch]:
    """An iterator of RecordBatches."""
    yield pa.RecordBatch.from_pylist([{"name": "Alice", "age": 20}])
    yield pa.RecordBatch.from_pylist([{"name": "Bob", "age": 30}])

schema = pa.schema([
    ("name", pa.string()),
    ("age", pa.int32()),
])
ds = lance.write_dataset(producer(),
                         "./alice_and_bob.lance",
                         schema=schema, mode="overwrite")
print(ds.count_rows())  # Output: 2
```
----------------------------------------
TITLE: LanceDB External Manifest Store Commit Operations
DESCRIPTION: Details the four-step commit process when using an external manifest store for concurrent writes in LanceDB, involving staging manifests, committing paths to the external store, and finalizing the manifest in object storage.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/format/format.md#_snippet_12
LANGUAGE: APIDOC
CODE:
```
External Store Commit Process:
1. PUT_OBJECT_STORE mydataset.lance/_versions/{version}.manifest-{uuid}
- Action: Stage new manifest in object store under unique path.
2. PUT_EXTERNAL_STORE base_uri, version, mydataset.lance/_versions/{version}.manifest-{uuid}
- Action: Commit staged manifest path to external KV store.
- Note: Commit is effectively complete after this step.
3. COPY_OBJECT_STORE mydataset.lance/_versions/{version}.manifest-{uuid} mydataset.lance/_versions/{version}.manifest
- Action: Copy staged manifest to final path.
4. PUT_EXTERNAL_STORE base_uri, version, mydataset.lance/_versions/{version}.manifest
- Action: Update external store to point to final manifest.
```
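To make the ordering concrete, a minimal Python sketch of the same four steps follows. The `object_store` and `external_store` clients and their `put`/`copy` methods are hypothetical stand-ins, not part of the Lance API; the only thing the sketch asserts is the order of operations described above.

```
import uuid

def commit_with_external_store(object_store, external_store, base_uri, version, manifest_bytes):
    # Hypothetical helpers: `object_store.put/copy` and `external_store.put`
    # stand in for the object store and the external KV store.
    staged = f"{base_uri}/_versions/{version}.manifest-{uuid.uuid4()}"
    final = f"{base_uri}/_versions/{version}.manifest"

    # 1. Stage the new manifest under a unique path in the object store.
    object_store.put(staged, manifest_bytes)

    # 2. Record the staged path in the external KV store.
    #    The commit is effectively complete after this step.
    external_store.put((base_uri, version), staged)

    # 3. Copy the staged manifest to its final path.
    object_store.copy(staged, final)

    # 4. Point the external store at the finalized manifest path.
    external_store.put((base_uri, version), final)
```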
----------------------------------------
TITLE: Write PyArrow Table to Lance Dataset
DESCRIPTION: Demonstrates how to write a `pyarrow.Table` to a Lance dataset using the `lance.write_dataset` function. This is suitable for datasets that can be fully loaded into memory.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/read_and_write.md#_snippet_0
LANGUAGE: python
CODE:
```
import lance
import pyarrow as pa
table = pa.Table.from_pylist([{"name": "Alice", "age": 20},
{"name": "Bob", "age": 30}])
ds = lance.write_dataset(table, "./alice_and_bob.lance")
```
----------------------------------------
TITLE: Lance DataFragment Protobuf Message Reference
DESCRIPTION: References the Protobuf message definition for DataFragment, which represents a logical chunk of data within a Lance dataset. It can include one or more DataFiles and an optional DeletionFile.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/format/format.md#_snippet_2
LANGUAGE: APIDOC
CODE:
```
proto.message.DataFragment
```
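As a rough companion to the Protobuf definition, the sketch below walks a dataset's fragments from Python and prints what each one references. The accessor names used here (`data_files()`, `deletion_file()`) are assumed for illustration and may differ between Lance versions.

```
import lance

ds = lance.dataset("./alice_and_bob.lance")
for fragment in ds.get_fragments():
    # A fragment is a logical chunk backed by one or more data files,
    # plus an optional deletion file listing removed rows.
    print(fragment.fragment_id, fragment.count_rows())
    print(fragment.data_files())      # data files backing this fragment (assumed accessor)
    print(fragment.deletion_file())   # None when no rows have been deleted (assumed accessor)
```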
----------------------------------------
TITLE: Import Libraries for LanceDB Vector Search Testing
DESCRIPTION: Imports necessary Python libraries for numerical operations (numpy), temporary file handling (tempfile), data manipulation (pandas), plotting (seaborn, matplotlib), and LanceDB specific functionalities (lance, _lib). These imports provide the foundational tools for the vector search tests.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/full_report/report.ipynb#_snippet_1
LANGUAGE: python
CODE:
```
from _lib import knn, write_lance, _get_nyt_vectors
import numpy as np
import tempfile
import random
import lance
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from tqdm.auto import tqdm
```
----------------------------------------
TITLE: Generate SIFT-1M Database Vectors Lance Dataset
DESCRIPTION: Uses the `datagen.py` script to convert SIFT-1M base vectors into a Lance dataset. This dataset will serve as the primary data source for indexing and querying in the SIFT-1M benchmark.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/sift/README.md#_snippet_3
LANGUAGE: sh
CODE:
```
./datagen.py ./sift/sift_base.fvecs ./.lancedb/sift1m.lance -d 128
```
----------------------------------------
TITLE: Compact LanceDB Dataset Files with Python
DESCRIPTION: This Python code demonstrates how to compact data files within a LanceDB dataset using the `compact_files` method. It specifies a `target_rows_per_fragment` to optimize file count and can remove soft-deleted rows, improving query performance. Note that compaction creates a new table version and invalidates old row addresses for indexing.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/read_and_write.md#_snippet_21
LANGUAGE: python
CODE:
```
import lance
dataset = lance.dataset("./alice_and_bob.lance")
dataset.optimize.compact_files(target_rows_per_fragment=1024 * 1024)
```
----------------------------------------
TITLE: Prepare PyTorch Model State Dict for LanceDB Saving
DESCRIPTION: This utility function processes a PyTorch model's `state_dict`, iterating through each parameter. It flattens the parameter's tensor, extracts its name and original shape, and then packages this information into a PyArrow `RecordBatch`. This prepares the model weights for efficient storage in a LanceDB dataset.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/artifact_management.md#_snippet_2
LANGUAGE: python
CODE:
```
def _save_model_writer(state_dict):
    """Yields a RecordBatch for each parameter in the model state dict"""
    for param_name, param in state_dict.items():
        param_shape = list(param.size())
        param_value = param.flatten().tolist()
        yield pa.RecordBatch.from_arrays(
            [
                pa.array(
                    [param_name],
                    pa.string(),
                ),
                pa.array(
                    [param_value],
                    pa.list_(pa.float64(), -1),
                ),
                pa.array(
                    [param_shape],
                    pa.list_(pa.int64(), -1),
                ),
            ],
            ["name", "value", "shape"],
        )
```
----------------------------------------
TITLE: Create PyTorch DataLoader from LanceDataset (Safe)
DESCRIPTION: This snippet demonstrates how to create a multiprocessing-safe PyTorch DataLoader using `SafeLanceDataset` and `get_safe_loader`. It explicitly uses the 'spawn' method to avoid fork-safety issues that can arise when LanceDB's internal multithreading interacts with Python's multiprocessing.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/integrations/pytorch.md#_snippet_2
LANGUAGE: python
CODE:
```
from lance.torch.data import SafeLanceDataset, get_safe_loader

dataset = SafeLanceDataset(temp_lance_dataset)
# use spawn method to avoid fork-safety issues
loader = get_safe_loader(
    dataset,
    num_workers=2,
    batch_size=16,
    drop_last=False,
)
total_samples = 0
for batch in loader:
    total_samples += batch["id"].shape[0]
```
----------------------------------------
TITLE: Generate SIFT-1M Ground Truth Lance Dataset
DESCRIPTION: Generates a ground truth Lance dataset for SIFT-1M using the `gt.py` script. This dataset is essential for evaluating the recall of the benchmark queries against known correct results.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/sift/README.md#_snippet_4
LANGUAGE: sh
CODE:
```
./gt.py ./.lancedb/sift1m.lance -o ./.lancedb/ground_truth.lance
```
----------------------------------------
TITLE: Lindera User Dictionary Configuration File (config.yml)
DESCRIPTION: YAML configuration for Lindera, defining the segmenter mode and the path to the dictionary. This file, typically named `config.yml`, can be placed in the model's root directory or specified via the `LINDERA_CONFIG_PATH` environment variable. The `kind` field is not supported in LanceDB's context.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/tokenizer.md#_snippet_6
LANGUAGE: yaml
CODE:
```
segmenter:
  mode: "normal"
  dictionary:
    # Note: in lance, the `kind` field is not supported. You need to specify the model path using the `path` field instead.
    path: /path/to/lindera/ipadic/main
```
----------------------------------------
TITLE: Test LanceDB Vector Search with Random Data (L2 Metric)
DESCRIPTION: Demonstrates running the run_test function with randomly generated data (100,000 vectors, 64 dimensions) and queries, using the L2 (Euclidean) distance metric. It then visualizes the recall results using make_plot.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/full_report/report.ipynb#_snippet_4
LANGUAGE: python
CODE:
```
# test randomly generated data
data = np.random.standard_normal((100000, 64))
query = np.random.standard_normal((1000, 64))
recall_data = run_test(
data,
query,
"L2",
)
make_plot(recall_data)
```
----------------------------------------
TITLE: Lance ColumnMetadata Protobuf Message Reference
DESCRIPTION: References the Protobuf message definition for ColumnMetadata, which is used to describe the encoding and properties of individual columns within a .lance file.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/format/format.md#_snippet_3
LANGUAGE: APIDOC
CODE:
```
proto.message.ColumnMetadata
```
----------------------------------------
TITLE: Generate GIST-1M Query Vectors Lance Dataset
DESCRIPTION: Converts GIST-1M query vectors into a Lance dataset using `datagen.py`. These vectors will be used to perform similarity searches against the indexed database during the benchmark.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/sift/README.md#_snippet_10
LANGUAGE: sh
CODE:
```
./datagen.py ./gist/gist_query.fvecs ./.lancedb/gist_query.lance -g 1024 -m 50000 -d 960 -n 1000
```
----------------------------------------
TITLE: Test LanceDB Vector Search with Random Data (Cosine Metric)
DESCRIPTION: Shows how to execute the run_test function using randomly generated data and queries, but this time employing the cosine similarity metric. The recall performance is subsequently plotted using make_plot.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/full_report/report.ipynb#_snippet_5
LANGUAGE: python
CODE:
```
# test randomly generated data -- cosine
data = np.random.standard_normal((100000, 64))
query = np.random.standard_normal((1000, 64))
recall_data = run_test(
data,
query,
"cosine",
)
make_plot(recall_data)
```
----------------------------------------
TITLE: Load PyTorch Model with Weights from Lance Dataset (Python)
DESCRIPTION: This high-level function facilitates loading weights directly into a given PyTorch model from a Lance dataset. It internally calls _load_state_dict to retrieve the complete state dictionary and then applies it to the provided model instance. This simplifies the process of restoring a model's state from a Lance-backed storage.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/artifact_management.md#_snippet_6
LANGUAGE: python
CODE:
```
def load_model(
    model: torch.nn.Module, file_name: str, version: int = 1, map_location=None
):
    """Loads the model weights from lance file and sets them to the model

    Args:
        model (torch.nn.Module): PyTorch model
        file_name (str): Lance model name
        version (int): Version of the model to load
        map_location (str): Device to load the model on
    """
    state_dict = _load_state_dict(file_name, version=version, map_location=map_location)
    model.load_state_dict(state_dict)
```
----------------------------------------
TITLE: Connect LanceDB to Azure Blob Storage
DESCRIPTION: This Python snippet illustrates how to connect a LanceDB dataset to Azure Blob Storage. It shows how to pass `account_name` and `account_key` directly via `storage_options`, offering an alternative to environment variables.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/object_store.md#_snippet_9
LANGUAGE: python
CODE:
```
import lance
ds = lance.dataset(
"az://my-container/my-dataset",
storage_options={
"account_name": "some-account",
"account_key": "some-key",
}
)
```
----------------------------------------
TITLE: Default Lance Language Model Home Directory
DESCRIPTION: This snippet illustrates the default directory path where LanceDB stores language models if the LANCE_LANGUAGE_MODEL_HOME environment variable is not explicitly set by the user.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/tokenizer.md#_snippet_0
LANGUAGE: bash
CODE:
```
${system data directory}/lance/language_models
```
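To store models somewhere else, the override can be as small as setting the environment variable before creating any tokenizer-backed index. A minimal sketch, with an assumed directory path:

```
import os

# Assumed path for illustration; any writable directory works.
os.environ["LANCE_LANGUAGE_MODEL_HOME"] = "/opt/lance/language_models"
```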
----------------------------------------
TITLE: Perform Random Row Access in Lance Dataset
DESCRIPTION: This Python snippet demonstrates Lance's capability for fast random access to individual rows using the `take()` method. This feature is crucial for workflows like random sampling, shuffling in ML training, and building secondary indices for enhanced query performance.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/read_and_write.md#_snippet_20
LANGUAGE: python
CODE:
```
data = ds.take([1, 100, 500], columns=["image", "label"])
```
----------------------------------------
TITLE: Configure AWS Credentials for LanceDB S3 Dataset
DESCRIPTION: Demonstrates how to pass AWS access key ID, secret access key, and session token directly to the `storage_options` parameter when initializing a LanceDB dataset from an S3 path. This method provides explicit credential management for S3 access, overriding environment variables if set.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/object_store.md#_snippet_3
LANGUAGE: python
CODE:
```
import lance
ds = lance.dataset(
"s3://bucket/path",
storage_options={
"access_key_id": "my-access-key",
"secret_access_key": "my-secret-key",
"session_token": "my-session-token",
}
)
```
----------------------------------------
TITLE: Create Scalar Index with Jieba Tokenizer in Python
DESCRIPTION: Python code demonstrating how to create a scalar index on a 'text' field using the 'INVERTED' index type, specifying 'jieba/default' as the base tokenizer for text processing within LanceDB.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/tokenizer.md#_snippet_2
LANGUAGE: python
CODE:
```
ds.create_scalar_index("text", "INVERTED", base_tokenizer="jieba/default")
```
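Once the inverted index exists, full-text queries can be pushed down through the scanner. A small follow-up sketch, assuming the `full_text_query` argument of `to_table` behaves as named here:

```
# Query the inverted index; the query string is tokenized with the same
# jieba tokenizer configured when the index was created.
results = ds.to_table(
    full_text_query="我们都有光明的前途",
    columns=["text"],
)
print(results.to_pandas())
```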
----------------------------------------
TITLE: Add and Populate Columns with SQL Expressions in Lance
DESCRIPTION: Illustrates adding and populating new columns in a Lance dataset using SQL expressions. This method allows defining column values based on existing columns or literal values, enabling data backfill within a single operation.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/data_evolution.md#_snippet_1
LANGUAGE: python
CODE:
```
table = pa.table({"name": pa.array(["Alice", "Bob", "Carla"])})
dataset = lance.write_dataset(table, "names")
dataset.add_columns({
"hash": "sha256(name)",
"status": "'active'",
})
print(dataset.to_table().to_pandas())
```
----------------------------------------
TITLE: Perform Nearest Neighbor Vector Search on Lance Dataset
DESCRIPTION: Demonstrates how to perform nearest neighbor searches on a Lance dataset with a vector index. It samples query vectors using DuckDB and then retrieves the top 10 similar vectors for each query using Lance's `nearest` functionality, showcasing its vector search capabilities.
SOURCE: https://github.com/lancedb/lance/blob/main/README.md#_snippet_9
LANGUAGE: python
CODE:
```
# Get top 10 similar vectors
import duckdb
dataset = lance.dataset(uri)
# Sample 100 query vectors. If this segfaults, make sure you have duckdb v0.7+ installed
sample = duckdb.query("SELECT vector FROM dataset USING SAMPLE 100").to_df()
query_vectors = np.array([np.array(x) for x in sample.vector])
# Get nearest neighbors for all of them
rs = [dataset.to_table(nearest={"column": "vector", "k": 10, "q": q})
for q in query_vectors]
```
----------------------------------------
TITLE: Convert Parquet to Lance Dataset
DESCRIPTION: Demonstrates how to convert a Pandas DataFrame to a PyArrow Table, save it as a Parquet file, and then convert the Parquet dataset into a Lance dataset. This showcases Lance's compatibility with existing data formats and its ease of use for data migration.
SOURCE: https://github.com/lancedb/lance/blob/main/README.md#_snippet_2
LANGUAGE: python
CODE:
```
import lance
import pandas as pd
import pyarrow as pa
import pyarrow.dataset
df = pd.DataFrame({"a": [5], "b": [10]})
uri = "/tmp/test.parquet"
tbl = pa.Table.from_pandas(df)
pa.dataset.write_dataset(tbl, uri, format='parquet')
parquet = pa.dataset.dataset(uri, format='parquet')
lance.write_dataset(parquet, "/tmp/test.lance")
```
----------------------------------------
TITLE: Define PyArrow Schema with Lance Encoding Metadata
DESCRIPTION: This Python snippet demonstrates how to define a PyArrow schema for a LanceDB table, applying column-level encoding configurations. It shows how to use PyArrow field metadata to specify compression algorithms, compression levels, structural encoding strategies, and packed memory layout for string columns.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/format/format.md#_snippet_7
LANGUAGE: python
CODE:
```
import pyarrow as pa

schema = pa.schema([
    pa.field(
        "compressible_strings",
        pa.string(),
        metadata={
            "lance-encoding:compression": "zstd",
            "lance-encoding:compression-level": "3",
            "lance-encoding:structural-encoding": "miniblock",
            "lance-encoding:packed": "true",
        },
    )
])
```
----------------------------------------
TITLE: Configure Seaborn Plot Style
DESCRIPTION: This snippet imports the seaborn library and sets the default plot style to 'darkgrid'. This improves the visual aesthetics of subsequent plots generated using seaborn.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/sift/Results.ipynb#_snippet_1
LANGUAGE: python
CODE:
```
import seaborn as sns
sns.set_style("darkgrid")
```
----------------------------------------
TITLE: Generate SIFT-1M Query Vectors Lance Dataset
DESCRIPTION: Converts SIFT-1M query vectors into a Lance dataset using `datagen.py`. These vectors will be used to perform similarity searches against the indexed database during the benchmark.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/sift/README.md#_snippet_5
LANGUAGE: sh
CODE:
```
./datagen.py ./sift/sift_query.fvecs ./.lancedb/sift_query.lance -d 128 -n 1000
```
----------------------------------------
TITLE: Convert SIFT1M Dataset to Lance for Vector Search
DESCRIPTION: Loads the SIFT1M dataset from a binary file, converts its raw vector data into a NumPy array, and then transforms it into a Lance table using `vec_to_table`. The dataset is then written to a Lance file, optimized for vector search with specific row group and file size settings.
SOURCE: https://github.com/lancedb/lance/blob/main/README.md#_snippet_7
LANGUAGE: python
CODE:
```
import lance
from lance.vector import vec_to_table
import numpy as np
import struct

nvecs = 1000000
ndims = 128
with open("sift/sift_base.fvecs", mode="rb") as fobj:
    buf = fobj.read()
    data = np.array(struct.unpack("<128000000f", buf[4 : 4 + 4 * nvecs * ndims])).reshape((nvecs, ndims))
    dd = dict(zip(range(nvecs), data))

table = vec_to_table(dd)
uri = "vec_data.lance"
sift1m = lance.write_dataset(table, uri, max_rows_per_group=8192, max_rows_per_file=1024*1024)
```
----------------------------------------
TITLE: Load Entire Lance Dataset into Memory
DESCRIPTION: This Python snippet shows how to load an entire Lance dataset into an in-memory table using the `to_table()` method. This approach is straightforward and suitable for datasets that can comfortably fit within available memory.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/read_and_write.md#_snippet_13
LANGUAGE: python
CODE:
```
table = ds.to_table()
```
----------------------------------------
TITLE: Lance SQL Type to Apache Arrow Type Mapping
DESCRIPTION: This table provides a comprehensive mapping between SQL data types supported by Lance and their corresponding Apache Arrow data types. It details the internal storage format for various data representations, crucial for understanding data compatibility and performance.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/read_and_write.md#_snippet_19
LANGUAGE: APIDOC
CODE:
```
| SQL type | Arrow type |
|----------|------------|
| `boolean` | `Boolean` |
| `tinyint` / `tinyint unsigned` | `Int8` / `UInt8` |
| `smallint` / `smallint unsigned` | `Int16` / `UInt16` |
| `int` or `integer` / `int unsigned` or `integer unsigned` | `Int32` / `UInt32` |
| `bigint` / `bigint unsigned` | `Int64` / `UInt64` |
| `float` | `Float32` |
| `double` | `Float64` |
| `decimal(precision, scale)` | `Decimal128` |
| `date` | `Date32` |
| `timestamp` | `Timestamp` (1) |
| `string` | `Utf8` |
| `binary` | `Binary` |
```
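As a small illustration of the mapping in practice, the filter below compares an `Int64` column with an integer literal and a `Utf8` column with a string literal; the dataset path reuses the `alice_and_bob.lance` example from earlier snippets.

```
import lance

ds = lance.dataset("./alice_and_bob.lance")
# `age >= 21` is evaluated against the Int64 column and `name != 'Bob'`
# against the Utf8 column, following the mapping above.
adults = ds.to_table(filter="age >= 21 AND name != 'Bob'")
print(adults.to_pandas())
```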
----------------------------------------
TITLE: Visualize LanceDB Vector Search Recall Heatmap
DESCRIPTION: Defines make_plot, a utility function to visualize the recall data generated by run_test. It takes the recall data (a list of lists) and converts it into a pandas DataFrame, then uses seaborn to generate heatmaps showing recall across different nprobes and refine_factor values for various test cases.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/full_report/report.ipynb#_snippet_3
LANGUAGE: python
CODE:
```
def make_plot(recall_data):
    df = pd.DataFrame(recall_data, columns=["case", "nprobes", "refine_factor", "recall"])
    num_cases = len(df["case"].unique())
    (fig, axs) = plt.subplots(1, 2, figsize=(16, 8))
    for case, ax in zip(df["case"].unique(), axs):
        current_case = df[df["case"] == case]
        sns.heatmap(
            current_case.drop(columns=["case"]).set_index(["nprobes", "refine_factor"])["recall"].unstack(),
            annot=True,
            ax=ax,
        ).set(title=f"Recall -- {case}")
```
----------------------------------------
TITLE: Count unique video titles in dataset
DESCRIPTION: Converts the loaded dataset to a Pandas DataFrame and counts the number of unique video titles. This provides an overview of the diversity and scope of the video content within the dataset.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/youtube_transcript_search.ipynb#_snippet_2
LANGUAGE: python
CODE:
```
data.to_pandas().title.nunique()
```
----------------------------------------
TITLE: Describe Median Latency by Refine Factor
DESCRIPTION: This snippet groups the DataFrame by the 'refine_factor' column and calculates descriptive statistics for the '50%' (median response time) column. This provides an understanding of latency variations across different refinement factors.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/sift/Results.ipynb#_snippet_6
LANGUAGE: python
CODE:
```
df.groupby("refine_factor")["50%"].describe()
```
----------------------------------------
TITLE: Utility functions to load images and captions from Lance dataset
DESCRIPTION: These two Python functions, `load_image` and `load_caption`, facilitate loading data from a Lance dataset. `load_image` converts byte-formatted images to a usable image format using numpy and OpenCV, while `load_caption` extracts the longest caption associated with an image, assuming it contains the most descriptive information.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/clip_training.md#_snippet_2
LANGUAGE: python
CODE:
```
def load_image(ds, idx):
    # Utility function to load an image at an index and convert it from bytes format to img format
    raw_img = ds.take([idx], columns=['image']).to_pydict()
    raw_img = np.frombuffer(b''.join(raw_img['image']), dtype=np.uint8)
    img = cv2.imdecode(raw_img, cv2.IMREAD_COLOR)
    img = cv2.cvtColor(img, cv2.COLOR_RGB2BGR)
    return img

def load_caption(ds, idx):
    # Utility function to load an image's caption. Currently we return the longest caption of all
    captions = ds.take([idx], columns=['captions']).to_pydict()['captions'][0]
    return max(captions, key=len)
```
----------------------------------------
TITLE: Save PyTorch Model Weights to LanceDB with Versioning
DESCRIPTION: This function saves a PyTorch model's `state_dict` to a LanceDB file. It utilizes the `_save_model_writer` utility to format the data. The function supports both overwriting existing model weights or saving them as a new version within the Lance dataset, providing flexibility for model checkpoint management.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/artifact_management.md#_snippet_3
LANGUAGE: python
CODE:
```
def save_model(state_dict: OrderedDict, file_name: str, version=False):
    """Saves a PyTorch model in lance file format

    Args:
        state_dict (OrderedDict): Model state dict
        file_name (str): Lance model name
        version (bool): Whether to save as a new version or overwrite the existing versions,
            if the lance file already exists
    """
    # Create a reader
    reader = pa.RecordBatchReader.from_batches(
        GLOBAL_SCHEMA, _save_model_writer(state_dict)
    )

    if os.path.exists(file_name):
        if version:
            # If we want versioning, we use the overwrite mode to create a new version
            lance.write_dataset(
                reader, file_name, schema=GLOBAL_SCHEMA, mode="overwrite"
            )
        else:
            # If we don't want versioning, we delete the existing file and write a new one
            shutil.rmtree(file_name)
            lance.write_dataset(reader, file_name, schema=GLOBAL_SCHEMA)
    else:
        # If the file doesn't exist, we write a new one
        lance.write_dataset(reader, file_name, schema=GLOBAL_SCHEMA)
```
----------------------------------------
TITLE: Protobuf Definition for Row ID Sequence Storage
DESCRIPTION: This protobuf oneof field defines how row ID sequences are stored. Small sequences are stored directly as `inline_sequence` bytes to avoid I/O overhead, while large sequences are referenced via an `external_file` path to optimize storage and retrieval.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/format/format.md#_snippet_16
LANGUAGE: Protobuf
CODE:
```
oneof row_id_sequence {
  // Inline sequence
  bytes inline_sequence = 1;
  // External file reference
  string external_file = 2;
} // row_id_sequence
```
----------------------------------------
TITLE: Drop Columns in LanceDB Dataset
DESCRIPTION: This snippet demonstrates how to drop columns from a LanceDB dataset using the `lance.LanceDataset.drop_columns` method. This is a metadata-only operation, making it very fast. It also explains that physical data removal requires `lance.dataset.DatasetOptimizer.compact_files()` followed by `lance.LanceDataset.cleanup_old_versions()`.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/data_evolution.md#_snippet_4
LANGUAGE: python
CODE:
```
table = pa.table({"id": pa.array([1, 2, 3]),
"name": pa.array(["Alice", "Bob", "Carla"])})
dataset = lance.write_dataset(table, "names", mode="overwrite")
dataset.drop_columns(["name"])
print(dataset.schema)
# id: int64
```
----------------------------------------
TITLE: Define CLIP Model Components (ImageEncoder, TextEncoder, Head) in PyTorch
DESCRIPTION: This snippet defines the core neural network modules for a CLIP-like model. ImageEncoder uses a pre-trained vision model (e.g., ResNet) to convert images to feature vectors. TextEncoder uses a pre-trained language model (e.g., BERT) for text embeddings. The Head module projects these features into a common embedding space using linear layers, GELU activation, dropout, and layer normalization.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/clip_training.md#_snippet_5
LANGUAGE: python
CODE:
```
class ImageEncoder(nn.Module):
    """Encodes the Image"""
    def __init__(self, model_name, pretrained=True):
        super().__init__()
        self.backbone = timm.create_model(
            model_name,
            pretrained=pretrained,
            num_classes=0,
            global_pool="avg"
        )
        for param in self.backbone.parameters():
            param.requires_grad = True

    def forward(self, img):
        return self.backbone(img)


class TextEncoder(nn.Module):
    """Encodes the Caption"""
    def __init__(self, model_name):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(model_name)
        for param in self.backbone.parameters():
            param.requires_grad = True

    def forward(self, captions):
        output = self.backbone(**captions)
        return output.last_hidden_state[:, 0, :]


class Head(nn.Module):
    """Projects both into Embedding space"""
    def __init__(self, embedding_dim, projection_dim):
        super().__init__()
        self.projection = nn.Linear(embedding_dim, projection_dim)
        self.gelu = nn.GELU()
        self.fc = nn.Linear(projection_dim, projection_dim)
        self.dropout = nn.Dropout(0.3)
        self.layer_norm = nn.LayerNorm(projection_dim)

    def forward(self, x):
        projected = self.projection(x)
        x = self.gelu(projected)
        x = self.fc(x)
        x = self.dropout(x)
        x += projected
        return self.layer_norm(x)
```
----------------------------------------
TITLE: Retrieve Specific Records from a Lance Dataset in Rust
DESCRIPTION: Retrieves specific records from a Lance Dataset based on their indices and a projection. The result is a RecordBatch containing the requested data.
SOURCE: https://github.com/lancedb/lance/blob/main/rust/lance/README.md#_snippet_3
LANGUAGE: rust
CODE:
```
let values: Result<RecordBatch> = dataset.take(&[200, 199, 39, 40, 100], &projection).await;
```
----------------------------------------
TITLE: Define PyArrow schema for storing PyTorch model weights in Lance
DESCRIPTION: This snippet defines a `pyarrow.Schema` named `GLOBAL_SCHEMA` specifically designed for storing PyTorch model weights within the Lance file format. The schema includes three fields: 'name' (string) for the weight's identifier, 'value' (list of float64) for the flattened weight tensor, and 'shape' (list of int64) to preserve the original dimensions for reconstruction.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/artifact_management.md#_snippet_1
LANGUAGE: python
CODE:
```
GLOBAL_SCHEMA = pa.schema(
    [
        pa.field("name", pa.string()),
        pa.field("value", pa.list_(pa.float64(), -1)),
        pa.field("shape", pa.list_(pa.int64(), -1)),  # A list with variable shape because weights can have any number of dims
    ]
)
```
----------------------------------------
TITLE: Create Lance ImageURIArray from URI List
DESCRIPTION: This snippet demonstrates how to initialize a `lance.arrow.ImageURIArray` from a list of image URIs. This array type is designed to store references to images in various storage systems (local, file, S3) for lazy loading, without validating or loading the images into memory immediately.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/arrays.md#_snippet_3
LANGUAGE: python
CODE:
```
from lance.arrow import ImageURIArray
ImageURIArray.from_uris([
"/tmp/image1.jpg",
"file:///tmp/image2.jpg",
"s3://example/image3.jpg"
])
# <lance.arrow.ImageURIArray object at 0x...>
# ['/tmp/image1.jpg', 'file:///tmp/image2.jpg', 's3://example/image3.jpg']
```
----------------------------------------
TITLE: Lance Execution Node Contract Definition
DESCRIPTION: Defines the contract for various execution nodes within Lance's I/O execution plan, detailing their parameters, input schemas, and output schemas.
SOURCE: https://github.com/lancedb/lance/blob/main/__wiki__/I-O-Execution.md#_snippet_0
LANGUAGE: APIDOC
CODE:
```
Execution Nodes:
Scan:
Parameters: dataset, projected columns
Input Schema: N/A
Output Schema: projected columns
Filter:
Parameters: input node, filter
Input Schema: any
Output Schema: input schema + columns in filters
Take:
Parameters: input node
Input Schema: any, must have a "_rowid" column
Output Schema: input schema minus _rowid
KNNFlatExec:
Parameters: input node, query
Input Schema: any
Output Schema: input schema + {"scores"}
KNNIndexExec:
Parameters: dataset
Input Schema: N/A
Output Schema: {"score", "_rowid"}
```
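A hedged sketch of how these nodes surface in practice: a filtered projection typically plans as a Scan over the filter columns, a Filter, and then a Take that fetches the projected columns by `_rowid`. The `explain_plan` call on the scanner is assumed here rather than taken from this document.

```
import lance

ds = lance.dataset("./alice_and_bob.lance")
scanner = ds.scanner(columns=["name"], filter="age > 21")
# Print the physical plan to see which execution nodes were chosen
# (assumed API; output format varies by version).
print(scanner.explain_plan(verbose=True))
```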
----------------------------------------
TITLE: Drop Columns from LanceDB Dataset in Java
DESCRIPTION: Shows how to remove specified columns from a LanceDB dataset. This operation simplifies the dataset's schema by eliminating unnecessary fields.
SOURCE: https://github.com/lancedb/lance/blob/main/java/README.md#_snippet_8
LANGUAGE: java
CODE:
```
void dropColumns() {
    String datasetPath = ""; // specify a path pointing to a dataset
    try (BufferAllocator allocator = new RootAllocator()) {
        try (Dataset dataset = Dataset.open(datasetPath, allocator)) {
            dataset.dropColumns(Collections.singletonList("name"));
        }
    }
}
```
----------------------------------------
TITLE: Describe Median Latency by IVF Index
DESCRIPTION: This snippet groups the DataFrame by the 'ivf' column and calculates descriptive statistics (count, mean, std, min, max, quartiles) for the '50%' (median response time) column. This helps understand latency distribution across different IVF index configurations.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/sift/Results.ipynb#_snippet_3
LANGUAGE: python
CODE:
```
df.groupby("ivf")["50%"].describe()
```
----------------------------------------
TITLE: Update Rows with Complex SQL Expressions
DESCRIPTION: Shows how to update column values using complex SQL expressions that can reference existing columns, such as incrementing an age column by a fixed value.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/read_and_write.md#_snippet_5
LANGUAGE: python
CODE:
```
import lance
dataset = lance.dataset("./alice_and_bob.lance")
dataset.update({"age": "age + 2"})
```
----------------------------------------
TITLE: Add Rows to Lance Dataset
DESCRIPTION: Illustrates two methods for adding new rows to an existing Lance dataset: using the `LanceDataset.insert` method for direct insertion and using `lance.write_dataset` with `mode="append"` to append new data.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/read_and_write.md#_snippet_2
LANGUAGE: python
CODE:
```
import lance
import pyarrow as pa
table = pa.Table.from_pylist([{"name": "Alice", "age": 20},
{"name": "Bob", "age": 30}])
ds = lance.write_dataset(table, "./insert_example.lance")
new_table = pa.Table.from_pylist([{"name": "Carla", "age": 37}])
ds.insert(new_table)
print(ds.to_table().to_pandas())
# name age
# 0 Alice 20
# 1 Bob 30
# 2 Carla 37
new_table2 = pa.Table.from_pylist([{"name": "David", "age": 42}])
ds = lance.write_dataset(new_table2, ds, mode="append")
print(ds.to_table().to_pandas())
# name age
# 0 Alice 20
# 1 Bob 30
# 2 Carla 37
# 3 David 42
```
----------------------------------------
TITLE: Bulk Update Rows in LanceDB Dataset using Merge Insert
DESCRIPTION: Demonstrates how to efficiently replace existing rows in a LanceDB dataset with new data using `merge_insert` and `when_matched_update_all()`. This operation uses a key for matching rows, typically a unique identifier. Note that modified rows are re-inserted, changing their position to the end of the table.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/read_and_write.md#_snippet_7
LANGUAGE: python
CODE:
```
import lance
import pyarrow as pa

dataset = lance.dataset("./alice_and_bob.lance")
print(dataset.to_table().to_pandas())
# name age
# 0 Alice 20
# 1 Bob 30

# Change the ages of both Alice and Bob
new_table = pa.Table.from_pylist([{"name": "Alice", "age": 2},
                                  {"name": "Bob", "age": 3}])

# This will use `name` as the key for matching rows. Merge insert
# uses a JOIN internally and so you typically want this column to
# be a unique key or id of some kind.
rst = dataset.merge_insert("name") \
    .when_matched_update_all() \
    .execute(new_table)
print(dataset.to_table().to_pandas())
# name age
# 0 Alice 2
# 1 Bob 3
----------------------------------------
TITLE: Load Single Weight Tensor from Lance Dataset (Python)
DESCRIPTION: This function converts a single weight entry, retrieved as a dictionary from a Lance dataset, into a PyTorch tensor. It reshapes the flattened 'value' array using the 'shape' information stored within the weight dictionary. The output is a torch.Tensor ready for further processing.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/artifact_management.md#_snippet_4
LANGUAGE: python
CODE:
```
def _load_weight(weight: dict) -> torch.Tensor:
    """Converts a weight dict to a torch tensor"""
    return torch.tensor(weight["value"], dtype=torch.float64).reshape(weight["shape"])
```
----------------------------------------
TITLE: Perform Parallel Writes with lance.fragment.write_fragments
DESCRIPTION: This code demonstrates how to write new data fragments in parallel across multiple workers using `lance.fragment.write_fragments`. Each worker generates its own set of fragments, which are then printed for verification. This is the first phase of a distributed write operation, preparing data for a later commit.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/distributed_write.md#_snippet_0
LANGUAGE: python
CODE:
```
import json
from lance.fragment import write_fragments
# Run on each worker
data_uri = "./dist_write"
schema = pa.schema([
("a", pa.int32()),
("b", pa.string()),
])
# Run on worker 1
data1 = {
"a": [1, 2, 3],
"b": ["x", "y", "z"],
}
fragments_1 = write_fragments(data1, data_uri, schema=schema)
print("Worker 1: ", fragments_1)
# Run on worker 2
data2 = {
"a": [4, 5, 6],
"b": ["u", "v", "w"],
}
fragments_2 = write_fragments(data2, data_uri, schema=schema)
print("Worker 2: ", fragments_2)
```
----------------------------------------
TITLE: Drop Lance Dataset in Java
DESCRIPTION: This Java code illustrates how to permanently delete a Lance dataset from the file system. It takes the dataset's path and uses the static `Dataset.drop` method to remove all associated files and metadata. This operation is irreversible and should be used with caution.
SOURCE: https://github.com/lancedb/lance/blob/main/java/README.md#_snippet_4
LANGUAGE: Java
CODE:
```
void dropDataset() {
    String datasetPath = tempDir.resolve("drop_stream").toString();
    Dataset.drop(datasetPath, new HashMap<>());
}
```
----------------------------------------
TITLE: LanceDB Statistics Storage
DESCRIPTION: Describes how statistics (null count, min, max) are stored within Lance files in a columnar format, enabling selective reading for query optimization.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/format/format.md#_snippet_14
LANGUAGE: APIDOC
CODE:
```
Statistics Storage:
- Location: Stored within Lance files.
- Purpose: Determine which pages to skip during queries.
- Data Points: null count, lower bound (min), upper bound (max).
- Format: Lance's columnar format.
- Benefit: Allows selective reading of relevant stats columns.
```
----------------------------------------
TITLE: Alter Columns in LanceDB Dataset in Java
DESCRIPTION: Illustrates how to modify existing columns within a LanceDB dataset. This includes renaming a column, changing its nullability, or casting its data type to a new ArrowType, facilitating schema adjustments.
SOURCE: https://github.com/lancedb/lance/blob/main/java/README.md#_snippet_7
LANGUAGE: java
CODE:
```
void alterColumns() {
    String datasetPath = ""; // specify a path pointing to a dataset
    try (BufferAllocator allocator = new RootAllocator()) {
        try (Dataset dataset = Dataset.open(datasetPath, allocator)) {
            ColumnAlteration nameColumnAlteration =
                new ColumnAlteration.Builder("name")
                    .rename("new_name")
                    .nullable(true)
                    .castTo(new ArrowType.Utf8())
                    .build();
            dataset.alterColumns(Collections.singletonList(nameColumnAlteration));
        }
    }
}
```
----------------------------------------
TITLE: Group and sort captions by image ID
DESCRIPTION: This section iterates through all unique image IDs found in the annotations. For each image, it collects all associated captions and sorts them based on their original annotation number, ensuring the correct order of captions for each image. The result is a list of tuples, each containing an image ID and a tuple of its ordered captions.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/flickr8k_dataset_creation.md#_snippet_2
LANGUAGE: python
CODE:
```
captions = []
image_ids = set(ann[0] for ann in annotations)

for img_id in tqdm(image_ids):
    current_img_captions = []
    for ann_img_id, num, caption in annotations:
        if img_id == ann_img_id:
            current_img_captions.append((num, caption))

    # Sort by the annotation number
    current_img_captions.sort(key=lambda x: x[0])
    captions.append((img_id, tuple([x[1] for x in current_img_captions])))
```
----------------------------------------
TITLE: Create Scalar Index with Lindera Tokenizer in Python
DESCRIPTION: Python code demonstrating how to create a scalar index on a 'text' field using the 'INVERTED' index type, specifying 'lindera/ipadic' as the base tokenizer for text processing within LanceDB.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/tokenizer.md#_snippet_5
LANGUAGE: python
CODE:
```
ds.create_scalar_index("text", "INVERTED", base_tokenizer="lindera/ipadic")
```
----------------------------------------
TITLE: Create Pandas Series with Lance BFloat16 Dtype
DESCRIPTION: This snippet demonstrates how to create a Pandas Series using the `lance.bfloat16` custom dtype. It shows the initialization of a Series with floating-point numbers, which are then converted to the BFloat16 format, suitable for machine learning applications.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/arrays.md#_snippet_0
LANGUAGE: python
CODE:
```
import lance.arrow
pd.Series([1.1, 2.1, 3.4], dtype="lance.bfloat16")
# 0 1.1015625
# 1 2.09375
# 2 3.40625
# dtype: lance.bfloat16
```
----------------------------------------
TITLE: Define Lance Schema with Blob Column in Python
DESCRIPTION: This Python code demonstrates how to define a PyArrow schema for a Lance dataset, marking a `large_binary` column as a blob column by setting the `lance-encoding:blob` metadata to `true`. This configuration enables Lance to efficiently store and retrieve large binary objects.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/blob.md#_snippet_0
LANGUAGE: python
CODE:
```
import pyarrow as pa

schema = pa.schema(
    [
        pa.field("id", pa.int64()),
        pa.field(
            "video",
            pa.large_binary(),
            metadata={"lance-encoding:blob": "true"},
        ),
    ]
)
```
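A rough usage sketch under the schema above: write one row with a large binary payload, then retrieve it lazily as a file-like object instead of materializing the bytes. The `take_blobs` call and its argument order are assumed here and may differ between Lance versions.

```
import lance
import pyarrow as pa

table = pa.table({"id": [1], "video": [b"\x00" * 1024]}, schema=schema)
ds = lance.write_dataset(table, "./videos.lance", schema=schema)

# Fetch the blob for the first row as a file-like object (assumed API).
blobs = ds.take_blobs([0], "video")
header = blobs[0].read(16)  # stream only the bytes that are needed
```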
----------------------------------------
TITLE: Describe Median Latency by NProbes
DESCRIPTION: This snippet groups the DataFrame by the 'nprobes' column and calculates descriptive statistics for the '50%' (median response time) column. This helps analyze how the number of probes affects median query latency.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/sift/Results.ipynb#_snippet_5
LANGUAGE: python
CODE:
```
df.groupby("nprobes")["50%"].describe()
```
----------------------------------------
TITLE: Test LanceDB Vector Search with NYT TF-IDF Vectors (Cosine Metric)
DESCRIPTION: Illustrates testing LanceDB's vector search with real-world data: sparse TF-IDF vectors from the New York Times dataset, projected to 256 dimensions. It uses the cosine similarity metric and custom index parameters (num_partitions=256, num_sub_vectors=32).
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/full_report/report.ipynb#_snippet_6
LANGUAGE: python
CODE:
```
# test NYT -- TF-IDF sparse vectors projected on to 256D dense -- cosine
data = _get_nyt_vectors()
data = data[np.linalg.norm(data, axis=1) != 0]
data = np.unique(data, axis=0)
query = np.random.standard_normal((100, 256))
recall_data = run_test(
data,
query,
"cosine",
num_partitions=256,
num_sub_vectors=32,
)
make_plot(recall_data)
```
----------------------------------------
TITLE: Test LanceDB Vector Search with NYT TF-IDF Vectors (Normalized L2 Metric)
DESCRIPTION: Presents a test case using the same NYT TF-IDF vectors, but normalized for L2 distance, effectively making it equivalent to cosine similarity on normalized vectors. It uses the L2 metric with specific index parameters (num_partitions=512, num_sub_vectors=32) and visualizes the recall.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/full_report/report.ipynb#_snippet_7
LANGUAGE: python
CODE:
```
# test NYT -- TF-IDF sparse vectors projected on to 256D dense -- normalized L2
data = _get_nyt_vectors()
data = data[np.linalg.norm(data, axis=1) != 0]
data = np.unique(data, axis=0)
data /= np.linalg.norm(data, axis=1)[:, None]
# use the same out of sample query
recall_data = run_test(
data,
query,
"L2",
num_partitions=512,
num_sub_vectors=32,
)
make_plot(recall_data)
```
----------------------------------------
TITLE: Update Rows in Lance Dataset by SQL Expression
DESCRIPTION: Demonstrates how to update specific columns of rows in a Lance dataset using the `lance.LanceDataset.update` method. The update values are SQL expressions, allowing for direct value assignment.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/read_and_write.md#_snippet_4
LANGUAGE: python
CODE:
```
import lance
dataset = lance.dataset("./alice_and_bob.lance")
dataset.update({"name": "'Bob'"}, where="name = 'Blob'")
```
----------------------------------------
TITLE: Iteratively Read Large Lance Dataset in Batches
DESCRIPTION: This Python snippet demonstrates how to read a Lance dataset in batches, which is ideal for datasets too large to fit into memory. It uses `to_batches()` with column projection and filter push-down, allowing processing of data chunks iteratively.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/read_and_write.md#_snippet_15
LANGUAGE: python
CODE:
```
for batch in ds.to_batches(columns=["image"], filter="label = 10"):
    # do something with batch
    compute_on_batch(batch)
```
----------------------------------------
TITLE: Perform Upsert Operation (Update or Insert) in LanceDB
DESCRIPTION: Shows how to combine `when_matched_update_all()` and `when_not_matched_insert_all()` within `merge_insert` to achieve an 'upsert' behavior. This operation updates rows if they exist and inserts them if they do not, providing a flexible way to synchronize data.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/read_and_write.md#_snippet_9
LANGUAGE: python
CODE:
```
import lance
import pyarrow as pa
# Change Carla's age and insert David
new_table = pa.Table.from_pylist([{"name": "Carla", "age": 27},
{"name": "David", "age": 42}])
dataset = lance.dataset("./alice_and_bob.lance")
# This will update Carla and insert David
_ = dataset.merge_insert("name") \
.when_matched_update_all() \
.when_not_matched_insert_all() \
.execute(new_table)
# Verify the results
print(dataset.to_table().to_pandas())
# name age
# 0 Alice 20
# 1 Bob 30
# 2 Carla 27
# 3 David 42
```
----------------------------------------
TITLE: Configure LanceDB for S3 Express One Zone Buckets
DESCRIPTION: Shows how to explicitly configure LanceDB to access S3 Express One Zone (directory) buckets, especially when the bucket name is hidden by an access point or private link. This involves setting the `region` and `s3_express` flag in `storage_options` for direct access.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/object_store.md#_snippet_6
LANGUAGE: python
CODE:
```
import lance
ds = lance.dataset(
"s3://my-bucket--use1-az4--x-s3/path/imagenet.lance",
storage_options={
"region": "us-east-1",
"s3_express": "true",
}
)
```
----------------------------------------
TITLE: Add Schema-Only Columns to Lance Dataset
DESCRIPTION: Demonstrates how to add new columns to a Lance dataset without populating them, using `pyarrow.Field` or `pyarrow.Schema`. This operation is metadata-only and very efficient, useful for lazy population.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/data_evolution.md#_snippet_0
LANGUAGE: python
CODE:
```
table = pa.table({"id": pa.array([1, 2, 3])})
dataset = lance.write_dataset(table, "null_columns")
# With pyarrow Field
dataset.add_columns(pa.field("embedding", pa.list_(pa.float32(), 128)))
assert dataset.schema == pa.schema([
("id", pa.int64()),
("embedding", pa.list_(pa.float32(), 128)),
])
# With pyarrow Schema
dataset.add_columns(pa.schema([
("label", pa.string()),
("score", pa.float32()),
]))
assert dataset.schema == pa.schema([
("id", pa.int64()),
("embedding", pa.list_(pa.float32(), 128)),
("label", pa.string()),
("score", pa.float32()),
])
```
----------------------------------------
TITLE: Commit Collected Fragments to a Lance Dataset
DESCRIPTION: After parallel writes, this snippet shows how to serialize fragment metadata from all workers, collect them on a single worker, and then commit them to a Lance dataset using `lance.LanceOperation.Overwrite`. It verifies the commit by reading the dataset and asserting its properties, demonstrating the final step of a distributed write.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/distributed_write.md#_snippet_1
LANGUAGE: python
CODE:
```
import json
from lance import FragmentMetadata, LanceOperation
# Serialize Fragments into JSON data
fragments_json1 = [json.dumps(fragment.to_json()) for fragment in fragments_1]
fragments_json2 = [json.dumps(fragment.to_json()) for fragment in fragments_2]
# On one worker, collect all fragments
all_fragments = [FragmentMetadata.from_json(f) for f in \
fragments_json1 + fragments_json2]
# Commit the fragments into a single dataset
# Use LanceOperation.Overwrite to overwrite the dataset or create new dataset.
op = lance.LanceOperation.Overwrite(schema, all_fragments)
read_version = 0 # Because it is empty at the time.
lance.LanceDataset.commit(
data_uri,
op,
read_version=read_version,
)
# We can read the dataset using the Lance API:
dataset = lance.dataset(data_uri)
assert len(dataset.get_fragments()) == 2
assert dataset.version == 1
print(dataset.to_table().to_pandas())
```
----------------------------------------
TITLE: Merge Pre-computed Columns into Lance Dataset
DESCRIPTION: Explains how to integrate pre-computed columns into an existing Lance dataset using the `merge` method. This approach avoids rewriting the entire dataset by joining new data based on a specified column, as demonstrated with an 'id' column.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/data_evolution.md#_snippet_3
LANGUAGE: python
CODE:
```
table = pa.table({
"id": pa.array([1, 2, 3]),
"embedding": pa.array([np.array([1, 2, 3]), np.array([4, 5, 6]),
np.array([7, 8, 9])])
})
dataset = lance.write_dataset(table, "embeddings", mode="overwrite")
new_data = pa.table({
"id": pa.array([1, 2, 3]),
"label": pa.array(["horse", "rabbit", "cat"])
})
dataset.merge(new_data, "id")
print(dataset.to_table().to_pandas())
```
----------------------------------------
TITLE: SQL Filter Expression with Escaped Column Names
DESCRIPTION: This SQL snippet shows how to handle column names that are SQL keywords or contain special characters (like spaces) by escaping them with backticks. It also demonstrates accessing nested fields with escaped names to ensure correct parsing.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/read_and_write.md#_snippet_17
LANGUAGE: sql
CODE:
```
`CUBE` = 10 AND `column name with space` IS NOT NULL
AND `nested with space`.`inner with space` < 2
```
----------------------------------------
TITLE: LanceDB Page-level Statistics Schema Definition
DESCRIPTION: This schema defines the structure for storing page-level statistics for each field (column) within a Lance file. It includes the null count, minimum value, and maximum value for each field, typed according to the field's original data type. The schema is flexible, allowing for missing fields and future extensions.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/format/format.md#_snippet_15
LANGUAGE: APIDOC
CODE:
```
<field_id_1>: struct
null_count: i64
min_value: <field_1_data_type>
max_value: <field_1_data_type>
...
<field_id_N>: struct
null_count: i64
min_value: <field_N_data_type>
max_value: <field_N_data_type>
```
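For illustration only (this sketches the logical shape rather than how Lance materializes its statistics internally), the stats schema for two hypothetical fields (field 0 as an `int64` "id", field 1 as a `float32` "score") could be expressed in pyarrow as:

```
import pyarrow as pa

# Hypothetical example: one struct per field id, typed by that field's data type
stats_schema = pa.schema([
    ("0", pa.struct([
        ("null_count", pa.int64()),
        ("min_value", pa.int64()),
        ("max_value", pa.int64()),
    ])),
    ("1", pa.struct([
        ("null_count", pa.int64()),
        ("min_value", pa.float32()),
        ("max_value", pa.float32()),
    ])),
])
```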
----------------------------------------
TITLE: Define Custom TensorSpec for Lance TensorFlow Dataset Output
DESCRIPTION: This code shows how to explicitly define the `tf.TensorSpec` for the output signature of a `tf.data.Dataset` created from Lance. This is crucial for precise type and shape control, especially when automatic inference is insufficient or for complex data structures like ragged tensors.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/integrations/tensorflow.md#_snippet_2
LANGUAGE: python
CODE:
```
import tensorflow as tf

import lance.tf.data

batch_size = 256
ds = lance.tf.data.from_lance(
    "s3://my-bucket/my-dataset",
    columns=["image", "labels"],
    batch_size=batch_size,
    output_signature={
        "image": tf.TensorSpec(shape=(), dtype=tf.string),
        "labels": tf.RaggedTensorSpec(
            dtype=tf.int32, shape=(batch_size, None), ragged_rank=1),
    },
)
```
----------------------------------------
TITLE: SQL Literals for Date, Timestamp, and Decimal Types
DESCRIPTION: This SQL snippet illustrates how to specify literals for date, timestamp, and decimal columns in Lance filter expressions. It shows the syntax for casting string values to specific data types, ensuring correct interpretation during query execution.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/read_and_write.md#_snippet_18
LANGUAGE: sql
CODE:
```
date_col = date '2021-01-01'
and timestamp_col = timestamp '2021-01-01 00:00:00'
and decimal_col = decimal(8,3) '1.000'
```
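These typed literals are used like any other filter expression. A minimal sketch (the dataset path is hypothetical and the dataset is assumed to contain `date_col`, `timestamp_col`, and `decimal_col`):

```
import lance

dataset = lance.dataset("./example.lance")  # hypothetical dataset path
table = dataset.to_table(
    filter="date_col = date '2021-01-01' "
           "and timestamp_col = timestamp '2021-01-01 00:00:00' "
           "and decimal_col = decimal(8,3) '1.000'"
)
```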
----------------------------------------
TITLE: Add New Columns to a Lance Dataset in a Distributed Manner
DESCRIPTION: This snippet demonstrates adding new columns to a Lance dataset efficiently without copying existing data. It shows how to merge columns on individual fragments across workers using `frag.merge_columns` and then commit the changes using `lance.LanceOperation.Merge` on a single worker. This leverages Lance's two-dimensional layout for metadata-only operations, making column additions highly efficient.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/distributed_write.md#_snippet_3
LANGUAGE: python
CODE:
```
import lance
from pyarrow import RecordBatch
import pyarrow.compute as pc
dataset = lance.dataset("./add_columns_example")
assert len(dataset.get_fragments()) == 2
assert dataset.to_table().combine_chunks() == pa.Table.from_pydict({
"name": ["alice", "bob", "charlie", "craig", "dave", "eve"],
"age": [25, 33, 44, 55, 66, 77],
}, schema=schema)
def name_len(names: RecordBatch) -> RecordBatch:
return RecordBatch.from_arrays(
[pc.utf8_length(names["name"])],
["name_len"],
)
# On Worker 1
frag1 = dataset.get_fragments()[0]
new_fragment1, new_schema = frag1.merge_columns(name_len, ["name"])
# On Worker 2
frag2 = dataset.get_fragments()[1]
new_fragment2, _ = frag2.merge_columns(name_len, ["name"])
# On Worker 3 - Commit
all_fragments = [new_fragment1, new_fragment2]
op = lance.LanceOperation.Merge(all_fragments, schema=new_schema)
lance.LanceDataset.commit(
"./add_columns_example",
op,
read_version=dataset.version,
)
# Verify dataset
dataset = lance.dataset("./add_columns_example")
print(dataset.to_table().to_pandas())
```
----------------------------------------
TITLE: Plot Median Query Latency Histogram
DESCRIPTION: This snippet generates a histogram of the median query latency using seaborn's `displot` function. It visualizes the distribution of the '50%' column (median response time) from the DataFrame and sets appropriate x and y axis labels.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/sift/Results.ipynb#_snippet_2
LANGUAGE: python
CODE:
```
import seaborn as sns

# `df` is the benchmark-results DataFrame built in the earlier cells.
ax = sns.displot(df, x="50%")
ax.set(xlabel="Median response time (seconds)", ylabel="Number of configurations")
```
----------------------------------------
TITLE: Implement Custom PyTorch Sampler for Non-Overlapping Data
DESCRIPTION: The `LanceSampler` class is a custom PyTorch `Sampler` designed to prevent overlapping samples during LLM training, which can lead to overfitting. It ensures that the indices returned are `block_size` apart, guaranteeing that each sample processed by the model is unique and non-redundant. The sampler pre-calculates and shuffles available indices, yielding them during iteration to provide distinct data chunks for each batch.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/llm_training.md#_snippet_3
LANGUAGE: python
CODE:
```
import numpy as np
from torch.utils.data import Sampler


class LanceSampler(Sampler):
    r"""Samples tokens randomly but `block_size` indices apart.

    Args:
        data_source (Dataset): dataset to sample from
        block_size (int): minimum index distance between each random sample
    """

    def __init__(self, data_source, block_size=512):
        self.data_source = data_source
        self.num_samples = len(self.data_source)
        self.available_indices = list(range(0, self.num_samples, block_size))
        np.random.shuffle(self.available_indices)

    def __iter__(self):
        yield from self.available_indices

    def __len__(self) -> int:
        return len(self.available_indices)
```
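A usage sketch (the `tokens_dataset` name is a placeholder for whatever map-style dataset of tokenized text is being trained on): the sampler is handed to a PyTorch `DataLoader` so that the start indices of consecutive samples stay at least `block_size` apart.

```
from torch.utils.data import DataLoader

# Hypothetical: `tokens_dataset` is a map-style dataset of token ids
sampler = LanceSampler(tokens_dataset, block_size=512)
loader = DataLoader(tokens_dataset, batch_size=8, sampler=sampler)
for batch in loader:
    ...  # each sample in the batch starts at a distinct, non-overlapping offset
```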
----------------------------------------
TITLE: Insert New Rows Only in LanceDB Dataset
DESCRIPTION: Illustrates how to use `merge_insert` with `when_not_matched_insert_all()` to insert data only if it doesn't already exist in the dataset. This is useful for preventing duplicate entries when processing batches of data where some records might have been added previously.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/read_and_write.md#_snippet_8
LANGUAGE: python
CODE:
```
import lance
import pyarrow as pa

# Bob is already in the table, but Carla is new
new_table = pa.Table.from_pylist([{"name": "Bob", "age": 30},
                                  {"name": "Carla", "age": 37}])
dataset = lance.dataset("./alice_and_bob.lance")
# This will insert Carla but leave Bob unchanged
_ = dataset.merge_insert("name") \
.when_not_matched_insert_all() \
.execute(new_table)
# Verify that Carla was added but Bob remains unchanged
print(dataset.to_table().to_pandas())
# name age
# 0 Alice 20
# 1 Bob 30
# 2 Carla 37
```
----------------------------------------
TITLE: Replace Filtered Data with New Rows in LanceDB
DESCRIPTION: Explains a less common but powerful use case of `merge_insert` to replace a specific region of existing rows (defined by a filter) with new data. This effectively acts as a combined delete and insert operation within a single transaction, using `when_not_matched_by_source_delete()`.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/read_and_write.md#_snippet_10
LANGUAGE: python
CODE:
```
import lance
import pyarrow as pa
new_table = pa.Table.from_pylist([{"name": "Edgar", "age": 46},
{"name": "Francene", "age": 44}])
dataset = lance.dataset("./alice_and_bob.lance")
print(dataset.to_table().to_pandas())
# name age
# 0 Alice 20
# 1 Bob 30
# 2 Charlie 45
# 3 Donna 50
# This will remove anyone aged 40 or older and insert our new data
_ = dataset.merge_insert("name") \
.when_not_matched_insert_all() \
.when_not_matched_by_source_delete("age >= 40") \
.execute(new_table)
# Verify the results - people over 40 replaced with new data
print(dataset.to_table().to_pandas())
# name age
# 0 Alice 20
# 1 Bob 30
# 2 Edgar 46
# 3 Francene 44
```
----------------------------------------
TITLE: Distributed Training with Shuffled Lance Fragments in TensorFlow
DESCRIPTION: This snippet outlines a strategy for distributed training by sharding and shuffling Lance fragments across multiple workers. It uses `lance_fragments` to manage the distribution of data, ensuring each worker processes a unique subset of the dataset for efficient parallel training.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/integrations/tensorflow.md#_snippet_3
LANGUAGE: python
CODE:
```
import tensorflow as tf
from lance.tf.data import from_lance, lance_fragments

world_size = 32
rank = 10
seed = 123
epoch = 100
dataset_uri = "s3://my-bucket/my-dataset"

# Shuffle fragments and keep only this worker's shard.
fragments = (
    lance_fragments(dataset_uri)
    .shuffle(32, seed=seed)
    .repeat(epoch)
    .enumerate()
    .filter(lambda i, _: i % world_size == rank)
    .map(lambda _, fid: fid)
)

ds = from_lance(
    dataset_uri,
    columns=["image", "label"],
    fragments=fragments,
    batch_size=32,
)
for batch in ds:
    print(batch)
```
----------------------------------------
TITLE: LanceDB Deletion File Naming Convention
DESCRIPTION: This snippet specifies the naming convention for deletion files in LanceDB, which are used to mark rows for deletion. It details the components of the filename, including fragment ID, read version, and a random ID, along with the file type suffix (Arrow or Roaring Bitmap).
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/format/format.md#_snippet_9
LANGUAGE: text
CODE:
```
_deletions/{fragment_id}-{read_version}-{random_id}.{arrow|bin}
```
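For illustration only (the fragment id, read version, and random id below are made-up values), a concrete path following this convention might look like:

```
# Hypothetical values for illustration
fragment_id, read_version, random_id = 42, 10, 9123
arrow_path = f"_deletions/{fragment_id}-{read_version}-{random_id}.arrow"
bitmap_path = f"_deletions/{fragment_id}-{read_version}-{random_id}.bin"
# arrow_path  == "_deletions/42-10-9123.arrow"
# bitmap_path == "_deletions/42-10-9123.bin"
```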
----------------------------------------
TITLE: Convert NumPy BFloat16 Array to Lance Extension Arrays
DESCRIPTION: This snippet demonstrates how to convert an existing NumPy array of `bfloat16` dtype into Lance's `PandasBFloat16Array` or `BFloat16Array`. It showcases the interoperability between NumPy's `ml_dtypes` and Lance's extension arrays, facilitating data integration.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/arrays.md#_snippet_2
LANGUAGE: python
CODE:
```
import numpy as np
from ml_dtypes import bfloat16
from lance.arrow import PandasBFloat16Array, BFloat16Array
np_array = np.array([1.1, 2.1, 3.4], dtype=bfloat16)
PandasBFloat16Array.from_numpy(np_array)
# <PandasBFloat16Array>
# [1.1015625, 2.09375, 3.40625]
# Length: 3, dtype: lance.bfloat16
BFloat16Array.from_numpy(np_array)
# <lance.arrow.BFloat16Array object at 0x...>
# [
# 1.1015625,
# 2.09375,
# 3.40625
# ]
```
----------------------------------------
TITLE: Rename Nested Columns in LanceDB Dataset
DESCRIPTION: This snippet demonstrates how to rename nested columns within a LanceDB dataset using `lance.LanceDataset.alter_columns`. It shows how to specify nested paths using dot notation (e.g., 'meta.id') and verifies the renaming by printing the dataset's content.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/data_evolution.md#_snippet_6
LANGUAGE: python
CODE:
```
import lance
import pyarrow as pa

data = [
    {"meta": {"id": 1, "name": "Alice"}},
    {"meta": {"id": 2, "name": "Bob"}},
]
schema = pa.schema([
    ("meta", pa.struct([
        ("id", pa.int32()),
        ("name", pa.string()),
    ]))
])
dataset = lance.write_dataset(data, "nested_rename", schema=schema)
dataset.alter_columns({"path": "meta.id", "name": "new_id"})
print(dataset.to_table().to_pandas())
# meta
# 0 {'new_id': 1, 'name': 'Alice'}
# 1 {'new_id': 2, 'name': 'Bob'}
```
----------------------------------------
TITLE: Delete Rows from Lance Dataset by SQL Filter
DESCRIPTION: Explains how to delete rows from a Lance dataset using a SQL-like filter expression with the `LanceDataset.delete` method. Note that this operation creates a new version of the dataset, requiring it to be reopened to see changes.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/read_and_write.md#_snippet_3
LANGUAGE: python
CODE:
```
import lance
dataset = lance.dataset("./alice_and_bob.lance")
dataset.delete("name = 'Bob'")
dataset2 = lance.dataset("./alice_and_bob.lance")
print(dataset2.to_table().to_pandas())
# name age
# 0 Alice 20
```