
======================== CODE SNIPPETS

TITLE: Run LanceDB documentation examples tests DESCRIPTION: Checks the documentation examples for correctness and consistency, ensuring they function as expected.

SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_3

LANGUAGE: shell CODE:

make doctest

TITLE: Install documentation website requirements DESCRIPTION: This command installs the necessary Python packages for building the main documentation website, which is powered by mkdocs-material. It ensures all dependencies are met before serving the docs.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/community/contributing.md#_snippet_7

LANGUAGE: bash CODE:

pip install -r docs/requirements.txt

TITLE: Build and serve documentation website locally DESCRIPTION: These commands navigate to the docs directory and start a local development server for the documentation website. This allows contributors to preview changes to the documentation in real-time.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/community/contributing.md#_snippet_8

LANGUAGE: bash CODE:

cd docs
mkdocs serve

TITLE: Perform Python development installation DESCRIPTION: These commands navigate into the python directory and perform a development installation of the Lance Python bindings. This allows developers to import and test changes to the Python wrapper directly.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/community/contributing.md#_snippet_1

LANGUAGE: bash CODE:

cd python
maturin develop

TITLE: Example output of git commit with pre-commit hooks DESCRIPTION: Demonstrates the console output when committing changes after pre-commit hooks are installed, showing the execution and status of linters like black, isort, and ruff.

SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_8

LANGUAGE: shell CODE:

git commit -m"Changed some python files"
black....................................................................Passed
isort (python)...........................................................Passed
ruff.....................................................................Passed
[main daf91ed] Changed some python files
 1 file changed, 1 insertion(+), 1 deletion(-)

TITLE: Install LanceDB test dependencies DESCRIPTION: Installs the necessary Python packages for running tests, including optional test dependencies specified in the project's setup.

SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_1

LANGUAGE: shell CODE:

pip install '.[tests]'

TITLE: Install pre-commit tool for LanceDB DESCRIPTION: Installs the pre-commit tool, which enables running formatters and linters automatically before each Git commit.

SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_6

LANGUAGE: shell CODE:

pip install pre-commit

TITLE: Download and Extract SIFT 1M Dataset DESCRIPTION: This snippet provides shell commands to download and extract the SIFT 1M dataset, which is used as a large-scale example for vector search demonstrations. It includes commands to clean up previous downloads and extract the compressed archive.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/quickstart.md#_snippet_11

LANGUAGE: bash CODE:

rm -rf sift* vec_data.lance
wget ftp://ftp.irisa.fr/local/texmex/corpus/sift.tar.gz
tar -xzf sift.tar.gz

TITLE: Create Pandas DataFrame DESCRIPTION: This code demonstrates how to create a simple Pandas DataFrame. This DataFrame serves as a basic example for subsequent operations, such as writing data to a Lance dataset.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/quickstart.md#_snippet_1

LANGUAGE: python CODE:

df = pd.DataFrame({"a": [5]})
df

TITLE: TPCH Benchmark Setup and Execution DESCRIPTION: This snippet outlines the steps to set up the dataset and run the TPCH Q1 benchmark comparing LanceDB and Parquet. It includes navigating to the benchmark directory, creating a dataset folder, downloading and renaming the required Parquet file, and executing the benchmark script. Note: generating the Lance file happens as part of the benchmark process rather than via an explicit command; a hedged conversion sketch follows the setup commands below.

SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/tpch/README.md#_snippet_0

LANGUAGE: Shell CODE:

cd lance/benchmarks/tpch
mkdir dataset && cd dataset
wget https://github.com/cwida/duckdb-data/releases/download/v1.0/lineitemsf1.snappy.parquet -O lineitem_sf1.parquet
cd ..
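
A Lance copy of the Parquet file is produced as part of the benchmark itself; a minimal, hypothetical sketch of that conversion (the output file name is an assumption, not taken from the benchmark script) could look like:

LANGUAGE: python CODE:

import lance
import pyarrow.dataset as ds

# Hypothetical conversion step: read the downloaded Parquet file and write it out
# as a Lance dataset; the output path expected by benchmark.py is an assumption.
parquet = ds.dataset("dataset/lineitem_sf1.parquet")
lance.write_dataset(parquet, "dataset/lineitem_sf1.lance")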

LANGUAGE: Shell CODE:

python3 benchmark.py q1

TITLE: Install LanceDB pre-commit hooks DESCRIPTION: Installs the pre-commit hooks defined in the project's configuration, activating automatic linting and formatting on commit attempts.

SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_7

LANGUAGE: shell CODE:

pre-commit install

TITLE: Install Python bindings build tool DESCRIPTION: This command installs maturin, a tool essential for building Python packages that integrate with Rust code. It's a prerequisite for setting up the Python development environment for Lance.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/community/contributing.md#_snippet_0

LANGUAGE: bash CODE:

pip install maturin

TITLE: Start Local Services for S3 Integration Tests DESCRIPTION: Before running S3 integration tests, you need to start local Minio and DynamoDB services. This command uses Docker Compose to bring up these required services, ensuring the test environment is ready.

SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_20

LANGUAGE: Shell CODE:

docker compose up

TITLE: Install preview pylance Python SDK via pip DESCRIPTION: Install the preview version of the pylance Python SDK to access the latest features and bug fixes. This uses a specific extra index URL for LanceDB's PyPI.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/install.md#_snippet_1

LANGUAGE: Bash CODE:

pip install --pre --extra-index-url https://pypi.fury.io/lancedb/ pylance

TITLE: Access Specific Lance Dataset Version DESCRIPTION: This example demonstrates how to load and query a specific historical version of a Lance dataset. By specifying the version parameter, users can access data as it existed at a particular point in time, enabling historical analysis or rollbacks.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/quickstart.md#_snippet_8

LANGUAGE: python CODE:

# Version 1
lance.dataset('/tmp/test.lance', version=1).to_table().to_pandas()

# Version 2
lance.dataset('/tmp/test.lance', version=2).to_table().to_pandas()

TITLE: Install stable pylance Python SDK via pip DESCRIPTION: Install the stable and recommended version of the pylance Python SDK using the pip package manager.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/install.md#_snippet_0

LANGUAGE: Bash CODE:

pip install pylance

TITLE: Run all LanceDB tests DESCRIPTION: Executes the full test suite for the LanceDB project using the make test command.

SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_2

LANGUAGE: shell CODE:

make test

TITLE: Install Linux Perf Tools and Configure Kernel Parameters DESCRIPTION: Installs necessary Linux performance tools (perf) on Ubuntu systems and configures the perf_event_paranoid kernel parameter. This setup is crucial for allowing non-root users to collect performance data using tools like perf and flamegraph.

SOURCE: https://github.com/lancedb/lance/blob/main/wiki/Debug.md#_snippet_4

LANGUAGE: sh CODE:

sudo apt install linux-tools-common linux-tools-generic linux-tools-`uname -r`
sudo sh -c "echo -1 >  /proc/sys/kernel/perf_event_paranoid"

TITLE: Load Lance Vector Dataset DESCRIPTION: This snippet shows how to load a previously created Lance vector dataset. This step is essential before performing any vector search queries or other operations on the dataset.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/quickstart.md#_snippet_13

LANGUAGE: python CODE:

uri = "vec_data.lance"
sift1m = lance.dataset(uri)

TITLE: Prepare Parquet File from Pandas DataFrame DESCRIPTION: This code prepares a Parquet file from a Pandas DataFrame using PyArrow. It cleans up any existing Parquet or Lance files to ensure a fresh start, then converts the DataFrame to a PyArrow Table and writes it as a Parquet dataset.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/quickstart.md#_snippet_3

LANGUAGE: python CODE:

shutil.rmtree("/tmp/test.parquet", ignore_errors=True)
shutil.rmtree("/tmp/test.lance", ignore_errors=True)

tbl = pa.Table.from_pandas(df)
pa.dataset.write_dataset(tbl, "/tmp/test.parquet", format='parquet')

parquet = pa.dataset.dataset("/tmp/test.parquet")
parquet.to_table().to_pandas()

TITLE: Install required Python libraries DESCRIPTION: Installs necessary Python packages for data handling, OpenAI API interaction, rate limiting, and LanceDB. The --quiet flag suppresses verbose output during installation.

SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/youtube_transcript_search.ipynb#_snippet_0

LANGUAGE: python CODE:

pip install --quiet openai tqdm ratelimiter retry datasets pylance

TITLE: Run Rust unit tests DESCRIPTION: This command executes the unit tests for the Rust core format. Running these tests verifies the correctness of the Rust implementation.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/community/contributing.md#_snippet_6

LANGUAGE: bash CODE:

cargo test

TITLE: Profile a LanceDB benchmark using flamegraph DESCRIPTION: Generates a flamegraph for a specific benchmark using cargo-flamegraph, aiding in performance analysis. It's recommended to run benchmarks once beforehand to avoid setup time being captured in the profile.

SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_14

LANGUAGE: shell CODE:

flamegraph -F 100 --no-inline -- $(which python) \
    -m pytest python/benchmarks \
    --benchmark-min-time=2 \
    -k test_ivf_pq_index_search

TITLE: Install Flamegraph Tool DESCRIPTION: Installs the flamegraph profiling tool using Cargo, Rust's package manager. This tool is essential for visualizing CPU usage and call stacks as flame graphs for performance analysis.

SOURCE: https://github.com/lancedb/lance/blob/main/wiki/Debug.md#_snippet_3

LANGUAGE: sh CODE:

cargo install flamegraph

TITLE: Set up BigANN Benchmark Environment DESCRIPTION: This snippet provides commands to set up a Python virtual environment, clone the 'big-ann-benchmarks' repository, and install its required dependencies. It prepares the system for running BigANN benchmarks by ensuring all necessary tools and libraries are in place.

SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/bigann/README.md#_snippet_0

LANGUAGE: bash CODE:

python -m venv venv
. ./venv/bin/activate
git clone https://github.com/harsha-simhadri/big-ann-benchmarks.git
cd big-ann-benchmarks
pip install -r requirements_py3.10.txt

TITLE: List Lance Dataset Versions DESCRIPTION: This code shows how to retrieve a list of all available versions for a Lance dataset. This functionality is crucial for understanding the history of changes and for accessing specific historical states of the data.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/quickstart.md#_snippet_7

LANGUAGE: python CODE:

dataset.versions()

TITLE: Install Lance Build Dependencies on Ubuntu DESCRIPTION: This command installs necessary system-level dependencies for building Lance on Ubuntu 22.04, including protobuf, SSL development libraries, and general build tools.

SOURCE: https://github.com/lancedb/lance/blob/main/wiki/How-to-Build.md#_snippet_0

LANGUAGE: bash CODE:

sudo apt install protobuf-compiler libssl-dev build-essential pkg-config gfortran

TITLE: Build Rust core format (release) DESCRIPTION: This command compiles the Rust core format in release mode. The release build is optimized for performance and is suitable for production deployments or benchmarking.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/community/contributing.md#_snippet_5

LANGUAGE: bash CODE:

cargo build -r

TITLE: Debug Python Script with LLDB DESCRIPTION: Demonstrates how to start an LLDB debugging session for a Python script. It involves launching LLDB with the Python interpreter from a virtual environment and then running the target script within the LLDB prompt.

SOURCE: https://github.com/lancedb/lance/blob/main/wiki/Debug.md#_snippet_2

LANGUAGE: sh CODE:

$ lldb ./venv/bin/python
(lldb) r script.py

TITLE: Install Lance Build Dependencies on Mac DESCRIPTION: This command installs the protobuf compiler using Homebrew, a required dependency for building Lance on macOS.

SOURCE: https://github.com/lancedb/lance/blob/main/wiki/How-to-Build.md#_snippet_1

LANGUAGE: bash CODE:

brew install protobuf

TITLE: Configure LLDB Initialization Settings DESCRIPTION: Sets up basic LLDB initialization settings in the ~/.lldbinit file. This includes configuring the number of source code lines to display before and after a stop, and enabling the loading of .lldbinit files from the current working directory.

SOURCE: https://github.com/lancedb/lance/blob/main/wiki/Debug.md#_snippet_0

LANGUAGE: lldb CODE:

# ~/.lldbinit
settings set stop-line-count-before 15
settings set stop-line-count-after 15
settings set target.load-cwd-lldbinit true

TITLE: List all versions of a Lance dataset DESCRIPTION: Retrieves and displays the version history of the Lance dataset, showing all previous and current states of the data.

SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_9

LANGUAGE: Python CODE:

dataset.versions()

TITLE: Load Lance Dataset DESCRIPTION: Initializes a Lance dataset object from a specified URI, preparing it for subsequent operations like nearest neighbor searches.

SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_20

LANGUAGE: python CODE:

sift1m = lance.dataset(uri)

TITLE: Complete Lance Dataset Write and Read Example in Rust DESCRIPTION: This Rust main function provides a complete example demonstrating the usage of write_dataset and read_dataset functions. It sets up the necessary arrow and lance imports, defines a temporary data path, and orchestrates the writing and subsequent reading of a Lance dataset.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/rust/write_read_dataset.md#_snippet_2

LANGUAGE: Rust CODE:

use arrow::array::UInt32Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::{RecordBatch, RecordBatchIterator};
use futures::StreamExt;
use lance::dataset::{WriteMode, WriteParams};
use lance::Dataset;
use std::sync::Arc;

#[tokio::main]
async fn main() {
    let data_path: &str = "./temp_data.lance";

    write_dataset(data_path).await;
    read_dataset(data_path).await;
}

TITLE: Rust: Main Workflow for WikiText to LanceDB Ingestion DESCRIPTION: This comprehensive example demonstrates the full data ingestion pipeline in Rust. It initializes a Tokio runtime, loads a tokenizer, sets up the Hugging Face API to download WikiText Parquet files, processes them into a WikiTextBatchReader, and finally writes the data to a Lance dataset. It also includes verification of the created dataset.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/rust/llm_dataset_creation.md#_snippet_2

LANGUAGE: Rust CODE:

fn main() -> Result<(), Box<dyn Error + Send + Sync>> {
    let rt = tokio::runtime::Runtime::new()?;
    rt.block_on(async {
        // Load tokenizer
        let tokenizer = load_tokenizer("gpt2")?;

        // Set up Hugging Face API
        // Download from https://huggingface.co/datasets/Salesforce/wikitext/tree/main/wikitext-103-raw-v1
        let api = Api::new()?;
        let repo = api.repo(Repo::with_revision(
            "Salesforce/wikitext".into(),
            RepoType::Dataset,
            "main".into(),
        ));

        // Define the parquet files we want to download
        let train_files = vec![
            "wikitext-103-raw-v1/train-00000-of-00002.parquet",
            "wikitext-103-raw-v1/train-00001-of-00002.parquet",
        ];

        let mut parquet_readers = Vec::new();
        for file in &train_files {
            println!("Downloading file: {}", file);
            let file_path = repo.get(file)?;
            let data = std::fs::read(file_path)?;

            // Create a temporary file in the system temp directory and write the downloaded data to it
            let mut temp_file = NamedTempFile::new()?;
            temp_file.write_all(&data)?;

            // Create the parquet reader builder with a larger batch size
            let builder = ParquetRecordBatchReaderBuilder::try_new(temp_file.into_file())?
                .with_batch_size(8192); // Increase batch size for better performance
            parquet_readers.push(builder);
        }

        if parquet_readers.is_empty() {
            println!("No parquet files found to process.");
            return Ok(());
        }

        // Create batch reader
        let num_samples: u64 = 500_000;
        let batch_reader = WikiTextBatchReader::new(parquet_readers, tokenizer, Some(num_samples))?;

        // Save as Lance dataset
        println!("Writing to Lance dataset...");
        let lance_dataset_path = "rust_wikitext_lance_dataset.lance";

        let write_params = WriteParams::default();
        lance::Dataset::write(batch_reader, lance_dataset_path, Some(write_params)).await?;

        // Verify the dataset
        let ds = lance::Dataset::open(lance_dataset_path).await?;
        let scanner = ds.scan();
        let mut stream = scanner.try_into_stream().await?;

        let mut total_rows = 0;
        while let Some(batch_result) = stream.next().await {
            let batch = batch_result?;
            total_rows += batch.num_rows();
        }

        println!(
            "Lance dataset created successfully with {} rows",
            total_rows
        );
        println!("Dataset location: {}", lance_dataset_path);

        Ok(())
    })
}

TITLE: Build and Test Pylance Python Package DESCRIPTION: These commands set up a Python virtual environment, install maturin for Rust-Python binding, build the Pylance package in debug mode, and then run its associated tests.

SOURCE: https://github.com/lancedb/lance/blob/main/wiki/How-to-Build.md#_snippet_3

LANGUAGE: bash CODE:

cd python
python3 -m venv venv
source venv/bin/activate

pip install maturin

# Build debug build
maturin develop --extras tests

# Run pytest
pytest python/tests/

TITLE: Install Lance using Cargo DESCRIPTION: Installs the Lance Rust library as a command-line tool using the Cargo package manager.

SOURCE: https://github.com/lancedb/lance/blob/main/rust/lance/README.md#_snippet_0

LANGUAGE: shell CODE:

cargo install lance

TITLE: Append Data to Lance Dataset DESCRIPTION: This example illustrates how to append new data to an existing Lance dataset. It creates a new Pandas DataFrame, converts it to a PyArrow Table, and then uses lance.write_dataset with mode="append" to add the new rows, creating a new version of the dataset.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/quickstart.md#_snippet_5

LANGUAGE: python CODE:

df = pd.DataFrame({"a": [10]})
tbl = pa.Table.from_pandas(df)
dataset = lance.write_dataset(tbl, "/tmp/test.lance", mode="append")

dataset.to_table().to_pandas()

TITLE: Access Lance Dataset by Tag DESCRIPTION: This code demonstrates how to load a Lance dataset using a previously defined tag instead of a numerical version. This allows for more intuitive access to specific, meaningful versions of the data, improving readability and maintainability.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/quickstart.md#_snippet_10

LANGUAGE: python CODE:

lance.dataset('/tmp/test.lance', version="stable").to_table().to_pandas()

TITLE: Build pylance in release mode for benchmarks DESCRIPTION: Builds the pylance module in release mode with debug symbols, enabling benchmark execution and profiling. It includes benchmark-specific extras and features for data generation.

SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_10

LANGUAGE: shell CODE:

maturin develop --profile release-with-debug --extras benchmarks --features datagen

TITLE: Query Lance Dataset with Simple SQL in Rust DataFusion DESCRIPTION: This Rust example demonstrates how to register a Lance dataset as a table in DataFusion using LanceTableProvider and execute a simple SQL SELECT query to retrieve the first 10 rows. It shows the basic setup for integrating Lance with DataFusion.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/integrations/datafusion.md#_snippet_0

LANGUAGE: rust CODE:

use std::sync::Arc;

use datafusion::prelude::SessionContext;
use crate::datafusion::LanceTableProvider;

let ctx = SessionContext::new();

ctx.register_table("dataset",
    Arc::new(LanceTableProvider::new(
    Arc::new(dataset.clone()),
    /* with_row_id */ false,
    /* with_row_addr */ false,
    )))?;

let df = ctx.sql("SELECT * FROM dataset LIMIT 10").await?;
let result = df.collect().await?;

TITLE: Install Lance Preview Release DESCRIPTION: Installs a preview release of the pylance library, which includes the latest features and bug fixes. Preview releases are published more frequently and offer early access to new developments.

SOURCE: https://github.com/lancedb/lance/blob/main/README.md#_snippet_1

LANGUAGE: shell CODE:

pip install --pre --extra-index-url https://pypi.fury.io/lancedb/ pylance

TITLE: Install LanceDB and Python Dependencies DESCRIPTION: Installs specific versions of LanceDB, pandas, and duckdb required for running the benchmarks. This ensures compatibility and reproducibility of the benchmark results.

SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/sift/README.md#_snippet_0

LANGUAGE: sh CODE:

pip install lancedb==0.3.6
pip install pandas~=2.1.0
pip install duckdb~=0.9.0

TITLE: Prepare HD-Vila Dataset with Python venv DESCRIPTION: This snippet outlines the steps to set up a Python virtual environment, activate it, and install necessary dependencies from requirements.txt for the HD-Vila dataset. It ensures a clean and isolated environment for project dependencies.

SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/hd-vila/README.md#_snippet_0

LANGUAGE: sh CODE:

python3 -m venv venv
. ./venv/bin/activate
pip install -r requirements.txt

TITLE: Run Python unit and integration tests DESCRIPTION: These commands execute the unit tests and integration tests for the Python components of the Lance project. Running these tests is crucial to ensure code changes do not introduce regressions.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/community/contributing.md#_snippet_2

LANGUAGE: bash CODE:

make test
make integtest

TITLE: Import necessary libraries for LanceDB operations DESCRIPTION: This snippet imports shutil, lance, numpy, pandas, and pyarrow for file system operations, LanceDB interactions, numerical computing, data manipulation, and Arrow table handling, respectively.

SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_0

LANGUAGE: Python CODE:

import shutil

import lance
import numpy as np
import pandas as pd
import pyarrow as pa

TITLE: Create a Pandas DataFrame for LanceDB DESCRIPTION: Initializes a simple Pandas DataFrame with a single column 'a' and a value of 5. This DataFrame will be used as input for creating a Lance dataset.

SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_1

LANGUAGE: Python CODE:

df = pd.DataFrame({"a": [5]})
df

TITLE: Sample Query Vectors from Lance Dataset DESCRIPTION: This code demonstrates how to sample a subset of vectors from the loaded Lance dataset to be used as query vectors for nearest neighbor search. It leverages DuckDB for efficient sampling of the vector column.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/quickstart.md#_snippet_14

LANGUAGE: python CODE:

import duckdb
# Make sure DuckDB v0.7+ is installed
samples = duckdb.query("SELECT vector FROM sift1m USING SAMPLE 100").to_df().vector

TITLE: Execute Tunable Nearest Neighbor Search DESCRIPTION: Demonstrates how to perform a nearest neighbor search with tunable parameters like 'nprobes' and 'refine_factor' to balance latency and recall. The result is converted to a Pandas DataFrame.

SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_22

LANGUAGE: python CODE:

%%time

sift1m.to_table(
    nearest={
        "column": "vector",
        "q": samples[0],
        "k": 10,
        "nprobes": 10,
        "refine_factor": 5
    }
).to_pandas()

TITLE: Load SIFT vector dataset from Lance file DESCRIPTION: Defines the URI for the Lance vector dataset and then loads it using lance.dataset(), making the SIFT 1M vector data accessible for further operations.

SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_16

LANGUAGE: Python CODE:

uri = "vec_data.lance"
sift1m = lance.dataset(uri)

TITLE: Import LanceDB Libraries DESCRIPTION: This snippet imports the necessary Python libraries for working with LanceDB, including shutil for file operations, lance for core LanceDB functionalities, numpy for numerical operations, pandas for data manipulation, and pyarrow for data interchange.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/quickstart.md#_snippet_0

LANGUAGE: python CODE:

import shutil
import lance
import numpy as np
import pandas as pd
import pyarrow as pa

TITLE: Run all LanceDB benchmarks (including slow tests) DESCRIPTION: Executes all performance benchmarks, including those marked as 'slow', which may take a longer time to complete.

SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_12

LANGUAGE: shell CODE:

pytest python/benchmarks

TITLE: Prepare Python Virtual Environment for Benchmarks DESCRIPTION: Creates and activates a Python virtual environment, then installs required packages from requirements.txt. This isolates project dependencies and ensures a clean execution environment for the benchmark scripts.

SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/sift/README.md#_snippet_2

LANGUAGE: sh CODE:

python3 -m venv venv
. ./venv/bin/activate
pip install -r requirements.txt

TITLE: Create Tags for Lance Dataset Versions DESCRIPTION: This snippet illustrates how to create human-readable tags for specific versions of a Lance dataset. Tags provide a convenient way to mark and reference important dataset states, such as 'stable' or 'nightly' builds, simplifying version management.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/quickstart.md#_snippet_9

LANGUAGE: python CODE:

dataset.tags.create("stable", 2)
dataset.tags.create("nightly", 3)
dataset.tags.list()

TITLE: Run LanceDB code formatters DESCRIPTION: Applies code formatting rules to the entire project. Specific commands like make format-python or cargo fmt can be used for language-specific formatting.

SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_4

LANGUAGE: shell CODE:

make format

TITLE: Build and Search HNSW Index for Vector Similarity in Rust DESCRIPTION: This Rust code provides a complete example for vector similarity search. It defines a ground_truth function for L2 distance calculation, create_test_vector_dataset to generate synthetic fixed-size list vectors, and a main function that orchestrates the process. The main function generates or loads a dataset, builds an HNSW index using lance_index::vector::hnsw, and then performs vector searches, measuring construction and search times, and calculating recall against ground truth.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/rust/hnsw.md#_snippet_0

LANGUAGE: Rust CODE:

use std::collections::HashSet;
use std::sync::Arc;

use arrow::array::{types::Float32Type, Array, FixedSizeListArray};
use arrow::array::{AsArray, FixedSizeListBuilder, Float32Builder};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use arrow::record_batch::RecordBatchIterator;
use arrow_select::concat::concat;
use futures::stream::StreamExt;
use lance::Dataset;
use lance_index::vector::v3::subindex::IvfSubIndex;
use lance_index::vector::{
    flat::storage::FlatFloatStorage,
    hnsw::{builder::HnswBuildParams, HNSW},
};
use lance_linalg::distance::DistanceType;

fn ground_truth(fsl: &FixedSizeListArray, query: &[f32], k: usize) -> HashSet<u32> {
    let mut dists = vec![];
    for i in 0..fsl.len() {
        let dist = lance_linalg::distance::l2_distance(
            query,
            fsl.value(i).as_primitive::<Float32Type>().values(),
        );
        dists.push((dist, i as u32));
    }
    dists.sort_by(|a, b| a.0.partial_cmp(&b.0).unwrap());
    dists.truncate(k);
    dists.into_iter().map(|(_, i)| i).collect()
}

pub async fn create_test_vector_dataset(output: &str, num_rows: usize, dim: i32) {
    let schema = Arc::new(Schema::new(vec![Field::new(
        "vector",
        DataType::FixedSizeList(Arc::new(Field::new("item", DataType::Float32, true)), dim),
        false,
    )]));

    let mut batches = Vec::new();

    // Create a few batches
    for _ in 0..2 {
        let v_builder = Float32Builder::new();
        let mut list_builder = FixedSizeListBuilder::new(v_builder, dim);

        for _ in 0..num_rows {
            for _ in 0..dim {
                list_builder.values().append_value(rand::random::<f32>());
            }
            list_builder.append(true);
        }
        let array = Arc::new(list_builder.finish());
        let batch = RecordBatch::try_new(schema.clone(), vec![array]).unwrap();
        batches.push(batch);
    }
    let batch_reader = RecordBatchIterator::new(batches.into_iter().map(Ok), schema.clone());
    println!("Writing dataset to {}", output);
    Dataset::write(batch_reader, output, None).await.unwrap();
}

#[tokio::main]
async fn main() {
    let uri: Option<String> = None; // None means generate test data
    let column = "vector";
    let ef = 100;
    let max_edges = 30;
    let max_level = 7;

    // 1. Generate a synthetic test data of specified dimensions
    let dataset = if uri.is_none() {
        println!("No uri is provided, generating test dataset...");
        let output = "test_vectors.lance";
        create_test_vector_dataset(output, 1000, 64).await;
        Dataset::open(output).await.expect("Failed to open dataset")
    } else {
        Dataset::open(uri.as_ref().unwrap())
            .await
            .expect("Failed to open dataset")
    };

    println!("Dataset schema: {:#?}", dataset.schema());
    let batches = dataset
        .scan()
        .project(&[column])
        .unwrap()
        .try_into_stream()
        .await
        .unwrap()
        .then(|batch| async move { batch.unwrap().column_by_name(column).unwrap().clone() })
        .collect::<Vec<_>>()
        .await;
    let arrs = batches.iter().map(|b| b.as_ref()).collect::<Vec<_>>();
    let fsl = concat(&arrs).unwrap().as_fixed_size_list().clone();
    println!("Loaded {:?} batches", fsl.len());

    let vector_store = Arc::new(FlatFloatStorage::new(fsl.clone(), DistanceType::L2));

    let q = fsl.value(0);
    let k = 10;
    let gt = ground_truth(&fsl, q.as_primitive::<Float32Type>().values(), k);

    for ef_construction in [15, 30, 50] {
        let now = std::time::Instant::now();
        // 2. Build a hierarchical graph structure for efficient vector search using Lance API
        let hnsw = HNSW::index_vectors(
            vector_store.as_ref(),
            HnswBuildParams::default()
                .max_level(max_level)
                .num_edges(max_edges)
                .ef_construction(ef_construction),
        )
        .unwrap();
        let construct_time = now.elapsed().as_secs_f32();
        let now = std::time::Instant::now();
        // 3. Perform vector search with different parameters and compute the ground truth using L2 distance search
        let results: HashSet<u32> = hnsw
            .search_basic(q.clone(), k, ef, None, vector_store.as_ref())
            .unwrap()
            .iter()
            .map(|node| node.id)
            .collect();
        let search_time = now.elapsed().as_micros();
        println!(
            "level={}, ef_construct={}, ef={} recall={}: construct={:.3}s search={:.3} us",
            max_level,
            ef_construction,
            ef,
            results.intersection(&gt).count() as f32 / k as f32,
            construct_time,
            search_time
        );
    }
}

TITLE: LanceDB Nearest Neighbor Search Parameters DESCRIPTION: This section details the parameters available for tuning nearest neighbor searches in LanceDB, including 'q', 'k', 'nprobes', and 'refine_factor'.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/quickstart.md#_snippet_19

LANGUAGE: APIDOC CODE:

"nearest": {
  "column": "string", // Name of the vector column
  "q": "vector",      // The query vector for nearest neighbor search
  "k": "integer",     // The number of nearest neighbors to return
  "nprobes": "integer", // How many IVF partitions to search
  "refine_factor": "integer" // Controls re-ranking: if k=10 and refine_factor=5, retrieves 50 nearest neighbors by ANN and re-sorts using actual distances, then returns top 10. Improves recall without sacrificing performance too much.
}
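
Taken together, these parameters are supplied through the nearest argument of to_table. The following recap mirrors the tunable search example earlier in this document (the sift1m dataset and samples query vectors are assumed from the quickstart snippets):

LANGUAGE: python CODE:

# Tunable ANN search: nprobes controls how many IVF partitions are scanned, and
# refine_factor re-ranks k * refine_factor candidates using exact distances.
sift1m.to_table(
    nearest={
        "column": "vector",
        "q": samples[0],
        "k": 10,
        "nprobes": 10,
        "refine_factor": 5,
    }
).to_pandas()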

TITLE: Install Lance Python Library DESCRIPTION: Installs the stable release of the pylance library using pip, providing access to Lance's functionalities in Python.

SOURCE: https://github.com/lancedb/lance/blob/main/README.md#_snippet_0

LANGUAGE: shell CODE:

pip install pylance

TITLE: Convert Parquet Dataset to Lance DESCRIPTION: This snippet demonstrates the straightforward conversion of an existing PyArrow Parquet dataset into a Lance dataset. It uses lance.write_dataset to perform the conversion and then verifies the content of the newly created Lance dataset.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/quickstart.md#_snippet_4

LANGUAGE: python CODE:

dataset = lance.write_dataset(parquet, "/tmp/test.lance")

# Make sure it's the same
dataset.to_table().to_pandas()

TITLE: Convert Parquet dataset to Lance dataset DESCRIPTION: Converts an existing PyArrow Parquet dataset directly into a Lance dataset in a single line of code, demonstrating seamless integration.

SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_4

LANGUAGE: Python CODE:

dataset = lance.write_dataset(parquet, "/tmp/test.lance")

TITLE: Compare LanceDB benchmarks against previous version DESCRIPTION: Provides a sequence of commands to compare the performance of the current version against the main branch. This involves saving a baseline from main and then comparing the current branch's performance against it.

SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_15

LANGUAGE: shell CODE:

CURRENT_BRANCH=$(git branch --show-current)

LANGUAGE: shell CODE:

git checkout main

LANGUAGE: shell CODE:

maturin develop --profile release-with-debug  --features datagen

LANGUAGE: shell CODE:

pytest --benchmark-save=baseline python/benchmarks -m "not slow"

LANGUAGE: shell CODE:

COMPARE_ID=$(ls .benchmarks/*/ | tail -1 | cut -c1-4)

LANGUAGE: shell CODE:

git checkout $CURRENT_BRANCH

LANGUAGE: shell CODE:

maturin develop --profile release-with-debug  --features datagen

LANGUAGE: shell CODE:

pytest --benchmark-compare=$COMPARE_ID python/benchmarks -m "not slow"

TITLE: Build Rust core format (debug) DESCRIPTION: This command compiles the Rust core format in debug mode. The debug build includes debugging information and is suitable for development and testing, though it is not optimized for performance.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/community/contributing.md#_snippet_4

LANGUAGE: bash CODE:

cargo build

TITLE: Download and extract SIFT 1M dataset for vector operations DESCRIPTION: Removes any existing SIFT files and then downloads the sift.tar.gz archive from the specified FTP server. Finally, it extracts the contents of the tarball, preparing the SIFT 1M dataset for vector processing.

SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_14

LANGUAGE: Bash CODE:

!rm -rf sift* vec_data.lance
!wget ftp://ftp.irisa.fr/local/texmex/corpus/sift.tar.gz
!tar -xzf sift.tar.gz

TITLE: Format and lint Rust code DESCRIPTION: These commands are used to automatically format Rust code according to community standards (cargo fmt) and to perform static analysis for potential issues (cargo clippy). This ensures code quality and consistency.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/community/contributing.md#_snippet_3

LANGUAGE: bash CODE:

cargo fmt --all
cargo clippy --all-features --tests --benches

TITLE: Run a specific LanceDB benchmark by name DESCRIPTION: Filters and runs a particular benchmark using pytest's -k flag, allowing substring matching for the benchmark name.

SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_13

LANGUAGE: shell CODE:

pytest python/benchmarks -k test_ivf_pq_index_search

TITLE: Run LanceDB code linters DESCRIPTION: Executes code linters to check for style violations and potential issues. Language-specific linting can be performed with make lint-python or make lint-rust.

SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_5

LANGUAGE: shell CODE:

make lint

TITLE: Verify converted Lance dataset content DESCRIPTION: Reads the newly created Lance dataset and converts it back to a Pandas DataFrame to confirm that the data was correctly written and matches the original content.

SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_5

LANGUAGE: Python CODE:

# make sure it's the same
dataset.to_table().to_pandas()

TITLE: Prepare Dbpedia-entities-openai Dataset DESCRIPTION: This snippet provides shell commands to set up a Python virtual environment, install necessary dependencies from 'requirements.txt', and then generate the Dbpedia-entities-openai dataset in Lance format using 'datagen.py'. It requires Python 3.10 or newer.

SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/dbpedia-openai/README.md#_snippet_0

LANGUAGE: sh CODE:

# Python 3.10+
python3 -m venv venv
. ./venv/bin/activate

# install dependencies
pip install -r requirements.txt

# Generate dataset in lance format.
./datagen.py

TITLE: Clean LanceDB build artifacts DESCRIPTION: Removes all generated build artifacts and temporary files from the project directory, useful for a clean rebuild.

SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_9

LANGUAGE: shell CODE:

make clean

TITLE: Query Nearest Neighbors with Specific Features DESCRIPTION: Performs a nearest neighbor search while simultaneously retrieving specific feature columns ('revenue') alongside the vector results. This demonstrates fetching combined data in a single call.

SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_25

LANGUAGE: python CODE:

sift1m.to_table(columns=["revenue"], nearest={"column": "vector", "q": samples[0], "k": 10}).to_pandas()

TITLE: Create named tags for Lance dataset versions DESCRIPTION: Assigns human-readable tags ('stable', 'nightly') to specific versions (2 and 3) of the Lance dataset. Then, it lists all defined tags, providing aliases for version numbers.

SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_12

LANGUAGE: Python CODE:

dataset.tags.create("stable", 2)
dataset.tags.create("nightly", 3)
dataset.tags.list()

TITLE: Access Lance dataset using a named tag DESCRIPTION: Loads the Lance dataset by referencing a previously created tag ('stable') instead of a version number, and converts it to a Pandas DataFrame, showcasing tag-based version access.

SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_13

LANGUAGE: Python CODE:

lance.dataset('/tmp/test.lance', version="stable").to_table().to_pandas()

TITLE: Run LanceDB benchmarks (excluding slow tests) DESCRIPTION: Executes the performance benchmarks located in python/benchmarks, skipping tests explicitly marked as 'slow'. These benchmarks are designed for quick iteration and regression catching.

SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_11

LANGUAGE: shell CODE:

pytest python/benchmarks -m "not slow"

TITLE: Verify overwritten Lance dataset content DESCRIPTION: Reads the current state of the Lance dataset and converts it to a Pandas DataFrame to confirm that the overwrite operation was successful and the dataset now contains the new data.

SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_8

LANGUAGE: Python CODE:

dataset.to_table().to_pandas()

TITLE: Rust: Load Tokenizer from Hugging Face Hub DESCRIPTION: This function provides a utility to load a tokenizer from the Hugging Face Hub. It takes a model name, creates an API client, retrieves the tokenizer file from the specified repository, and constructs a Tokenizer object from it. This is a common pattern for integrating Hugging Face models.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/rust/llm_dataset_creation.md#_snippet_3

LANGUAGE: Rust CODE:

fn load_tokenizer(model_name: &str) -> Result<Tokenizer, Box<dyn Error + Send + Sync>> {
    let api = Api::new()?;
    let repo = api.repo(Repo::with_revision(
        model_name.into(),
        RepoType::Model,
        "main".into(),
    ));

    let tokenizer_path = repo.get("tokenizer.json")?;
    let tokenizer = Tokenizer::from_file(tokenizer_path)?;

    Ok(tokenizer)
}

TITLE: Sample query vectors from Lance dataset using DuckDB DESCRIPTION: Imports duckdb and queries the sift1m Lance dataset to sample 100 vectors from the 'vector' column. The sampled vectors are converted to a Pandas DataFrame column, to be used as query inputs for KNN search.

SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_17

LANGUAGE: Python CODE:

import duckdb
# if this segfaults make sure duckdb v0.7+ is installed
samples = duckdb.query("SELECT vector FROM sift1m USING SAMPLE 100").to_df().vector
samples

TITLE: Prepare Parquet file for conversion to Lance DESCRIPTION: Cleans up previous test files. Converts the Pandas DataFrame df to a PyArrow Table, then writes it to a Parquet file. Finally, it reads the Parquet file back into a PyArrow dataset and converts it to a Pandas DataFrame for display.

SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_3

LANGUAGE: Python CODE:

shutil.rmtree("/tmp/test.parquet", ignore_errors=True)
shutil.rmtree("/tmp/test.lance", ignore_errors=True)

tbl = pa.Table.from_pandas(df)
pa.dataset.write_dataset(tbl, "/tmp/test.parquet", format='parquet')

parquet = pa.dataset.dataset("/tmp/test.parquet")
parquet.to_table().to_pandas()

TITLE: Access a specific historical version of Lance dataset (Version 2) DESCRIPTION: Loads another specific historical version (version 2) of the Lance dataset and converts it to a Pandas DataFrame, further illustrating the versioning capabilities.

SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_11

LANGUAGE: Python CODE:

lance.dataset('/tmp/test.lance', version=2).to_table().to_pandas()

TITLE: Lance I/O Trace Events DESCRIPTION: Describes events emitted during significant I/O operations, particularly those related to indices, useful for debugging cache utilization.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/performance.md#_snippet_1

LANGUAGE: APIDOC CODE:

Event: lance::io_events
  Parameter: type
    Description: The type of I/O operation (open_scalar_index, open_vector_index, load_vector_part, load_scalar_part)

TITLE: Import libraries and define dataset paths for Flickr8k DESCRIPTION: This snippet imports essential Python libraries such as os, cv2, lance, pyarrow, matplotlib, and tqdm. It also defines the file paths for the Flickr8k captions file and the image dataset folder, which are crucial for subsequent data processing. It assumes the dataset and required libraries like pyarrow, pylance, opencv, and tqdm are already installed and present.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/flickr8k_dataset_creation.md#_snippet_0

LANGUAGE: python CODE:

import os
import cv2
import random

import lance
import pyarrow as pa

import matplotlib.pyplot as plt

from tqdm.auto import tqdm

captions = "Flickr8k.token.txt"
image_folder = "Flicker8k_Dataset/"

TITLE: Build IVF_PQ index on Lance vector dataset DESCRIPTION: Builds an IVF_PQ (Inverted File Index with Product Quantization) index on the 'vector' column of the sift1m dataset. It configures the index with 256 partitions and 16 sub-vectors for efficient approximate nearest neighbor search, significantly speeding up vector queries.

SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_19

LANGUAGE: Python CODE:

%%time

sift1m.create_index(
    "vector",
    index_type="IVF_PQ", # IVF_PQ, IVF_HNSW_PQ and IVF_HNSW_SQ are supported
    num_partitions=256,  # IVF
    num_sub_vectors=16  # PQ
)

TITLE: Python Environment Setup for LanceDB Testing DESCRIPTION: Sets up the Python environment by ensuring the project's root directory is added to sys.path and preventing bytecode generation. This is crucial for module imports within the project structure.

SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/full_report/report.ipynb#_snippet_0

LANGUAGE: python CODE:

import sys
sys.dont_write_bytecode = True

import os

module_path = os.path.abspath(os.path.join('.'))
if module_path not in sys.path:
    sys.path.append(module_path)

TITLE: Add Metadata Columns to Lance Table DESCRIPTION: Appends new feature columns, 'item_id' and 'revenue', to an existing Lance table. This illustrates how to enrich dataset entries with additional metadata before writing them back.

SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_23

LANGUAGE: python CODE:

tbl = sift1m.to_table()
tbl = tbl.append_column("item_id", pa.array(range(len(tbl))))
tbl = tbl.append_column("revenue", pa.array((np.random.randn(len(tbl))+5)*1000))
tbl.to_pandas()
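
To persist the new columns, the enriched table would then be written back, creating a new dataset version. A minimal sketch, assuming the same uri and the overwrite mode shown in earlier snippets (the notebook's exact write step is not reproduced here):

LANGUAGE: python CODE:

# Write the enriched table back to the same location, producing a new version.
sift1m = lance.write_dataset(tbl, uri, mode="overwrite")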

TITLE: Build MacOS x86_64 Wheels DESCRIPTION: This command builds release-mode wheels specifically for x86_64 MacOS. It uses maturin to compile the project for the x86_64-apple-darwin target, storing the resulting wheels in the 'wheels' directory.

SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_26

LANGUAGE: Shell CODE:

maturin build --release \
    --target x86_64-apple-darwin \
    --out wheels

TITLE: Overwrite Lance dataset to create new version DESCRIPTION: Creates a new Pandas DataFrame with different data. Converts it to a PyArrow Table and overwrites the existing Lance dataset at /tmp/test.lance using mode="overwrite", effectively creating a new version of the dataset.

SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_7

LANGUAGE: Python CODE:

df = pd.DataFrame({"a": [50, 100]})
tbl = pa.Table.from_pandas(df)
dataset = lance.write_dataset(tbl, "/tmp/test.lance", mode="overwrite")

TITLE: Run Dbpedia-entities-openai Benchmark DESCRIPTION: This command executes the 'benchmarks.py' script to run top-k vector queries. The script tests various combinations of IVF and PQ values, as well as 'refine_factor', to evaluate performance. The example specifies a top-k value of 20.

SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/dbpedia-openai/README.md#_snippet_1

LANGUAGE: sh CODE:

./benchmarks.py -k 20

TITLE: Build and Test Lance Rust Package DESCRIPTION: These commands clone the Lance repository, navigate to the Rust directory, and then build, test, and benchmark the core Rust components of Lance.

SOURCE: https://github.com/lancedb/lance/blob/main/wiki/How-to-Build.md#_snippet_2

LANGUAGE: bash CODE:

git clone https://github.com/lancedb/lance.git

# Build rust package
cd rust
cargo build

# Run test
cargo test

# Run benchmarks
cargo bench

TITLE: Query Lance Dataset with Simple SQL in Python DataFusion DESCRIPTION: This Python example shows how to integrate Lance datasets with DataFusion using FFILanceTableProvider from pylance. It demonstrates registering a Lance dataset as a table and executing a basic SQL SELECT query to fetch the first 10 rows, highlighting the Python FFI integration.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/integrations/datafusion.md#_snippet_2

LANGUAGE: python CODE:

from datafusion import SessionContext # pip install datafusion
from lance import FFILanceTableProvider

ctx = SessionContext()

table1 = FFILanceTableProvider(
    my_lance_dataset, with_row_id=True, with_row_addr=True
)
ctx.register_table_provider("table1", table1)
ctx.table("table1")
ctx.sql("SELECT * FROM table1 LIMIT 10")

TITLE: Open a LanceDB Dataset DESCRIPTION: Provides a basic example of how to open an existing Lance dataset using the lance.dataset function. This function can be used to access datasets stored locally or in cloud storage like S3.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/read_and_write.md#_snippet_11

LANGUAGE: python CODE:

import lance
ds = lance.dataset("s3://bucket/path/imagenet.lance")

TITLE: Build LanceDB in development mode DESCRIPTION: Builds the Rust native module in place using maturin. This command needs to be re-run whenever Rust code changes, but is not required for Python code modifications.

SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_0

LANGUAGE: shell CODE:

maturin develop

TITLE: Lance File Audit Trace Events DESCRIPTION: Details the events emitted when significant files are created or deleted in Lance, including the mode of I/O operation and the type of file affected.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/performance.md#_snippet_0

LANGUAGE: APIDOC CODE:

Event: lance::file_audit
  Parameter: mode
    Description: The mode of I/O operation (create, delete, delete_unverified)
  Parameter: type
    Description: The type of file affected (manifest, data file, index file, deletion file)

TITLE: Download Lindera Language Model DESCRIPTION: Command-line instruction to download a specific Lindera language model (e.g., ipadic, ko-dic, unidic) for LanceDB. Note that lindera-cli must be installed beforehand as Lindera models require compilation.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/tokenizer.md#_snippet_4

LANGUAGE: bash CODE:

python -m lance.download lindera -l [ipadic|ko-dic|unidic]

TITLE: Access a specific historical version of Lance dataset (Version 1) DESCRIPTION: Loads a specific historical version (version 1) of the Lance dataset and converts it to a Pandas DataFrame, demonstrating the ability to revert to or inspect past states of the data.

SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_10

LANGUAGE: Python CODE:

lance.dataset('/tmp/test.lance', version=1).to_table().to_pandas()

TITLE: Decorate Rust Unit Test for Tracing DESCRIPTION: To enable tracing for a Rust unit test, decorate it with the #[lance_test_macros::test] attribute. This macro wraps any existing test attributes, allowing tracing information to be collected during test execution.

SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_16

LANGUAGE: Rust CODE:

#[lance_test_macros::test(tokio::test)]
async fn test() {
    ...
}

TITLE: Add Rust Toolchain Targets for Cross-Compilation DESCRIPTION: To build manylinux wheels for different Linux architectures, you must first add the corresponding Rust toolchain targets. These commands add the x86_64 and aarch64 GNU targets, enabling cross-compilation.

SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_22

LANGUAGE: Shell CODE:

rustup target add x86_64-unknown-linux-gnu
rustup target add aarch64-unknown-linux-gnu

TITLE: Query Vectors and Metadata Together in LanceDB DESCRIPTION: This code demonstrates how to perform a nearest neighbor search in LanceDB while simultaneously retrieving specified metadata columns. It allows users to fetch both vector embeddings and associated feature data ('item_id', 'revenue') in a single query, streamlining data retrieval for applications requiring both.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/quickstart.md#_snippet_21

LANGUAGE: python CODE:

result = sift1m.to_table(
    columns=["item_id", "revenue"],
    nearest={"column": "vector", "q": samples[0], "k": 10}
)
print(result.to_pandas())

TITLE: Build MacOS ARM64 Wheels DESCRIPTION: This command builds release-mode wheels specifically for ARM64 (aarch64) MacOS. It uses maturin to compile the project for the aarch64-apple-darwin target, storing the resulting wheels in the 'wheels' directory.

SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_25

LANGUAGE: Shell CODE:

maturin build --release \
    --target aarch64-apple-darwin \
    --out wheels

TITLE: Rust: WikiTextBatchReader Next Batch Logic DESCRIPTION: This snippet shows the core logic for the next method of the WikiTextBatchReader. It attempts to build and retrieve the next Parquet reader from a list of available readers. If a reader is successfully built, it's used; otherwise, it handles errors or indicates that no more readers are available.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/rust/llm_dataset_creation.md#_snippet_1

LANGUAGE: Rust CODE:

                if let Some(builder) = self.parquet_readers[self.current_reader_idx].take() {
                    match builder.build() {
                        Ok(reader) => {
                            self.current_reader = Some(Box::new(reader));
                            self.current_reader_idx += 1;
                            continue;
                        }
                        Err(e) => {
                            return Some(Err(arrow::error::ArrowError::ExternalError(Box::new(e))))
                        }
                    }
                }
            }

            // No more readers available
            return None;
        }

TITLE: Download and Extract SIFT1M Dataset DESCRIPTION: Downloads the SIFT1M dataset, a common benchmark for vector search, and extracts its contents. This is a prerequisite step for running the subsequent vector search examples.

SOURCE: https://github.com/lancedb/lance/blob/main/README.md#_snippet_6

LANGUAGE: shell CODE:

wget ftp://ftp.irisa.fr/local/texmex/corpus/sift.tar.gz
tar -xzf sift.tar.gz

TITLE: Measure Nearest Neighbor Query Performance DESCRIPTION: Performs multiple nearest neighbor queries on the Lance dataset using a list of sample vectors and measures the average query time. It also prints the resulting table for the last query.

SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_21

LANGUAGE: python CODE:

import time

tot = 0
for q in samples:
    start = time.time()
    tbl = sift1m.to_table(nearest={"column": "vector", "q": q, "k": 10})
    end = time.time()
    tot += (end - start)

print(f"Avg(sec): {tot / len(samples)}")
print(tbl.to_pandas())

TITLE: Run Rust Unit Test with Tracing Verbosity DESCRIPTION: Execute a Rust unit test with tracing enabled by setting the LANCE_TESTING environment variable to a desired verbosity level (e.g., 'debug', 'info'). This command will generate a JSON trace file in your working directory, which can be viewed in Chrome or Perfetto.

SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_17

LANGUAGE: Bash CODE:

LANCE_TESTING=debug cargo test dataset::tests::test_create_dataset

TITLE: Build Linux x86_64 Manylinux Wheels DESCRIPTION: This command builds release-mode manylinux wheels for x86_64 Linux. It utilizes maturin with zig for cross-compilation, targeting manylinux2014 compatibility, and outputs the generated wheels to the 'wheels' directory.

SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_23

LANGUAGE: Shell CODE:

maturin build --release --zig \
    --target x86_64-unknown-linux-gnu \
    --compatibility manylinux2014 \
    --out wheels

TITLE: Append new rows to an existing Lance dataset DESCRIPTION: Creates a new Pandas DataFrame with a single row. Converts it to a PyArrow Table and appends it to the existing Lance dataset at /tmp/test.lance using mode="append".

SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_6

LANGUAGE: Python CODE:

df = pd.DataFrame({"a": [10]})
tbl = pa.Table.from_pandas(df)
dataset = lance.write_dataset(tbl, "/tmp/test.lance", mode="append")

dataset.to_table().to_pandas()

TITLE: Overwrite Lance Dataset DESCRIPTION: This snippet demonstrates how to completely overwrite the data in a Lance dataset, effectively creating a new version. A new Pandas DataFrame is prepared and written to the dataset using mode="overwrite", replacing the previous content while preserving the old version for historical access.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/quickstart.md#_snippet_6

LANGUAGE: python CODE:

df = pd.DataFrame({"a": [50, 100]})
tbl = pa.Table.from_pandas(df)
dataset = lance.write_dataset(tbl, "/tmp/test.lance", mode="overwrite")

dataset.to_table().to_pandas()

TITLE: Lance Execution Trace Events DESCRIPTION: Outlines events emitted when an execution plan is run, providing insights into query performance, including output rows, I/O operations, bytes read, and index statistics.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/performance.md#_snippet_2

LANGUAGE: APIDOC CODE:

Event: lance::execution
  Parameter: type
    Description: The type of execution event (plan_run is the only type today)
  Parameter: output_rows
    Description: The number of rows in the output of the plan
  Parameter: iops
    Description: The number of I/O operations performed by the plan
  Parameter: bytes_read
    Description: The number of bytes read by the plan
  Parameter: indices_loaded
    Description: The number of indices loaded by the plan
  Parameter: parts_loaded
    Description: The number of index partitions loaded by the plan
  Parameter: index_comparisons
    Description: The number of comparisons performed inside the various indices
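
To surface these events from Python, tracing can be routed to a Chrome trace file with lance.tracing.trace_to_chrome (described later in this document). A minimal sketch, assuming an existing dataset at the illustrative path ./example.lance:

from lance.tracing import trace_to_chrome
import lance

trace_to_chrome(level="debug")  # capture trace events, including lance::execution

ds = lance.dataset("./example.lance")  # assumed dataset path
# Any query executes a plan; the resulting plan_run event records
# output_rows, iops, bytes_read, indices_loaded, parts_loaded, index_comparisons.
ds.to_table(limit=10)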

TITLE: Example Console Output of CLIP Model Training Progress DESCRIPTION: This snippet shows a typical console output during the training of the CLIP model. It displays the epoch number, the progress bar indicating batch processing, and the reported loss value for each epoch, demonstrating the training's progression and convergence.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/clip_training.md#_snippet_10

LANGUAGE: console CODE:

==================== Epoch: 1 / 2 ====================
loss: 2.0799: 100%|██████████| 253/253 [02:14<00:00,  1.88it/s]

==================== Epoch: 2 / 2 ====================
loss: 1.3064: 100%|██████████| 253/253 [02:10<00:00,  1.94it/s]

TITLE: Convert SIFT Data to Lance Vector Dataset DESCRIPTION: This code demonstrates how to convert the raw SIFT 1M dataset, stored in a binary format, into a Lance vector dataset. It involves reading the binary data, reshaping it into a NumPy array, and then using vec_to_table and lance.write_dataset to store it efficiently for vector search.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/quickstart.md#_snippet_12

LANGUAGE: python CODE:

from lance.vector import vec_to_table
import struct

uri = "vec_data.lance"

with open("sift/sift_base.fvecs", mode="rb") as fobj:
    buf = fobj.read()
    data = np.array(struct.unpack("<128000000f", buf[4 : 4 + 4 * 1000000 * 128])).reshape((1000000, 128))
    dd = dict(zip(range(1000000), data))

table = vec_to_table(dd)
lance.write_dataset(table, uri, max_rows_per_group=8192, max_rows_per_file=1024*1024)

TITLE: Perform KNN Search on Lance Dataset (No Index) DESCRIPTION: This snippet illustrates how to perform a K-Nearest Neighbors (KNN) search on a Lance dataset without utilizing an index. It measures the execution time to highlight the performance implications of a full dataset scan, demonstrating the need for ANN indexes in real-time scenarios.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/quickstart.md#_snippet_15

LANGUAGE: python CODE:

import time

start = time.time()
tbl = sift1m.to_table(columns=["id"], nearest={"column": "vector", "q": samples[0], "k": 10})
end = time.time()

print(f"Time(sec): {end-start}")
print(tbl.to_pandas())

TITLE: Build Linux ARM64 Manylinux Wheels DESCRIPTION: This command builds release-mode manylinux wheels for ARM64 (aarch64) Linux. It uses maturin with zig for cross-compilation, targeting manylinux2014 compatibility, and places the output wheels in the 'wheels' directory.

SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_24

LANGUAGE: Shell CODE:

maturin build --release --zig \
    --target aarch64-unknown-linux-gnu \
    --compatibility manylinux2014 \
    --out wheels

TITLE: Overwrite Lance Dataset with New Features DESCRIPTION: Writes the modified table, including newly added feature columns, back to the Lance dataset URI, overwriting the existing dataset. This updates the dataset with enriched data.

SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_24

LANGUAGE: python CODE:

sift1m = lance.write_dataset(tbl, uri, mode="overwrite")

TITLE: Append Metadata Columns to LanceDB Dataset DESCRIPTION: This Python snippet illustrates how to append additional metadata columns, such as 'item_id' and 'revenue', to an existing LanceDB dataset. This allows for storing and managing feature data alongside vector embeddings within the same dataset, simplifying data management.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/quickstart.md#_snippet_20

LANGUAGE: python CODE:

tbl = sift1m.to_table()
tbl = tbl.append_column("item_id", pa.array(range(len(tbl))))
tbl = tbl.append_column("revenue", pa.array((np.random.randn(len(tbl))+5)*1000))

TITLE: Create Vector Index in LanceDB (IVF_PQ) DESCRIPTION: This code demonstrates how to create a vector index on a LanceDB dataset. It specifies the vector column, index type (IVF_PQ, IVF_HNSW_PQ, IVF_HNSW_SQ are supported), number of partitions for IVF, and number of sub-vectors for PQ. This improves the efficiency of Approximate Nearest Neighbor (ANN) searches.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/quickstart.md#_snippet_16

LANGUAGE: python CODE:

sift1m.create_index(
    "vector",
    index_type="IVF_PQ", # IVF_PQ, IVF_HNSW_PQ and IVF_HNSW_SQ are supported
    num_partitions=256,  # IVF
    num_sub_vectors=16,  # PQ
)

TITLE: Convert SIFT FVECS data to Lance vector dataset DESCRIPTION: Imports vec_to_table from lance.vector and struct. Reads the SIFT base vectors from sift_base.fvecs, unpacks the binary data into a NumPy array, and converts it into a PyArrow Table using vec_to_table. Finally, it writes this table to a Lance dataset named vec_data.lance, optimizing for vector storage.

SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_15

LANGUAGE: Python CODE:

from lance.vector import vec_to_table
import struct

uri = "vec_data.lance"

with open("sift/sift_base.fvecs", mode="rb") as fobj:
    buf = fobj.read()
    data = np.array(struct.unpack("<128000000f", buf[4 : 4 + 4 * 1000000 * 128])).reshape((1000000, 128))
    dd = dict(zip(range(1000000), data))

table = vec_to_table(dd)
lance.write_dataset(table, uri, max_rows_per_group=8192, max_rows_per_file=1024*1024)

TITLE: Read Lance Dataset in Java DESCRIPTION: This Java snippet demonstrates how to open and access an existing Lance dataset. It uses Dataset.open with the dataset's path and a BufferAllocator to load the dataset. Once opened, it shows how to retrieve basic information such as row count, schema, and version details, providing a starting point for data querying and manipulation.

SOURCE: https://github.com/lancedb/lance/blob/main/java/README.md#_snippet_3

LANGUAGE: Java CODE:

void readDataset() {
    String datasetPath = ""; // specify a path point to a dataset
    try (BufferAllocator allocator = new RootAllocator()) {
        try (Dataset dataset = Dataset.open(datasetPath, allocator)) {
            dataset.countRows();
            dataset.getSchema();
            dataset.version();
            dataset.latestVersion();
            // access more information
        }
    }
}

TITLE: Execute Python S3 Integration Tests DESCRIPTION: Once local S3 services are running, this command executes the Python S3 integration tests using pytest. The --run-integration flag ensures that tests requiring external services are included, specifically targeting the test_s3_ddb.py file.

SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_21

LANGUAGE: Shell CODE:

pytest --run-integration python/tests/test_s3_ddb.py

TITLE: Perform Random Access on Lance Dataset in Java DESCRIPTION: This Java example demonstrates how to perform random access queries on a Lance dataset, retrieving specific rows and columns. It opens an existing dataset, specifies a list of row indices and desired column names, and then uses dataset.take to fetch the corresponding data. The results are processed using an ArrowReader to iterate through batches and access individual field values.

SOURCE: https://github.com/lancedb/lance/blob/main/java/README.md#_snippet_5

LANGUAGE: Java CODE:

void randomAccess() {
    String datasetPath = ""; // specify a path pointing to a dataset
    try (BufferAllocator allocator = new RootAllocator()) {
        try (Dataset dataset = Dataset.open(datasetPath, allocator)) {
            List<Long> indices = Arrays.asList(1L, 4L);
            List<String> columns = Arrays.asList("id", "name");
            try (ArrowReader reader = dataset.take(indices, columns)) {
                while (reader.loadNextBatch()) {
                    VectorSchemaRoot result = reader.getVectorSchemaRoot();
                    result.getRowCount();

                    for (int i = 0; i < indices.size(); i++) {
                        result.getVector("id").getObject(i);
                        result.getVector("name").getObject(i);
                    }
                }
            }
        }
    }
}

TITLE: Load Subset of Lance Dataset with Projection and Predicates DESCRIPTION: This Python example illustrates how to efficiently load a subset of a Lance dataset into memory. It utilizes column projection (columns), filter push-down (filter), and pagination (limit, offset) to optimize data retrieval for large datasets by reducing I/O.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/read_and_write.md#_snippet_14

LANGUAGE: python CODE:

table = ds.to_table(
    columns=["image", "label"],
    filter="label = 2 AND text IS NOT NULL",
    limit=1000,
    offset=3000)

TITLE: Create PyTorch DataLoader from LanceDataset (Unsafe) DESCRIPTION: This example shows how to load a Lance dataset into a PyTorch IterableDataset using lance.torch.data.LanceDataset and then create a standard PyTorch DataLoader. It highlights an inference loop, but notes that this approach is not fork-safe for multiprocessing.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/integrations/pytorch.md#_snippet_1

LANGUAGE: python CODE:

import torch
import lance.torch.data

# Load lance dataset into a PyTorch IterableDataset.
# with only columns "image" and "prompt".
dataset = lance.torch.data.LanceDataset(
    "diffusiondb_train.lance",
    columns=["image", "prompt"],
    batch_size=128,
    batch_readahead=8,  # Control multi-threading reads.
)

# Create a PyTorch DataLoader
dataloader = torch.utils.data.DataLoader(dataset)

# Inference loop
for batch in dataloader:
    inputs, targets = batch["prompt"], batch["image"]
    outputs = model(inputs)
    ...

TITLE: Manage LanceDB Dataset Tags (Create, Update, Delete, List) DESCRIPTION: This Python example demonstrates how to interact with LanceDataset.tags to manage dataset versions. It covers creating a tag for a specific version, updating its associated version, listing all tags, and finally deleting a tag. It also shows how list_ordered() can be used to retrieve tags in the order they were created or last updated.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/tags.md#_snippet_0

LANGUAGE: python CODE:

import lance
ds = lance.dataset("./tags.lance")
print(len(ds.versions()))
# 2
print(ds.tags.list())
# {}
ds.tags.create("v1-prod", 1)
print(ds.tags.list())
# {'v1-prod': {'version': 1, 'manifest_size': ...}}
ds.tags.update("v1-prod", 2)
print(ds.tags.list())
# {'v1-prod': {'version': 2, 'manifest_size': ...}}
ds.tags.delete("v1-prod")
print(ds.tags.list())
# {}
print(ds.tags.list_ordered())
# []
ds.tags.create("v1-prod", 1)
print(ds.tags.list_ordered())
# [('v1-prod', {'version': 1, 'manifest_size': ...})]
ds.tags.update("v1-prod", 2)
print(ds.tags.list_ordered())
# [('v1-prod', {'version': 2, 'manifest_size': ...})]
ds.tags.delete("v1-prod")
print(ds.tags.list_ordered())
# []

TITLE: Write Pandas DataFrame to Lance Dataset DESCRIPTION: Removes any existing Lance dataset at /tmp/test.lance to ensure a clean write. Then, it writes the Pandas DataFrame df to a new Lance dataset and converts the resulting dataset back to a Pandas DataFrame for verification.

SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_2

LANGUAGE: Python CODE:

shutil.rmtree("/tmp/test.lance", ignore_errors=True)

dataset = lance.write_dataset(df, "/tmp/test.lance")
dataset.to_table().to_pandas()

TITLE: Perform K-Nearest Neighbors search without an index DESCRIPTION: Measures the time taken to perform a K-Nearest Neighbors (KNN) search on the sift1m dataset. It queries for the 10 nearest neighbors to the first sampled vector (samples[0]) based on the 'vector' column, demonstrating a full scan approach.

SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_18

LANGUAGE: Python CODE:

import time

start = time.time()
tbl = sift1m.to_table(columns=["id"], nearest={"column": "vector", "q": samples[0], "k": 10})
end = time.time()

print(f"Time(sec): {end-start}")
print(tbl.to_pandas())

TITLE: Write Pandas DataFrame to Lance Dataset DESCRIPTION: This snippet shows how to persist a Pandas DataFrame into a Lance dataset. It first ensures a clean state by removing any existing file and then uses lance.write_dataset to save the DataFrame, followed by reading it back to confirm the write operation.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/quickstart.md#_snippet_2

LANGUAGE: python CODE:

shutil.rmtree("/tmp/test.lance", ignore_errors=True)

dataset = lance.write_dataset(df, "/tmp/test.lance")
dataset.to_table().to_pandas()

TITLE: Join Multiple Lance Datasets with SQL in Rust DataFusion DESCRIPTION: This Rust example illustrates how to register multiple Lance datasets (e.g., 'orders' and 'customers') as separate tables in DataFusion. It then performs a SQL JOIN operation between these tables to combine data based on a common key, demonstrating more complex query capabilities.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/integrations/datafusion.md#_snippet_1

LANGUAGE: rust CODE:

use datafusion::prelude::SessionContext;
use crate::datafusion::LanceTableProvider;

let ctx = SessionContext::new();

ctx.register_table("orders",
    Arc::new(LanceTableProvider::new(
    Arc::new(orders_dataset.clone()),
    /* with_row_id */ false,
    /* with_row_addr */ false,
    )))?;

ctx.register_table("customers",
    Arc::new(LanceTableProvider::new(
    Arc::new(customers_dataset.clone()),
    /* with_row_id */ false,
    /* with_row_addr */ false,
    )))?;

let df = ctx.sql("
    SELECT o.order_id, o.amount, c.customer_name 
    FROM orders o 
    JOIN customers c ON o.customer_id = c.customer_id
    LIMIT 10
").await?;

let result = df.collect().await?;

TITLE: Read ImageURIs into Lance EncodedImageArray DESCRIPTION: This example shows how to use ImageURIArray.read_uris() to load images referenced by URIs into memory. The method returns an EncodedImageArray containing the binary data of the images, enabling direct processing of image content.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/arrays.md#_snippet_4

LANGUAGE: python CODE:

import os

from lance.arrow import ImageURIArray

relative_path = "images/1.png"
uris = [os.path.join(os.path.dirname(__file__), relative_path)]
ImageURIArray.from_uris(uris).read_uris()
# <lance.arrow.EncodedImageArray object at 0x...>
# [b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x00...']

TITLE: Create and Write Lance Dataset from Arrow Stream in Java DESCRIPTION: This Java example illustrates how to create a Lance dataset and populate it with data from an existing Arrow file. It reads bytes from a source path, converts them into an ArrowArrayStream, and then uses Dataset.create with WriteParams to configure writing options like maxRowsPerFile, maxRowsPerGroup, and WriteMode. This method is suitable for ingesting data from Arrow-formatted sources.

SOURCE: https://github.com/lancedb/lance/blob/main/java/README.md#_snippet_2

LANGUAGE: Java CODE:

void createAndWriteDataset() throws IOException, URISyntaxException {
    Path path = Paths.get("");  // the original source Arrow file path
    String datasetPath = "";    // specify a path pointing to a dataset
    try (BufferAllocator allocator = new RootAllocator();
        ArrowFileReader reader =
            new ArrowFileReader(
                new SeekableReadChannel(
                    new ByteArrayReadableSeekableByteChannel(Files.readAllBytes(path))), allocator);
        ArrowArrayStream arrowStream = ArrowArrayStream.allocateNew(allocator)) {
        Data.exportArrayStream(allocator, reader, arrowStream);
        try (Dataset dataset =
                     Dataset.create(
                             allocator,
                             arrowStream,
                             datasetPath,
                             new WriteParams.Builder()
                                     .withMaxRowsPerFile(10)
                                     .withMaxRowsPerGroup(20)
                                     .withMode(WriteParams.WriteMode.CREATE)
                                     .withStorageOptions(new HashMap<>())
                                     .build())) {
            // access dataset
        }
    }
}

TITLE: Generate Flame Graph from Process ID DESCRIPTION: Generates a flame graph for a running process using its Process ID (PID). This command is used to capture and visualize CPU profiles, helping to identify performance bottlenecks in an application.

SOURCE: https://github.com/lancedb/lance/blob/main/wiki/Debug.md#_snippet_5

LANGUAGE: sh CODE:

flamegraph -p <PID>

TITLE: Create Lance BFloat16 Arrow Array DESCRIPTION: This example illustrates how to construct a BFloat16Array directly using the lance.arrow.bfloat16_array function. It takes a list of floating-point numbers and converts them into an Arrow array with BFloat16 precision, suitable for Arrow-based data processing.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/arrays.md#_snippet_1

LANGUAGE: python CODE:

from lance.arrow import bfloat16_array

bfloat16_array([1.1, 2.1, 3.4])
# <lance.arrow.BFloat16Array object at 0x000000016feb94e0>
# [
#   1.1015625,
#   2.09375,
#   3.40625
# ]

TITLE: Clone LanceDB GitHub Repository DESCRIPTION: Instructions to clone the LanceDB project repository from GitHub to a local machine. This is the first step for setting up the development environment.

SOURCE: https://github.com/lancedb/lance/blob/main/java/README.md#_snippet_11

LANGUAGE: shell CODE:

git clone https://github.com/lancedb/lance.git

TITLE: Rust Implementation of WikiTextBatchReader DESCRIPTION: This Rust code defines WikiTextBatchReader, a custom implementation of arrow::record_batch::RecordBatchReader. It's designed to read text data from Parquet files, tokenize it using a Tokenizer from the tokenizers crate, and transform it into Arrow RecordBatches. The process_batch method handles tokenization, limits the number of samples, and shuffles the tokenized IDs before creating the final RecordBatch.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/rust/llm_dataset_creation.md#_snippet_0

LANGUAGE: rust CODE:

use arrow::array::{Array, Int64Builder, ListBuilder, UInt32Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use arrow::record_batch::RecordBatchReader;
use futures::StreamExt;
use hf_hub::{api::sync::Api, Repo, RepoType};
use lance::dataset::WriteParams;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use rand::seq::SliceRandom;
use rand::SeedableRng;
use std::error::Error;
use std::fs::File;
use std::io::Write;
use std::sync::Arc;
use tempfile::NamedTempFile;
use tokenizers::Tokenizer;

// Implement a custom stream batch reader
struct WikiTextBatchReader {
    schema: Arc<Schema>,
    parquet_readers: Vec<Option<ParquetRecordBatchReaderBuilder<File>>>,
    current_reader_idx: usize,
    current_reader: Option<Box<dyn RecordBatchReader + Send>>,
    tokenizer: Tokenizer,
    num_samples: u64,
    cur_samples_cnt: u64,
}

impl WikiTextBatchReader {
    fn new(
        parquet_readers: Vec<ParquetRecordBatchReaderBuilder<File>>,
        tokenizer: Tokenizer,
        num_samples: Option<u64>,
    ) -> Result<Self, Box<dyn Error + Send + Sync>> {
        let schema = Arc::new(Schema::new(vec![Field::new(
            "input_ids",
            DataType::List(Arc::new(Field::new("item", DataType::Int64, true))),
            false,
        )]));

        Ok(Self {
            schema,
            parquet_readers: parquet_readers.into_iter().map(Some).collect(),
            current_reader_idx: 0,
            current_reader: None,
            tokenizer,
            num_samples: num_samples.unwrap_or(100_000),
            cur_samples_cnt: 0,
        })
    }

    fn process_batch(
        &mut self,
        input_batch: &RecordBatch,
    ) -> Result<RecordBatch, arrow::error::ArrowError> {
        let num_rows = input_batch.num_rows();
        let mut token_builder = ListBuilder::new(Int64Builder::with_capacity(num_rows * 1024)); // Pre-allocate space
        let mut should_break = false;

        let column = input_batch.column_by_name("text").unwrap();
        let string_array = column
            .as_any()
            .downcast_ref::<arrow::array::StringArray>()
            .unwrap();
        for i in 0..num_rows {
            if self.cur_samples_cnt >= self.num_samples {
                should_break = true;
                break;
            }
            if !Array::is_null(string_array, i) {
                let text = string_array.value(i);
                // Split paragraph into lines
                for line in text.split('\n') {
                    if let Ok(encoding) = self.tokenizer.encode(line, true) {
                        let tb_values = token_builder.values();
                        for &id in encoding.get_ids() {
                            tb_values.append_value(id as i64);
                        }
                        token_builder.append(true);
                        self.cur_samples_cnt += 1;
                        if self.cur_samples_cnt % 5000 == 0 {
                            println!("Processed {} rows", self.cur_samples_cnt);
                        }
                        if self.cur_samples_cnt >= self.num_samples {
                            should_break = true;
                            break;
                        }
                    }
                }
            }
        }

        // Create array and shuffle it
        let input_ids_array = token_builder.finish();

        // Create shuffled array by randomly sampling indices
        let mut rng = rand::rngs::StdRng::seed_from_u64(1337);
        let len = input_ids_array.len();
        let mut indices: Vec<u32> = (0..len as u32).collect();
        indices.shuffle(&mut rng);

        // Take values in shuffled order
        let indices_array = UInt32Array::from(indices);
        let shuffled = arrow::compute::take(&input_ids_array, &indices_array, None)?;

        let batch = RecordBatch::try_new(self.schema.clone(), vec![Arc::new(shuffled)]);
        if should_break {
            println!("Stop at {} rows", self.cur_samples_cnt);
            self.parquet_readers.clear();
            self.current_reader = None;
        }

        batch
    }
}

impl RecordBatchReader for WikiTextBatchReader {
    fn schema(&self) -> Arc<Schema> {
        self.schema.clone()
    }
}

impl Iterator for WikiTextBatchReader {
    type Item = Result<RecordBatch, arrow::error::ArrowError>;
    fn next(&mut self) -> Option<Self::Item> {
        loop {
            // If we have a current reader, try to get next batch
            if let Some(reader) = &mut self.current_reader {
                if let Some(batch_result) = reader.next() {
                    return Some(batch_result.and_then(|batch| self.process_batch(&batch)));
                }
            }

            // If no current reader or current reader is exhausted, try to get next reader
            if self.current_reader_idx < self.parquet_readers.len() {

TITLE: Inefficient Row Update by Iteration DESCRIPTION: Provides an example of an inefficient way to update multiple individual rows by iterating through a table and calling update for each row. It notes that a merge insert operation is generally more efficient for bulk updates.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/read_and_write.md#_snippet_6

LANGUAGE: python CODE:

import lance
import pyarrow as pa

# Change the ages of both Alice and Bob
new_table = pa.Table.from_pylist([{"name": "Alice", "age": 30},
                                  {"name": "Bob", "age": 20}])

# This works, but is inefficient, see below for a better approach
dataset = lance.dataset("./alice_and_bob.lance")
for idx in range(new_table.num_rows):
  name = new_table[0][idx].as_py()
  new_age = new_table[1][idx].as_py()
  dataset.update({"age": new_age}, where=f"name='{name}'")
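
For bulk changes like this, a single merge insert is typically the better approach. Below is a hedged sketch using LanceDataset.merge_insert keyed on "name" (same data as above; treat the exact builder chain as an assumption to verify against the API docs):

import lance
import pyarrow as pa

new_table = pa.Table.from_pylist([{"name": "Alice", "age": 30},
                                  {"name": "Bob", "age": 20}])

dataset = lance.dataset("./alice_and_bob.lance")
# Match incoming rows to existing rows on "name" and update them in one operation
dataset.merge_insert("name") \
    .when_matched_update_all() \
    .execute(new_table)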

TITLE: Generate and Merge Columns in Parallel with Ray and Lance DESCRIPTION: This example illustrates how to generate new columns in parallel using Ray and Lance. It defines an Arrow schema, creates an initial dataset with 'id', 'height', and 'weight' columns, and then uses a custom Python function (generate_labels) to add a new 'size_labels' column based on existing 'height' data, demonstrating Lance's add_columns functionality for parallel processing.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/integrations/ray.md#_snippet_1

LANGUAGE: python CODE:

import ray
import pyarrow as pa
from pathlib import Path
import lance

# Example output location for the generated dataset (assumed; not part of the original snippet)
output_path = Path("./ray_lance_demo.lance")

# Define schema
schema = pa.schema([
    pa.field("id", pa.int64()),
    pa.field("height", pa.int64()),
    pa.field("weight", pa.int64()),
])

# Generate initial dataset
ds = (
    ray.data.range(10)  # Create 0-9 IDs
    .map(lambda x: {
        "id": x["id"],
        "height": x["id"] + 5,  # height = id + 5
        "weight": x["id"] * 2   # weight = id * 2
    })
    .write_lance(str(output_path), schema=schema)
)

# Define label generation logic
def generate_labels(batch: pa.RecordBatch) -> pa.RecordBatch:
    heights = batch.column("height").to_pylist()
    size_labels = ["tall" if h > 8 else "medium" if h > 6 else "short" for h in heights]
    return pa.RecordBatch.from_arrays([
        pa.array(size_labels)
    ], names=["size_labels"])

# Add new columns in parallel
lance_ds = lance.dataset(output_path)
add_columns(
    lance_ds,
    generate_labels,
    source_columns=["height"],  # Input columns needed
)

# Display final results
final_df = lance_ds.to_table().to_pandas()
print("\nEnhanced dataset with size labels:\n")
print(final_df.sort_values("id").to_string(index=False))

TITLE: Configure Python Benchmark for Single Iteration Tracing DESCRIPTION: When tracing Python benchmarks, it's often useful to force them to run only once for sensible results. This snippet demonstrates how to use the pedantic API to limit a benchmark to a single iteration and round, ensuring a focused trace.

SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_19

LANGUAGE: Python CODE:

def run():
    "Put code to benchmark here"
    ...
benchmark.pedantic(run, iterations=1, rounds=1)

TITLE: Enable Tracing for Python Script DESCRIPTION: To trace a Python script, import the trace_to_chrome function from lance.tracing and call it at the beginning of your script, specifying the desired tracing level. A single JSON trace file will be generated upon the script's exit, suitable for Chrome's trace viewer.

SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_18

LANGUAGE: Python CODE:

from lance.tracing import trace_to_chrome

trace_to_chrome(level="debug")

# rest of script

TITLE: LanceDB Encoding Metadata Key Specifications DESCRIPTION: This section provides a detailed specification of the metadata keys used in LanceDB for column-level encoding. It describes each key's type, purpose, example values, and how it's used in Python to configure data storage and optimization.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/format/format.md#_snippet_8

LANGUAGE: APIDOC CODE:

Metadata Key Specifications:
- lance-encoding:compression
  Type: Compression
  Description: Specifies compression algorithm
  Example Values: zstd
  Example Usage (Python): metadata={"lance-encoding:compression": "zstd"}
- lance-encoding:compression-level
  Type: Compression
  Description: Zstd compression level (1-22)
  Example Values: 3
  Example Usage (Python): metadata={"lance-encoding:compression-level": "3"}
- lance-encoding:blob
  Type: Storage
  Description: Marks binary data (>4MB) for chunked storage
  Example Values: true/false
  Example Usage (Python): metadata={"lance-encoding:blob": "true"}
- lance-encoding:packed
  Type: Optimization
  Description: Struct memory layout optimization
  Example Values: true/false
  Example Usage (Python): metadata={"lance-encoding:packed": "true"}
- lance-encoding:structural-encoding
  Type: Nested Data
  Description: Encoding strategy for nested structures
  Example Values: miniblock/fullzip
  Example Usage (Python): metadata={"lance-encoding:structural-encoding": "miniblock"}
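
As an illustration of how these keys are attached in practice, the following sketch (with an assumed column name and output path) sets zstd compression on a string column via PyArrow field metadata before writing:

import lance
import pyarrow as pa

# Field-level metadata carries the encoding hints listed above
schema = pa.schema([
    pa.field(
        "text",
        pa.large_string(),
        metadata={
            "lance-encoding:compression": "zstd",
            "lance-encoding:compression-level": "3",
        },
    ),
])

table = pa.table({"text": ["hello", "world"]}, schema=schema)
lance.write_dataset(table, "compressed_text.lance", mode="overwrite")  # example path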

TITLE: Initialize Tokenizer and Load Wikitext Dataset (Python) DESCRIPTION: This snippet initializes a Hugging Face tokenizer (gpt2) and loads the wikitext-103-raw-v1 dataset in streaming mode. The 'streaming=True' argument is crucial for processing large datasets without downloading the entire dataset upfront, allowing samples to be downloaded as needed.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/llm_dataset_creation.md#_snippet_0

LANGUAGE: python CODE:

import lance
import pyarrow as pa

from datasets import load_dataset
from transformers import AutoTokenizer
from tqdm.auto import tqdm  # optional for progress tracking

tokenizer = AutoTokenizer.from_pretrained('gpt2')

dataset = load_dataset('wikitext', 'wikitext-103-raw-v1', streaming=True)['train']
dataset = dataset.shuffle(seed=1337)

TITLE: Example of Hierarchical Schema Definition DESCRIPTION: This snippet demonstrates a sample schema definition within the LanceDB data format, showcasing primitive types, nested structs, and lists. It illustrates how complex data structures are defined before being flattened into a field list for metadata representation.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/format/format.md#_snippet_6

LANGUAGE: APIDOC CODE:

a: i32
b: struct {
    c: list<i32>
    d: i32
}
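
For reference, the same hierarchy can be expressed with PyArrow types; this is only an illustrative sketch, and flattening it in pre-order yields the field list a, b, b.c, b.d:

import pyarrow as pa

schema = pa.schema([
    pa.field("a", pa.int32()),
    pa.field("b", pa.struct([
        pa.field("c", pa.list_(pa.int32())),
        pa.field("d", pa.int32()),
    ])),
])
print(schema)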

TITLE: Define Custom PyTorch Dataset for Lance Data DESCRIPTION: The LanceDataset class extends PyTorch's Dataset to provide an interface for loading data from a Lance dataset. It initializes by loading the specified Lance dataset and setting a block_size for token windows. The __len__ method calculates the total number of possible starting indices, while __getitem__ generates a window of indices and uses the from_indices utility to load and return corresponding 'input_ids' and 'labels' as PyTorch tensors, forming a causal sample for LLM training.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/llm_training.md#_snippet_2

LANGUAGE: python CODE:

class LanceDataset(Dataset):
    def __init__(
        self,
        dataset_path,
        block_size,
    ):
        # Load the lance dataset from the saved path
        self.ds = lance.dataset(dataset_path)
        self.block_size = block_size

        # Doing this so the sampler never asks for an index at the end of text
        self.length = self.ds.count_rows() - block_size

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        """
        Generate a window of indices starting from the current idx to idx+block_size
        and return the tokens at those indices
        """
        window = np.arange(idx, idx + self.block_size)
        sample = from_indices(self.ds, window)

        return {"input_ids": torch.tensor(sample), "labels": torch.tensor(sample)}

TITLE: Complex SQL Filter Expression for Lance Dataset DESCRIPTION: This SQL snippet provides an example of a complex filter expression that can be pushed down to the Lance storage system. It demonstrates the use of IN, AND, OR, NOT, and nested field access for filtering data efficiently at the storage layer.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/read_and_write.md#_snippet_16

LANGUAGE: sql CODE:

((label IN [10, 20]) AND (note['email'] IS NOT NULL))
    OR NOT note['created']
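
Such an expression is passed as-is through the filter argument when scanning. A minimal sketch, assuming a dataset at an illustrative path with a label column and a struct column note containing email and created fields:

import lance

ds = lance.dataset("./example.lance")  # assumed dataset path
table = ds.to_table(
    filter="((label IN [10, 20]) AND (note['email'] IS NOT NULL)) OR NOT note['created']"
)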

TITLE: Tune ANN Search Parameters in LanceDB (nprobes, refine_factor) DESCRIPTION: This code demonstrates how to tune the performance of an Approximate Nearest Neighbor (ANN) search in LanceDB by adjusting 'nprobes' and 'refine_factor'. 'nprobes' controls the number of IVF partitions to search, while 'refine_factor' determines how many vectors are retrieved for re-ranking, balancing latency and recall.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/quickstart.md#_snippet_18

LANGUAGE: python CODE:

%%time

sift1m.to_table(
    nearest={
        "column": "vector",
        "q": samples[0],
        "k": 10,
        "nprobes": 10,
        "refine_factor": 5,
    }
).to_pandas()

TITLE: Querying Lance Datasets with DuckDB in Python DESCRIPTION: This snippet demonstrates how to perform SQL queries on a Lance dataset using DuckDB in Python. It shows examples of selecting all data and calculating the mean of a column, illustrating DuckDB's direct access to Lance datasets via Arrow compatibility.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/integrations/duckdb.md#_snippet_0

LANGUAGE: Python CODE:

import duckdb # pip install duckdb

duckdb.query("SELECT * FROM my_lance_dataset")
# ┌─────────────┬─────────┬────────┐
# │   vector    │  item   │ price  │
# │   float[]   │ varchar │ double │
# ├─────────────┼─────────┼────────┤
# │ [3.1, 4.1]  │ foo     │   10.0 │
# │ [5.9, 26.5] │ bar     │   20.0 │
# └─────────────┴─────────┴────────┘

duckdb.query("SELECT mean(price) FROM my_lance_dataset")
# ┌─────────────┐
# │ mean(price) │
# │   double    │
# ├─────────────┤
# │        15.0 │
# └─────────────┘

TITLE: Use Sharded Sampler with LanceDataset for Distributed Training DESCRIPTION: This example illustrates how to integrate lance.sampler.ShardedFragmentSampler with LanceDataset to control the data sampling strategy for distributed training environments. It shows how to configure the sampler with the current process's rank and the total number of processes (world size) for sharded data access.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/integrations/pytorch.md#_snippet_3

LANGUAGE: python CODE:

from lance.sampler import ShardedFragmentSampler
from lance.torch.data import LanceDataset

# Load lance dataset into a PyTorch IterableDataset.
# with only columns "image" and "prompt".
dataset = LanceDataset(
    "diffusiondb_train.lance",
    columns=["image", "prompt"],
    batch_size=128,
    batch_readahead=8,  # Control multi-threading reads.
    sampler=ShardedFragmentSampler(
        rank=1,  # Rank of the current process
        world_size=8,  # Total number of processes
    ),
)

TITLE: Filter and Select Columns from Lance Dataset in TensorFlow DESCRIPTION: This example illustrates efficient data loading from Lance into TensorFlow by specifying desired columns and applying filter conditions. It leverages Lance's columnar format for optimized data retrieval, reducing memory and processing overhead.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/integrations/tensorflow.md#_snippet_1

LANGUAGE: python CODE:

ds = lance.tf.data.from_lance(
    "s3://my-bucket/my-dataset",
    columns=["image", "label"],
    filter="split = 'train' AND collected_time > timestamp '2020-01-01'",
    batch_size=256)

TITLE: Python: Decode EncodedImageArray to FixedShapeImageTensorArray DESCRIPTION: This Python example demonstrates how to load images from URIs into an ImageURIArray, read them into an EncodedImageArray, and then decode them into a FixedShapeImageTensorArray. It also illustrates how to provide a custom TensorFlow-based decoder function for the to_tensor method, allowing for flexible image processing.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/arrays.md#_snippet_5

LANGUAGE: python CODE:

import os

from lance.arrow import ImageURIArray

uris = [os.path.join(os.path.dirname(__file__), "images/1.png")]
encoded_images = ImageURIArray.from_uris(uris).read_uris()
print(encoded_images.to_tensor())

def tensorflow_decoder(images):
    import tensorflow as tf
    import numpy as np

    return np.stack([tf.io.decode_png(img.as_py(), channels=3) for img in images.storage])

print(encoded_images.to_tensor(tensorflow_decoder))

TITLE: Add and Populate Columns with Python UDF in Lance DESCRIPTION: Shows how to add and populate new columns in a Lance dataset using a custom Python function (UDF). The UDF processes data in batches, and the example includes using lance.batch_udf with checkpointing for robust, expensive computations.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/data_evolution.md#_snippet_2

LANGUAGE: python CODE:

import lance
import numpy as np
import pandas as pd
import pyarrow as pa

table = pa.table({"id": pa.array([1, 2, 3])})
dataset = lance.write_dataset(table, "ids")

@lance.batch_udf(checkpoint_file="embedding_checkpoint.sqlite")
def add_random_vector(batch):
    embeddings = np.random.rand(batch.num_rows, 128).astype("float32")
    return pd.DataFrame({"embedding": list(embeddings)})  # one 128-dim vector per row
dataset.add_columns(add_random_vector)

TITLE: Construct OpenAI prompt with context DESCRIPTION: Defines a function create_prompt that takes a query and contextual information to build a structured prompt for a large language model. It dynamically appends context, ensuring the total prompt length stays within a specified token limit for the LLM.

SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/youtube_transcript_search.ipynb#_snippet_10

LANGUAGE: python CODE:

def create_prompt(query, context):
    limit = 3750

    prompt_start = (
        "Answer the question based on the context below.\n\n"+
        "Context:\n"
    )
    prompt_end = (
        f"\n\nQuestion: {query}\nAnswer:"
    )
    # append contexts until hitting limit
    for i in range(1, len(context)):
        if len("\n\n---\n\n".join(context.text[:i])) >= limit:
            prompt = (
                prompt_start +
                "\n\n---\n\n".join(context.text[:i-1]) +
                prompt_end
            )
            break
        elif i == len(context)-1:
            prompt = (
                prompt_start +
                "\n\n---\n\n".join(context.text) +
                prompt_end
            )    
    return prompt

TITLE: Set DYLD_LIBRARY_PATH for Lance Python Debugging in LLDB DESCRIPTION: Configures the DYLD_LIBRARY_PATH environment variable specifically for debugging Lance Python projects within LLDB. This ensures that the dynamic linker can find necessary shared libraries located in the third-party distribution directory.

SOURCE: https://github.com/lancedb/lance/blob/main/wiki/Debug.md#_snippet_1

LANGUAGE: lldb CODE:

# /path/to/lance/python/.lldbinit
env DYLD_LIBRARY_PATH=/path/to/thirdparty/dist/lib:${DYLD_LIBRARY_PATH}

TITLE: Rename Top-Level Columns in LanceDB Dataset DESCRIPTION: This snippet illustrates how to rename top-level columns in a LanceDB dataset using the lance.LanceDataset.alter_columns method. It shows a simple example of changing a column name and verifying the change by printing the dataset as a Pandas DataFrame.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/data_evolution.md#_snippet_5

LANGUAGE: python CODE:

table = pa.table({"id": pa.array([1, 2, 3])})
dataset = lance.write_dataset(table, "ids")
dataset.alter_columns({"path": "id", "name": "new_id"})
print(dataset.to_table().to_pandas())
#    new_id
# 0       1
# 1       2
# 2       3

TITLE: Python: Encode FixedShapeImageTensorArray to EncodedImageArray DESCRIPTION: This Python example shows how to convert a FixedShapeImageTensorArray back into an EncodedImageArray. It first obtains a tensor array by decoding an EncodedImageArray (which was read from URIs) and then calls the to_encoded() method. This process is useful for saving processed images back into a compressed format.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/arrays.md#_snippet_6

LANGUAGE: python CODE:

from lance.arrow import ImageURIArray

uris = [image_uri]
tensor_images = ImageURIArray.from_uris(uris).read_uris().to_tensor()
tensor_images.to_encoded()

TITLE: Initialize LLM Training Environment with GPT2 and Lance DESCRIPTION: This snippet imports essential libraries for LLM training, including Lance, PyTorch, and Hugging Face Transformers. It initializes the GPT2 tokenizer and model from pre-trained weights. Key hyperparameters such as learning rate, epochs, block size, batch size, device, and the Lance dataset path are defined, preparing the environment for subsequent data loading and model training.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/llm_training.md#_snippet_0

LANGUAGE: python CODE:

import numpy as np
import lance

import torch
from torch.utils.data import Dataset, DataLoader, Sampler

from transformers import AutoTokenizer, AutoModelForCausalLM
from tqdm.auto import tqdm

# We'll be training the pre-trained GPT2 model in this example
model_name = 'gpt2'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Also define some hyperparameters
lr = 3e-4
nb_epochs = 10
block_size = 1024
batch_size = 8
device = 'cuda:0'
dataset_path = 'wikitext_500K.lance'

TITLE: Define context window and stride parameters DESCRIPTION: Initializes window and stride variables for creating rolling contextual windows from text data. These parameters define the size of each context (number of sentences) and the step size for generating subsequent contexts, respectively.

SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/youtube_transcript_search.ipynb#_snippet_3

LANGUAGE: python CODE:

import numpy as np
import pandas as pd

window = 20
stride = 4

TITLE: Append New Fragments to an Existing Lance Dataset DESCRIPTION: This example illustrates how to append new data to an existing Lance dataset. It retrieves the current dataset version, uses lance.LanceOperation.Append with the collected fragments, and commits them, ensuring the read_version is correctly set to maintain data consistency during the append operation.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/distributed_write.md#_snippet_2

LANGUAGE: python CODE:

import lance

ds = lance.dataset(data_uri)
read_version = ds.version # record the read version

op = lance.LanceOperation.Append(schema, all_fragments)
lance.LanceDataset.commit(
    data_uri,
    op,
    read_version=read_version,
)

TITLE: Extract Video Frames from Lance Blob Data in Python DESCRIPTION: This Python example illustrates how to fetch and process large binary video data stored as blobs in a Lance dataset. It uses lance.dataset.LanceDataset.take_blobs to retrieve a BlobFile object, then leverages the av library to open the video and extract frames within a specified time range without loading the entire video into memory.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/blob.md#_snippet_1

LANGUAGE: python CODE:

import av # pip install av
import lance

ds = lance.dataset("./youtube.lance")
start_time, end_time = 500, 1000
blobs = ds.take_blobs([5], "video")
with av.open(blobs[0]) as container:
    stream = container.streams.video[0]
    stream.codec_context.skip_frame = "NONKEY"

    start_time = start_time / stream.time_base
    start_time = start_time.as_integer_ratio()[0]
    end_time = end_time / stream.time_base
    container.seek(start_time, stream=stream)

    for frame in container.decode(stream):
        if frame.time > end_time:
            break
        display(frame.to_image())
        clear_output(wait=True)

TITLE: Perform Approximate Nearest Neighbor (ANN) Search in LanceDB DESCRIPTION: This Python snippet shows how to perform an Approximate Nearest Neighbor (ANN) search on a LanceDB dataset with an existing index. It queries a specified vector column for the 'k' nearest neighbors to a given query vector 'q', measuring the average query time. The result is converted to a Pandas DataFrame for display.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/quickstart.md#_snippet_17

LANGUAGE: python CODE:

sift1m = lance.dataset(uri)

import time

tot = 0
for q in samples:
    start = time.time()
    tbl = sift1m.to_table(nearest={"column": "vector", "q": q, "k": 10})
    end = time.time()
    tot += (end - start)

print(f"Avg(sec): {tot / len(samples)}")
print(tbl.to_pandas())

TITLE: Cast Column Data Types in LanceDB Dataset DESCRIPTION: This snippet explains how to change the data type of a column in a LanceDB dataset using lance.LanceDataset.alter_columns. It notes that this operation rewrites only the affected column's data files and that any existing index on the column will be dropped. An example is provided for converting a float32 embedding column to float16 to save disk space.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/data_evolution.md#_snippet_7

LANGUAGE: python CODE:

table = pa.table({
   "id": pa.array([1, 2, 3]),
   "embedding": pa.FixedShapeTensorArray.from_numpy_ndarray(
       np.random.rand(3, 128).astype("float32"))
})
dataset = lance.write_dataset(table, "embeddings")
dataset.alter_columns({"path": "embedding",
                       "data_type": pa.list_(pa.float16(), 128)})
print(dataset.schema)
# id: int64
# embedding: fixed_size_list<item: halffloat>[128]
#   child 0, item: halffloat

TITLE: Call OpenAI Completion API for text generation DESCRIPTION: Defines the complete function to interact with OpenAI's text-davinci-003 model. It sends a given prompt and retrieves the generated text completion, configuring parameters like temperature, max tokens, and presence/frequency penalties for desired output characteristics.

SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/youtube_transcript_search.ipynb#_snippet_11

LANGUAGE: python CODE:

def complete(prompt):
    # query text-davinci-003
    res = openai.Completion.create(
        engine='text-davinci-003',
        prompt=prompt,
        temperature=0,
        max_tokens=400,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
        stop=None
    )
    return res['choices'][0]['text'].strip()

# check that it works
query = "who was the 12th person on the moon and when did they land?"
complete(query)

TITLE: Build LanceDB Java Project with Maven DESCRIPTION: Provides the Maven command to clean and package the entire LanceDB Java project, including its dependencies and sub-modules. This command compiles the Java code and prepares it for deployment.

SOURCE: https://github.com/lancedb/lance/blob/main/java/README.md#_snippet_9

LANGUAGE: shell CODE:

mvn clean package

TITLE: Import IPython.display for multimedia output DESCRIPTION: Imports the YouTubeVideo class from IPython.display. This class is essential for embedding and displaying YouTube videos directly within an IPython or Jupyter environment, allowing for rich multimedia output.

SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/youtube_transcript_search.ipynb#_snippet_13

LANGUAGE: python CODE:

from IPython.display import YouTubeVideo

TITLE: Initialize CLIP Model Instances, Tokenizer, and PyTorch Optimizer DESCRIPTION: This snippet initializes instances of the ImageEncoder, TextEncoder, and Head modules, along with a Hugging Face AutoTokenizer. It then sets up a PyTorch Adam optimizer, explicitly defining separate learning rates for the image encoder, text encoder, and the combined head modules, preparing the model for training.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/clip_training.md#_snippet_7

LANGUAGE: python CODE:

# Define image encoder, image head, text encoder, text head and a tokenizer for tokenizing the caption
img_encoder = ImageEncoder(model_name=Config.img_encoder_model).to('cuda')
img_head = Head(Config.img_embed_dim, Config.projection_dim).to('cuda')

tokenizer = AutoTokenizer.from_pretrained(Config.text_encoder_model)
text_encoder = TextEncoder(model_name=Config.text_encoder_model).to('cuda')
text_head = Head(Config.text_embed_dim, Config.projection_dim).to('cuda')

# Since we are optimizing two different models together, we will define parameters manually
parameters = [
    {"params": img_encoder.parameters(), "lr": Config.img_enc_lr},
    {"params": text_encoder.parameters(), "lr": Config.text_enc_lr},
    {
        "params": itertools.chain(
            img_head.parameters(),
            text_head.parameters(),
        ),
        "lr": Config.head_lr,
    },
]

optimizer = torch.optim.Adam(parameters)

TITLE: Build vector index for LanceDB dataset DESCRIPTION: Creates an IVF_PQ (Inverted File Index with Product Quantization) index on the 'vector' column of the LanceDB dataset. This indexing significantly speeds up similarity search queries, making the retrieval of relevant contexts much faster.

SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/youtube_transcript_search.ipynb#_snippet_9

LANGUAGE: python CODE:

ds = ds.create_index("vector",
                     index_type="IVF_PQ", 
                     num_partitions=64,  # IVF
                     num_sub_vectors=96)  # PQ

TITLE: Import necessary modules for Lance and PyTorch deep learning artifact management DESCRIPTION: This snippet imports essential Python libraries required for deep learning artifact management using Lance. It includes os and shutil for file system operations, lance for data storage, pyarrow for schema definition, torch for PyTorch model handling, and collections.OrderedDict for managing model state dictionaries.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/artifact_management.md#_snippet_0

LANGUAGE: python CODE:

import os
import shutil
import lance
import pyarrow as pa
import torch
from collections import OrderedDict

TITLE: Download and extract MeCab Ipadic model DESCRIPTION: This snippet downloads the gzipped tarball of the MeCab Ipadic model from GitHub and then extracts its contents using tar. This is the first step in preparing the dictionary for building.

SOURCE: https://github.com/lancedb/lance/blob/main/python/python/tests/models/lindera/README.md#_snippet_0

LANGUAGE: bash CODE:

curl -L -o mecab-ipadic-2.7.0-20070801.tar.gz "https://github.com/lindera-morphology/mecab-ipadic/archive/refs/tags/2.7.0-20070801.tar.gz"
tar xvf mecab-ipadic-2.7.0-20070801.tar.gz

TITLE: Process Image Captions and Images for Lance Dataset in Python DESCRIPTION: This Python function process takes a list of image captions, reads corresponding image files, converts them to binary, and yields PyArrow RecordBatches. Each batch contains image_id, binary image data, and a list of captions, preparing data for a Lance dataset. It handles FileNotFoundError for missing images and uses tqdm for progress indication.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/flickr8k_dataset_creation.md#_snippet_3

LANGUAGE: python CODE:

def process(captions):
    for img_id, img_captions in tqdm(captions):
        try:
            with open(os.path.join(image_folder, img_id), 'rb') as im:
                binary_im = im.read()
                
        except FileNotFoundError:
            print(f"img_id '{img_id}' not found in the folder, skipping.")
            continue
        
        img_id = pa.array([img_id], type=pa.string())
        img = pa.array([binary_im], type=pa.binary())
        capt = pa.array([img_captions], pa.list_(pa.string(), -1))
        
        yield pa.RecordBatch.from_arrays(
            [img_id, img, capt], 
            ["image_id", "image", "captions"]
        )

TITLE: Create Empty Lance Dataset in Java DESCRIPTION: This Java code demonstrates how to create a new, empty Lance dataset at a specified path. It defines the dataset's schema with 'id' (Int32) and 'name' (Utf8) fields, initializes a BufferAllocator, and uses Dataset.create to persist the schema. The snippet also shows how to access dataset version information immediately after creation.

SOURCE: https://github.com/lancedb/lance/blob/main/java/README.md#_snippet_1

LANGUAGE: Java CODE:

void createDataset() throws IOException, URISyntaxException {
    String datasetPath = tempDir.resolve("write_stream").toString();
    Schema schema =
            new Schema(
                    Arrays.asList(
                            Field.nullable("id", new ArrowType.Int(32, true)),
                            Field.nullable("name", new ArrowType.Utf8())),
                    null);
    try (BufferAllocator allocator = new RootAllocator()) {
        try (Dataset dataset = Dataset.create(allocator, datasetPath, schema, new WriteParams.Builder().build())) {
            dataset.version();
            dataset.latestVersion();
        }
    }
}

TITLE: Generate contextual text windows from video transcripts DESCRIPTION: Defines the contextualize function to create overlapping text contexts from video transcripts. It processes each video, combining sentences into windows based on window and stride parameters, and returns a new DataFrame with these generated contexts.

SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/youtube_transcript_search.ipynb#_snippet_4

LANGUAGE: python CODE:

def contextualize(raw_df, window, stride):
    def process_video(vid):
        # For each video, create the text rolling window
        text = vid.text.values
        time_end = vid["end"].values
        contexts = vid.iloc[:-window:stride, :].copy()
        contexts["text"] = [' '.join(text[start_i:start_i+window])
                            for start_i in range(0, len(vid)-window, stride)]
        contexts["end"] = [time_end[start_i+window-1]
                            for start_i in range(0, len(vid)-window, stride)]        
        return contexts
    # concat result from all videos
    return pd.concat([process_video(vid) for _, vid in raw_df.groupby("title")])

df = contextualize(data.to_pandas(), 20, 4)

TITLE: Display answer and relevant YouTube video segment DESCRIPTION: Executes the full Q&A pipeline: poses a query, retrieves the answer and relevant context, prints the generated answer, and then displays the most relevant YouTube video segment using YouTubeVideo at the precise timestamp where the context was found.

SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/youtube_transcript_search.ipynb#_snippet_14

LANGUAGE: python CODE:

query = ("Which training method should I use for sentence transformers "
                     "when I only have pairs of related sentences?")
completion, context = answer(query)

print(completion)
top_match = context.iloc[0]
YouTubeVideo(top_match["url"].split("/")[-1], start=top_match["start"])

TITLE: Create LanceDB dataset from embeddings and contexts DESCRIPTION: Converts the generated embeddings into a LanceDB vector table and combines it with the original contextualized DataFrame. This process creates a new LanceDB dataset named 'chatbot.lance' on disk, ready for efficient vector search operations.

SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/youtube_transcript_search.ipynb#_snippet_8

LANGUAGE: python CODE:

import lance
import pyarrow as pa
from lance.vector import vec_to_table

table = vec_to_table(np.array(embeds))
combined = pa.Table.from_pandas(df).append_column("vector", table["vector"])
ds = lance.write_dataset(combined, "chatbot.lance")

TITLE: Create LanceDB Index for GIST-1M Dataset DESCRIPTION: Builds an index on the GIST-1M Lance dataset using index.py. The specified parameters for IVF partitions (-i) and PQ subvectors (-p) are crucial for optimizing query performance.

SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/sift/README.md#_snippet_11

LANGUAGE: sh CODE:

./index.py ./.lancedb/gist1m.lance -i 256 -p 120

TITLE: Generate Lance Dataset DESCRIPTION: This command executes the datagen.py script to create the Lance dataset required for the Cohere wiki text embedding benchmark.

SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/wiki/README.md#_snippet_0

LANGUAGE: bash CODE:

python datagen.py

TITLE: Generate answer using vector search and LLM DESCRIPTION: Combines embedding generation, LanceDB vector search, and prompt creation to answer a question. It first embeds the query, then finds the most relevant contexts using vector similarity in LanceDB, and finally uses an LLM to formulate an answer based on those retrieved contexts.

SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/youtube_transcript_search.ipynb#_snippet_12

LANGUAGE: python CODE:

def answer(question):
    emb = embed_func(question)[0]
    context = ds.to_table(
        nearest={
            "column": "vector",
            "k": 3,
            "q": emb,
            "nprobes": 20,
            "refine_factor": 100
        }).to_pandas()
    prompt = create_prompt(question, context)
    return complete(prompt), context.reset_index()

TITLE: Create LanceDB Index for SIFT-1M Dataset DESCRIPTION: Builds an index on the SIFT-1M Lance dataset using index.py. The specified parameters for IVF partitions (-i) and PQ subvectors (-p) are crucial for optimizing query performance.

SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/sift/README.md#_snippet_6

LANGUAGE: sh CODE:

./index.py ./.lancedb/sift1m.lance -i 256 -p 16

TITLE: LanceDB Manifest Naming Schemes DESCRIPTION: Describes the V1 (legacy) and V2 (new) naming conventions for manifest files in LanceDB, emphasizing the V2 scheme's zero-padded, descending-sortable versioning for efficient latest manifest retrieval.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/format/format.md#_snippet_10

LANGUAGE: APIDOC CODE:

Manifest Naming Schemes:
  V1: _versions/{version}.manifest
  V2: _versions/{u64::MAX - version:020}.manifest
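
For illustration only (this helper is not part of the Lance API), a minimal Python sketch that derives both file names for a given version number, following the schemes above:

U64_MAX = 2**64 - 1

def manifest_path_v1(version: int) -> str:
    return f"_versions/{version}.manifest"

def manifest_path_v2(version: int) -> str:
    # Zero-padded inverse version: newer versions sort lexically first.
    return f"_versions/{U64_MAX - version:020}.manifest"

print(manifest_path_v1(5))  # _versions/5.manifest
print(manifest_path_v2(5))  # _versions/18446744073709551610.manifest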

TITLE: Initialize LanceDB Dataset and PyTorch DataLoader DESCRIPTION: This snippet demonstrates how to initialize a CLIPLanceDataset using a LanceDB file (flickr8k.lance) and then wrap it with a PyTorch DataLoader. It configures the dataset with tokenization and augmentations, and the dataloader for efficient batch processing during training.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/clip_training.md#_snippet_8

LANGUAGE: python CODE:

dataset = CLIPLanceDataset(
    lance_path="flickr8k.lance",
    max_len=Config.max_len,
    tokenizer=tokenizer,
    transforms=train_augments
)

dataloader = DataLoader(
    dataset,
    shuffle=False,
    batch_size=Config.bs,
    pin_memory=True
)

TITLE: Run GIST-1M Benchmark and Store Results DESCRIPTION: Executes the benchmark for GIST-1M using metrics.py, querying the indexed dataset with specified parameters like number of results to fetch (-k) and query vectors (-q). The results, including mean query time and recall@1, are saved to a CSV file.

SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/sift/README.md#_snippet_12

LANGUAGE: sh CODE:

./metrics.py ./.lancedb/gist1m.lance results-gist.csv -i 256 -p 120 -q ./.lancedb/gist_query.lance -k 1

TITLE: Run SIFT-1M Benchmark and Store Results DESCRIPTION: Executes the benchmark for SIFT-1M using metrics.py, querying the indexed dataset with specified parameters like number of results to fetch (-k) and query vectors (-q). The results, including mean query time and recall@1, are saved to a CSV file.

SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/sift/README.md#_snippet_7

LANGUAGE: sh CODE:

./metrics.py ./.lancedb/sift1m.lance results-sift.csv -i 256 -p 16 -q ./.lancedb/sift_query.lance -k 1

TITLE: Object Store General Configuration Options DESCRIPTION: Details configuration parameters applicable to all object stores, including network, security, and retry settings. These options control connection behavior, certificate validation, timeouts, user agents, proxy usage, and client-side retry logic.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/object_store.md#_snippet_2

LANGUAGE: APIDOC CODE:

Key: allow_http
Description: Allow non-TLS, i.e. non-HTTPS connections. Default, False.

Key: download_retry_count
Description: Number of times to retry a download. Default, 3. This limit is applied when the HTTP request succeeds but the response is not fully downloaded, typically due to a violation of request_timeout.

Key: allow_invalid_certificates
Description: Skip certificate validation on https connections. Default, False. Warning: This is insecure and should only be used for testing.

Key: connect_timeout
Description: Timeout for only the connect phase of a Client. Default, 5s.

Key: request_timeout
Description: Timeout for the entire request, from connection until the response body has finished. Default, 30s.

Key: user_agent
Description: User agent string to use in requests.

Key: proxy_url
Description: URL of a proxy server to use for requests. Default, None.

Key: proxy_ca_certificate
Description: PEM-formatted CA certificate for proxy connections

Key: proxy_excludes
Description: List of hosts that bypass proxy. This is a comma separated list of domains and IP masks. Any subdomain of the provided domain will be bypassed. For example, example.com, 192.168.1.0/24 would bypass https://api.example.com, https://www.example.com, and any IP in the range 192.168.1.0/24.

Key: client_max_retries
Description: Number of times for a s3 client to retry the request. Default, 10.

Key: client_retry_timeout
Description: Timeout for a s3 client to retry the request in seconds. Default, 180.
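
As a hedged illustration (the bucket path and the chosen values below are placeholders, not recommendations), these general keys can be passed through the same storage_options parameter used elsewhere in this guide:

import lance

ds = lance.dataset(
    "s3://bucket/path",
    storage_options={
        "allow_http": "true",        # permit a non-TLS endpoint (testing only)
        "connect_timeout": "10s",    # connect phase of the client
        "request_timeout": "60s",    # entire request, including the response body
        "client_max_retries": "5",   # client-side retry attempts
    },
)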

TITLE: Import necessary libraries for CLIP model training DESCRIPTION: This snippet imports essential Python libraries like cv2, lance, numpy, torch, timm, and transformers, which are required for building and training a multi-modal CLIP model. It also includes utility libraries such as itertools and tqdm, and a warning filter.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/clip_training.md#_snippet_0

LANGUAGE: python CODE:

import cv2
import lance

import numpy as np

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms

import timm
from transformers import AutoModel, AutoTokenizer

import itertools
from tqdm import tqdm

import warnings
warnings.simplefilter('ignore')

TITLE: Build user dictionary with Lindera DESCRIPTION: This command demonstrates how to build a custom user dictionary using 'lindera build'. It takes a CSV file as input and creates a new user dictionary, which can be used to extend the base language model.

SOURCE: https://github.com/lancedb/lance/blob/main/python/python/tests/models/lindera/README.md#_snippet_2

LANGUAGE: bash CODE:

lindera build --build-user-dictionary --dictionary-kind=ipadic user_dict/userdict.csv user_dict2

TITLE: Google Cloud Storage Configuration Keys DESCRIPTION: Reference for configuration keys available for Google Cloud Storage when used with LanceDB. These keys can be set as environment variables or within the storage_options parameter.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/object_store.md#_snippet_8

LANGUAGE: APIDOC CODE:

Key / Environment Variable | Description
--------------------------|------------
google_service_account / service_account | Path to the service account JSON file.
google_service_account_key / service_account_key | The serialized service account key.
google_application_credentials / application_credentials | Path to the application credentials.

TITLE: Load YouTube transcription dataset DESCRIPTION: Downloads and loads the 'jamescalam/youtube-transcriptions' dataset from Hugging Face datasets. The 'train' split is specified to retrieve the main training portion of the data.

SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/youtube_transcript_search.ipynb#_snippet_1

LANGUAGE: python CODE:

from datasets import load_dataset

data = load_dataset('jamescalam/youtube-transcriptions', split='train')
data

TITLE: Index Lance Data for Benchmarking DESCRIPTION: This command runs the index.py script to build an index on the generated Lance dataset. It configures the index with an L2 metric, 2048 partitions, and 96 sub-vectors for optimized benchmarking.

SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/wiki/README.md#_snippet_1

LANGUAGE: bash CODE:

python index.py --metric L2 --num-partitions 2048 --num-sub-vectors 96

TITLE: Jieba User Dictionary Configuration File (config.json) DESCRIPTION: JSON configuration for Jieba user dictionaries. This file, named config.json, specifies an optional 'main' dictionary and an array of paths to additional 'users' dictionary files. It should be placed in the model's root directory.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/tokenizer.md#_snippet_3

LANGUAGE: json CODE:

{
    "main": "dict.txt",
    "users": ["path/to/user/dict.txt"]
}

TITLE: Batch and generate embeddings using OpenAI API DESCRIPTION: Configures the OpenAI API key and defines a to_batches helper function for processing data in chunks. It then uses this to generate embeddings for the contextualized text in batches, improving efficiency and adhering to API best practices by reducing individual API calls.

SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/youtube_transcript_search.ipynb#_snippet_7

LANGUAGE: python CODE:

from tqdm.auto import tqdm
import math

openai.api_key = "sk-..."

# We request in batches rather than 1 embedding at a time
def to_batches(arr, batch_size):
    length = len(arr)
    def _chunker(arr):
        for start_i in range(0, len(arr), batch_size):
            yield arr[start_i:start_i+batch_size]
    # add progress meter
    yield from tqdm(_chunker(arr), total=math.ceil(length / batch_size))
    
batch_size = 1000
batches = to_batches(df.text.values.tolist(), batch_size)
embeds = [emb for c in batches for emb in rate_limited(c)]

TITLE: Download Jieba Language Model DESCRIPTION: Command-line instruction to download the Jieba language model for use with LanceDB. The model will be automatically stored in the default Jieba model directory within the configured language model home.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/tokenizer.md#_snippet_1

LANGUAGE: bash CODE:

python -m lance.download jieba

TITLE: Read and Inspect Lance Dataset in Rust DESCRIPTION: This Rust function read_dataset shows how to open an existing Lance dataset from a given path. It uses a scanner to create a batch_stream and then iterates through each RecordBatch, printing its number of rows, columns, schema, and the entire batch content.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/rust/write_read_dataset.md#_snippet_1

LANGUAGE: Rust CODE:

// Reads dataset from the given path and prints batch size, schema for all record batches. Also extracts and prints a slice from the first batch
async fn read_dataset(data_path: &str) {
    let dataset = Dataset::open(data_path).await.unwrap();
    let scanner = dataset.scan();

    let mut batch_stream = scanner.try_into_stream().await.unwrap().map(|b| b.unwrap());

    while let Some(batch) = batch_stream.next().await {
        println!("Batch size: {}, {}", batch.num_rows(), batch.num_columns()); // print size of batch
        println!("Schema: {:?}", batch.schema()); // print schema of recordbatch

        println!("Batch: {:?}", batch); // print the entire recordbatch (schema and data)
    }
} // End read dataset

TITLE: Define configuration class for CLIP model hyperparameters DESCRIPTION: This Python class, Config, centralizes all hyperparameters necessary for training the CLIP model. It includes image and text dimensions, learning rates for different components, batch size, maximum sequence length, projection dimensions, temperature, number of epochs, and the names of the image and text encoder models.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/clip_training.md#_snippet_1

LANGUAGE: python CODE:

class Config:
    img_size = (128, 128)
    bs = 32
    head_lr = 1e-3
    img_enc_lr = 1e-4
    text_enc_lr = 1e-5
    max_len = 18
    img_embed_dim = 2048
    text_embed_dim = 768
    projection_dim = 256
    temperature = 1.0
    num_epochs = 2
    img_encoder_model = 'resnet50'
    text_encoder_model = 'bert-base-cased'

TITLE: LanceDB External Manifest Store Reader Operations DESCRIPTION: Explains the reader's load process when an external manifest store is in use, including retrieving the manifest path, reattempting synchronization if needed, and ensuring the dataset remains portable.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/format/format.md#_snippet_13

LANGUAGE: APIDOC CODE:

External Store Reader Load Process:
  1. GET_EXTERNAL_STORE base_uri, version, path
     - Action: Retrieve manifest path from external store.
     - Condition: If path does not end in UUID, return path.
  2. COPY_OBJECT_STORE mydataset.lance/_versions/{version}.manifest-{uuid} mydataset.lance/_versions/{version}.manifest
     - Action: Reattempt synchronization (copy staged to final).
  3. PUT_EXTERNAL_STORE base_uri, version, mydataset.lance/_versions/{version}.manifest
     - Action: Update external store to point to final manifest.
  4. RETURN mydataset.lance/_versions/{version}.manifest
     - Action: Always return the finalized path.
     - Error: Return error if synchronization fails.
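
A rough pseudocode sketch of that load path (the external_store and object_store objects and their methods are hypothetical stand-ins, not the actual Lance interfaces):

def load_manifest_path(external_store, object_store, base_uri, version):
    # Step 1: ask the external store where the manifest for this version lives.
    path = external_store.get(base_uri, version)
    if path.endswith(".manifest"):
        return path  # already finalized
    # Path still ends in a UUID: step 2, finish the interrupted synchronization.
    final = f"{base_uri}/_versions/{version}.manifest"
    object_store.copy(path, final)
    # Step 3: point the external store at the finalized manifest.
    external_store.put(base_uri, version, final)
    # Step 4: always return the finalized path.
    return final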

TITLE: Generate Text-to-Image 10M Dataset in Lance Format DESCRIPTION: This snippet demonstrates how to create the 'text2image-10m' dataset in Lance format using scripts from the 'big-ann-benchmarks' repository. Upon execution, it generates two Lance datasets: a base dataset and a corresponding queries/ground truth dataset, essential for benchmarking.

SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/bigann/README.md#_snippet_1

LANGUAGE: bash CODE:

python ./big-ann-benchmarks/create_dataset.py --dataset yfcc-10M
./dataset.py -t text2image-10m data/text2image1B

TITLE: Run Flat Index Search Benchmark DESCRIPTION: Executes the benchmark script to measure performance of flat index search. This command generates benchmark.csv for raw data and benchmark.html for latency plots.

SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/flat/README.md#_snippet_0

LANGUAGE: Shell CODE:

./benchmark.py

TITLE: PyTorch Model Training Loop with LanceDB DataLoader DESCRIPTION: This snippet illustrates a complete PyTorch training loop. It begins by defining a LanceDataset and LanceSampler to efficiently load data, then sets up a DataLoader. The code proceeds to initialize a PyTorch model and an AdamW optimizer. The core of the snippet is the epoch-based training loop, which includes iterating through batches, performing forward and backward passes, calculating loss, updating model parameters, and reporting training perplexity.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/llm_training.md#_snippet_4

LANGUAGE: python CODE:

dataset = LanceDataset(dataset_path, block_size)
sampler = LanceSampler(dataset, block_size)
dataloader = DataLoader(
    dataset,
    shuffle=False,
    batch_size=batch_size,
    sampler=sampler,
    pin_memory=True
)

# Define the optimizer, training loop and train the model!
model = model.to(device)
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

for epoch in range(nb_epochs):
    print(f"========= Epoch: {epoch+1} / {nb_epochs} ========")
    epoch_loss = []
    prog_bar = tqdm(dataloader, total=len(dataloader))
    for batch in prog_bar:
        optimizer.zero_grad(set_to_none=True)

        # Put both input_ids and labels to the device
        for k, v in batch.items():
            batch[k] = v.to(device)

        # Perform one forward pass and get the loss
        outputs = model(**batch)
        loss = outputs.loss

        # Perform backward pass
        loss.backward()
        optimizer.step()

        prog_bar.set_description(f"loss: {loss.item():.4f}")

        epoch_loss.append(loss.item())

    # Calculate training perplexity for this epoch
    try:
        perplexity = np.exp(np.mean(epoch_loss))
    except OverflowError:
        perplexity = float("-inf")

    print(f"train_perplexity: {perplexity}")

TITLE: Create PyArrow RecordBatchReader from Processed Samples (Python) DESCRIPTION: This code creates a PyArrow RecordBatchReader, which acts as an iterator over the data generated by the 'process_samples' function. It uses the defined schema to ensure data consistency and prepares the stream of record batches for writing to a Lance dataset.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/llm_dataset_creation.md#_snippet_4

LANGUAGE: python CODE:

reader = pa.RecordBatchReader.from_batches(
    schema, 
    process_samples(dataset, num_samples=500_000, field='text') # For 500K samples
)

TITLE: Download and Extract GIST-1M Dataset DESCRIPTION: Downloads the GIST-1M dataset archive from the specified FTP server and extracts its contents. This is a prerequisite for generating Lance datasets for GIST-1M.

SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/sift/README.md#_snippet_8

LANGUAGE: sh CODE:

wget ftp://ftp.irisa.fr/local/texmex/corpus/gist.tar.gz
tar -xzf gist.tar.gz

TITLE: Create a Lance Dataset from Arrow RecordBatches in Rust DESCRIPTION: Demonstrates how to write a collection of Arrow RecordBatches and an Arrow Schema into a new Lance Dataset. It uses default write parameters and an iterator for the batches.

SOURCE: https://github.com/lancedb/lance/blob/main/rust/lance/README.md#_snippet_1

LANGUAGE: rust CODE:

use lance::{dataset::WriteParams, Dataset};

let write_params = WriteParams::default();
let mut reader = RecordBatchIterator::new(
    batches.into_iter().map(Ok),
    schema
);
Dataset::write(reader, &uri, Some(write_params)).await.unwrap();

TITLE: Create TensorFlow Dataset from Lance URI DESCRIPTION: This snippet demonstrates how to initialize a tf.data.Dataset directly from a Lance dataset URI using lance.tf.data.from_lance. It also shows how to chain standard TensorFlow dataset operations like shuffling and mapping for data preprocessing.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/integrations/tensorflow.md#_snippet_0

LANGUAGE: python CODE:

import tensorflow as tf
import lance

# Create tf dataset
ds = lance.tf.data.from_lance("s3://my-bucket/my-dataset")

# Chain tf dataset with other tf primitives

for batch in ds.shuffle(32).map(lambda x: tf.io.decode_png(x["image"])):
    print(batch)

TITLE: Write PyArrow Record Batches to Lance Dataset in Python DESCRIPTION: This Python code demonstrates how to write PyArrow Record Batches to a Lance dataset. It creates a RecordBatchReader from the defined schema and the output of the process function, then uses lance.write_dataset to efficiently persist the data to a file named 'flickr8k.lance' on disk.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/flickr8k_dataset_creation.md#_snippet_5

LANGUAGE: python CODE:

reader = pa.RecordBatchReader.from_batches(schema, process(captions))
lance.write_dataset(reader, "flickr8k.lance", schema)

TITLE: Implement PyTorch CLIP Model Training Loop DESCRIPTION: This code defines the core training loop for a CLIP model. It sets all model components to training mode, iterates through epochs and batches from the DataLoader, performs forward and backward passes, calculates loss, and updates model weights using an optimizer. A progress bar provides real-time feedback on the training process.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/clip_training.md#_snippet_9

LANGUAGE: python CODE:

img_encoder.train()
img_head.train()
text_encoder.train()
text_head.train()

for epoch in range(Config.num_epochs):
    print(f"{'='*20} Epoch: {epoch+1} / {Config.num_epochs} {'='*20}")

    prog_bar = tqdm(dataloader)
    for img, caption in prog_bar:
        optimizer.zero_grad(set_to_none=True)

        img_embed, text_embed = forward(img, caption)
        loss = loss_fn(img_embed, text_embed, temperature=Config.temperature).mean()

        loss.backward()
        optimizer.step()

        prog_bar.set_description(f"loss: {loss.item():.4f}")
    print()

TITLE: Build Ipadic language model with Lindera DESCRIPTION: This command uses the 'lindera build' tool to compile the Ipadic dictionary. It specifies the dictionary kind as 'ipadic' and points to the extracted model directory to create the main dictionary.

SOURCE: https://github.com/lancedb/lance/blob/main/python/python/tests/models/lindera/README.md#_snippet_1

LANGUAGE: bash CODE:

lindera build --dictionary-kind=ipadic mecab-ipadic-2.7.0-20070801 main

TITLE: Write Lance Dataset in Rust DESCRIPTION: This Rust function write_dataset demonstrates how to create and write a Lance dataset to a specified path. It defines a schema with UInt32 fields, creates a RecordBatch with sample data, and uses WriteParams to set the write mode to Overwrite before writing the dataset to disk.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/rust/write_read_dataset.md#_snippet_0

LANGUAGE: Rust CODE:

// Writes sample dataset to the given path
async fn write_dataset(data_path: &str) {
    // Define new schema
    let schema = Arc::new(Schema::new(vec![
        Field::new("key", DataType::UInt32, false),
        Field::new("value", DataType::UInt32, false),
    ]));

    // Create new record batches
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![
            Arc::new(UInt32Array::from(vec![1, 2, 3, 4, 5, 6])),
            Arc::new(UInt32Array::from(vec![6, 7, 8, 9, 10, 11])),
        ],
    )
    .unwrap();

    let batches = RecordBatchIterator::new([Ok(batch)], schema.clone());

    // Define write parameters (e.g. overwrite dataset)
    let write_params = WriteParams {
        mode: WriteMode::Overwrite,
        ..Default::default()
    };

    Dataset::write(batches, data_path, Some(write_params))
        .await
        .unwrap();
} // End write dataset

TITLE: Download and Extract SIFT-1M Dataset DESCRIPTION: Downloads the SIFT-1M dataset archive from the specified FTP server and extracts its contents. This is a prerequisite for generating Lance datasets for SIFT-1M.

SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/sift/README.md#_snippet_1

LANGUAGE: sh CODE:

wget ftp://ftp.irisa.fr/local/texmex/corpus/sift.tar.gz
tar -xzf sift.tar.gz

TITLE: Query Lance Dataset with DuckDB DESCRIPTION: Demonstrates querying a Lance dataset directly using DuckDB. It highlights the integration with DuckDB for SQL-based data exploration and retrieval, enabling powerful analytical queries.

SOURCE: https://github.com/lancedb/lance/blob/main/README.md#_snippet_5

LANGUAGE: python CODE:

import duckdb

# If this segfaults, make sure you have duckdb v0.7+ installed
duckdb.query("SELECT * FROM dataset LIMIT 10").to_df()

TITLE: Build LanceDB Rust JNI Module DESCRIPTION: Specifies the command to build only the Rust-based JNI (Java Native Interface) module of LanceDB. This is useful for developers focusing on the native components without rebuilding the entire Java project.

SOURCE: https://github.com/lancedb/lance/blob/main/java/README.md#_snippet_10

LANGUAGE: shell CODE:

cargo build

TITLE: Initialize Lance Dataset from Local Path DESCRIPTION: This Python snippet demonstrates how to initialize a Lance dataset object from a local file path. It sets up the dataset for subsequent read operations, enabling access to the data stored in the specified Lance file.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/read_and_write.md#_snippet_12

LANGUAGE: python CODE:

ds = lance.dataset("./imagenet.lance")

TITLE: Implement custom PyTorch Dataset for Lance-based CLIP training DESCRIPTION: This CLIPLanceDataset class extends PyTorch's Dataset to handle Lance datasets for CLIP model training. It initializes with a Lance dataset path, an optional tokenizer, and image transformations, providing methods to retrieve pre-processed images and tokenized captions for use in a DataLoader.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/clip_training.md#_snippet_3

LANGUAGE: python CODE:

class CLIPLanceDataset(Dataset):
    """Custom Dataset to load images and their corresponding captions"""
    def __init__(self, lance_path, max_len=18, tokenizer=None, transforms=None):
        self.ds = lance.dataset(lance_path)
        self.max_len = max_len
        # Init a new tokenizer if not specified already
        self.tokenizer = AutoTokenizer.from_pretrained('bert-base-cased') if not tokenizer else tokenizer
        self.transforms = transforms

    def __len__(self):
        return self.ds.count_rows()

    def __getitem__(self, idx):
        # Load the image and caption
        img = load_image(self.ds, idx)
        caption = load_caption(self.ds, idx)

        # Apply transformations to the images
        if self.transforms:
            img = self.transforms(img)

        # Tokenize the caption
        caption = self.tokenizer(
            caption,
            truncation=True,
            padding='max_length',
            max_length=self.max_len,
            return_tensors='pt'
        )
        # Flatten each component of tokenized caption otherwise they will cause size mismatch errors during training
        caption = {k: v.flatten() for k, v in caption.items()}

        return img, caption

TITLE: Azure Blob Storage Configuration Keys DESCRIPTION: Reference for configuration keys available for Azure Blob Storage when used with LanceDB. These keys can be set as environment variables or within the storage_options parameter.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/object_store.md#_snippet_10

LANGUAGE: APIDOC CODE:

Key / Environment Variable | Description
--------------------------|------------
azure_storage_account_name / account_name | The name of the azure storage account.
azure_storage_account_key / account_key | The serialized service account key.
azure_client_id / client_id | Service principal client id for authorizing requests.
azure_client_secret / client_secret | Service principal client secret for authorizing requests.
azure_tenant_id / tenant_id | Tenant id used in oauth flows.
azure_storage_sas_key / azure_storage_sas_token / sas_key / sas_token | Shared access signature. The signature is expected to be percent-encoded, much like they are provided in the azure storage explorer or azure portal.
azure_storage_token / bearer_token / token | Bearer token.
azure_storage_use_emulator / object_store_use_emulator / use_emulator | Use object store with azurite storage emulator.
azure_endpoint / endpoint | Override the endpoint used to communicate with blob storage.
azure_use_fabric_endpoint / use_fabric_endpoint | Use object store with url scheme account.dfs.fabric.microsoft.com.
azure_msi_endpoint / azure_identity_endpoint / identity_endpoint / msi_endpoint | Endpoint to request a imds managed identity token.
azure_object_id / object_id | Object id for use with managed identity authentication.
azure_msi_resource_id / msi_resource_id | Msi resource id for use with managed identity authentication.
azure_federated_token_file / federated_token_file | File containing token for Azure AD workload identity federation.
azure_use_azure_cli / use_azure_cli | Use azure cli for acquiring access token.
azure_disable_tagging / disable_tagging | Disables tagging objects. This can be desirable if not supported by the backing store.
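
For example (an assumed configuration mirroring the GCS and S3 examples elsewhere in this guide; the URI scheme, account name, and key are placeholders), these keys can be supplied via storage_options:

import lance

ds = lance.dataset(
    "az://my-container/my-dataset",
    storage_options={
        "account_name": "some-account",
        "account_key": "<storage-account-key>",
    },
)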

TITLE: Define Function to Process and Tokenize Samples for Lance (Python) DESCRIPTION: This function iterates over a dataset, tokenizes individual samples using the 'tokenize' function, and yields PyArrow RecordBatches. It processes a specified number of samples, skipping empty ones, and is designed to efficiently prepare data for writing to a Lance dataset.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/llm_dataset_creation.md#_snippet_2

LANGUAGE: python CODE:

def process_samples(dataset, num_samples=100_000, field='text'):
    current_sample = 0
    for sample in tqdm(dataset, total=num_samples):
        # If we have added all 5M samples, stop
        if current_sample == num_samples:
            break
        if not sample[field]:
            continue
        # Tokenize the current sample
        tokenized_sample = tokenize(sample, field)
        # Increment the counter
        current_sample += 1
        # Yield a PyArrow RecordBatch
        yield pa.RecordBatch.from_arrays(
            [tokenized_sample], 
            names=["input_ids"]
        )

TITLE: Read a Lance Dataset and Collect RecordBatches in Rust DESCRIPTION: Opens an existing Lance Dataset from a specified path, scans its content, and collects all resulting RecordBatches into a vector. Error handling is included.

SOURCE: https://github.com/lancedb/lance/blob/main/rust/lance/README.md#_snippet_2

LANGUAGE: rust CODE:

let dataset = Dataset::open(path).await.unwrap();
let mut scanner = dataset.scan();
let batches: Vec<RecordBatch> = scanner
    .try_into_stream()
    .await
    .unwrap()
    .map(|b| b.unwrap())
    .collect::<Vec<RecordBatch>>()
    .await;

TITLE: Visualize Latency vs. NProbes with IVF and PQ DESCRIPTION: This snippet generates a scatter plot using seaborn to visualize the relationship between 'nprobes' and '50%' (median response time). It uses 'ivf' for color encoding and 'pq' for marker style, allowing for a multi-dimensional analysis of performance.

SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/sift/Results.ipynb#_snippet_7

LANGUAGE: python CODE:

sns.scatterplot(data=df, x="nprobes", y="50%", hue="ivf", style="pq")

TITLE: Write HuggingFace Dataset to Lance Format DESCRIPTION: This Python code snippet demonstrates how to load a HuggingFace dataset and write it into the Lance format. It uses the datasets library to load a specific split of a dataset and then lance.write_dataset to save it as a Lance file. Dependencies include datasets and lance.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/integrations/huggingface.md#_snippet_0

LANGUAGE: python CODE:

import datasets # pip install datasets
import lance

lance.write_dataset(datasets.load_dataset(
    "poloclub/diffusiondb", split="train[:10]"
), "diffusiondb_train.lance")

TITLE: Describe Median Latency by PQ Configuration DESCRIPTION: This snippet groups the DataFrame by the 'pq' column and calculates descriptive statistics for the '50%' (median response time) column. This provides insights into latency performance based on different Product Quantization (PQ) configurations.

SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/sift/Results.ipynb#_snippet_4

LANGUAGE: python CODE:

df.groupby("pq")["50%"].describe()

TITLE: Check number of generated contexts DESCRIPTION: Prints the total number of contextualized entries created after processing the dataset. This helps verify the output of the contextualize function and understand the volume of data prepared for embedding.

SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/youtube_transcript_search.ipynb#_snippet_5

LANGUAGE: python CODE:

len(df)

TITLE: Convert HuggingFace Dataset to LanceDB DESCRIPTION: This snippet demonstrates how to load a dataset from HuggingFace and convert it into a Lance dataset using lance.write_dataset. This is a foundational step for preparing data to be used with LanceDB's PyTorch integration.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/integrations/pytorch.md#_snippet_0

LANGUAGE: python CODE:

import datasets # pip install datasets
import lance

hf_ds = datasets.load_dataset(
    "poloclub/diffusiondb",
    split="train",
    # name="2m_first_1k",  # for a smaller subset of the dataset
)
lance.write_dataset(hf_ds, "diffusiondb_train.lance")

TITLE: Build IVF_PQ Vector Index on Lance Dataset DESCRIPTION: Creates an IVF_PQ (Inverted File Index with Product Quantization) index on the 'vector' column of the Lance dataset. This index significantly speeds up nearest neighbor searches by efficiently partitioning and quantizing the vector space.

SOURCE: https://github.com/lancedb/lance/blob/main/README.md#_snippet_8

LANGUAGE: python CODE:

sift1m.create_index("vector",
                    index_type="IVF_PQ",
                    num_partitions=256,  # IVF
                    num_sub_vectors=16)  # PQ

TITLE: LanceDB S3 Storage Options Reference DESCRIPTION: Reference for available keys in the storage_options parameter for S3 and S3-compatible storage configurations in LanceDB. These options can be set via environment variables or directly in the storage_options dictionary, controlling aspects like region, endpoint, and encryption.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/object_store.md#_snippet_4

LANGUAGE: APIDOC CODE:

S3 Storage Options:
- aws_region / region: The AWS region the bucket is in. This can be automatically detected when using AWS S3, but must be specified for S3-compatible stores.
- aws_access_key_id / access_key_id: The AWS access key ID to use.
- aws_secret_access_key / secret_access_key: The AWS secret access key to use.
- aws_session_token / session_token: The AWS session token to use.
- aws_endpoint / endpoint: The endpoint to use for S3-compatible stores.
- aws_virtual_hosted_style_request / virtual_hosted_style_request: Whether to use virtual hosted-style requests, where bucket name is part of the endpoint. Meant to be used with `aws_endpoint`. Default, `False`.
- aws_s3_express / s3_express: Whether to use S3 Express One Zone endpoints. Default, `False`. See more details below.
- aws_server_side_encryption: The server-side encryption algorithm to use. Must be one of `"AES256"`, `"aws:kms"`, or `"aws:kms:dsse"`. Default, `None`.
- aws_sse_kms_key_id: The KMS key ID to use for server-side encryption. If set, `aws_server_side_encryption` must be `"aws:kms"` or `"aws:kms:dsse"`.
- aws_sse_bucket_key_enabled: Whether to use bucket keys for server-side encryption.
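
As an illustrative sketch (bucket name and KMS key ID are placeholders), the server-side encryption keys above combine as follows in storage_options:

import lance

ds = lance.dataset(
    "s3://my-bucket/my-dataset",
    storage_options={
        "aws_server_side_encryption": "aws:kms",
        "aws_sse_kms_key_id": "<kms-key-id>",
        "aws_sse_bucket_key_enabled": "true",
    },
)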

TITLE: Define OpenAI embedding function with rate limiting and retry DESCRIPTION: Sets up an embedding function using OpenAI's text-embedding-ada-002 model. It incorporates ratelimiter to respect API rate limits and retry for robust API calls, ensuring successful embedding generation even with transient network issues.

SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/youtube_transcript_search.ipynb#_snippet_6

LANGUAGE: python CODE:

import functools
import openai
import ratelimiter
from retry import retry

embed_model = "text-embedding-ada-002"

# API limit at 60/min == 1/sec
limiter = ratelimiter.RateLimiter(max_calls=0.9, period=1.0)

# Get the embedding with retry
@retry(tries=10, delay=1, max_delay=30, backoff=3, jitter=1)
def embed_func(c):    
    rs = openai.Embedding.create(input=c, engine=embed_model)
    return [record["embedding"] for record in rs["data"]]

rate_limited = limiter(embed_func)

TITLE: Add Lance SDK Java Maven Dependency DESCRIPTION: This snippet provides the Maven XML configuration required to include the LanceDB Java SDK as a dependency in your project. It specifies the groupId, artifactId, and version for the lance-core library, enabling access to LanceDB functionalities.

SOURCE: https://github.com/lancedb/lance/blob/main/java/README.md#_snippet_0

LANGUAGE: XML CODE:

<dependency>
    <groupId>com.lancedb</groupId>
    <artifactId>lance-core</artifactId>
    <version>0.18.0</version>
</dependency>

TITLE: Define PyArrow Schema for Lance Dataset (Python) DESCRIPTION: This snippet defines a PyArrow schema required by Lance to understand the structure of the data being written. It specifies that the dataset will contain a single field named 'input_ids', which will store tokenized data as 64-bit integers.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/llm_dataset_creation.md#_snippet_3

LANGUAGE: python CODE:

schema = pa.schema([
    pa.field("input_ids", pa.int64())
])

TITLE: Add Columns to LanceDB Dataset in Java DESCRIPTION: Demonstrates how to add new columns to a LanceDB dataset. This can be done either by providing SQL expressions to derive new column values or by defining a new Arrow Schema for the dataset, allowing for flexible schema evolution.

SOURCE: https://github.com/lancedb/lance/blob/main/java/README.md#_snippet_6

LANGUAGE: java CODE:

void addColumnsByExpressions() {
    String datasetPath = ""; // specify a path point to a dataset
    try (BufferAllocator allocator = new RootAllocator()) {
        try (Dataset dataset = Dataset.open(datasetPath, allocator)) {
            SqlExpressions sqlExpressions = new SqlExpressions.Builder().withExpression("double_id", "id * 2").build();
            dataset.addColumns(sqlExpressions, Optional.empty());
        }
    }
}

LANGUAGE: java CODE:

void addColumnsBySchema() {
  String datasetPath = ""; // specify a path point to a dataset
  try (BufferAllocator allocator = new RootAllocator()) {
    try (Dataset dataset = Dataset.open(datasetPath, allocator)) {
      dataset.addColumns(new Schema(
          Arrays.asList(
              Field.nullable("id", new ArrowType.Int(32, true)),
              Field.nullable("name", new ArrowType.Utf8()),
              Field.nullable("age", new ArrowType.Int(32, true)))), Optional.empty());
    }
  }
}

TITLE: Write Processed Data to Lance Dataset (Python) DESCRIPTION: This final step uses the 'lance.write_dataset' function to persist the processed and tokenized data to disk as a Lance dataset. It takes the RecordBatchReader, the desired output file path, and the defined schema as arguments, completing the dataset creation process.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/llm_dataset_creation.md#_snippet_5

LANGUAGE: python CODE:

# Write the dataset to disk
lance.write_dataset(
    reader, 
    "wikitext_500K.lance",
    schema
)

TITLE: Create a Vector Index on a Lance Dataset in Rust DESCRIPTION: Demonstrates how to create a vector index on a specified column (e.g., 'embeddings') within a Lance Dataset. It configures vector index parameters like the number of partitions and sub-vectors, noting potential alignment requirements.

SOURCE: https://github.com/lancedb/lance/blob/main/rust/lance/README.md#_snippet_4

LANGUAGE: rust CODE:

use ::lance::index::vector::VectorIndexParams;

let mut params = VectorIndexParams::default();
params.num_partitions = 256;
params.num_sub_vectors = 16;

// this will Err if list_size(embeddings) / num_sub_vectors does not meet simd alignment
dataset.create_index(&["embeddings"], IndexType::Vector, None, &params, true).await;

TITLE: Load Query Data with Pandas DESCRIPTION: This snippet imports the pandas library and loads query performance data from a CSV file named 'query.csv' into a DataFrame. This DataFrame will be used for subsequent analysis and visualization.

SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/sift/Results.ipynb#_snippet_0

LANGUAGE: python CODE:

import pandas as pd
df = pd.read_csv("query.csv")

TITLE: Query Lance Dataset with Pandas DESCRIPTION: Illustrates how to convert a Lance dataset to a PyArrow Table and then to a Pandas DataFrame for easy data manipulation and analysis using familiar Pandas operations.

SOURCE: https://github.com/lancedb/lance/blob/main/README.md#_snippet_4

LANGUAGE: python CODE:

df = dataset.to_table().to_pandas()
df

TITLE: Lance Manifest Protobuf Message Reference DESCRIPTION: References the Protobuf message definition for the Manifest file, which encapsulates the metadata for a specific version of a Lance dataset.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/format/format.md#_snippet_1

LANGUAGE: APIDOC CODE:

proto.message.Manifest

TITLE: Define Tokenization Function (Python) DESCRIPTION: This function takes a single sample from a Hugging Face dataset and a specified field name (e.g., 'text'). It uses the pre-initialized tokenizer to convert the text content of that field into 'input_ids', which are numerical representations of tokens.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/llm_dataset_creation.md#_snippet_1

LANGUAGE: python CODE:

def tokenize(sample, field='text'):
    return tokenizer(sample[field])['input_ids']

TITLE: Implement CLIP Loss Function and Forward Pass Utilities DESCRIPTION: This snippet provides utility functions for training a CLIP model. The loss_fn calculates the contrastive loss between image and text embeddings based on the CLIP paper, using logits, image similarity, and text similarity. The forward function performs a single forward pass, moving inputs to the GPU, and obtaining image and text embeddings using the defined encoder and head modules.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/clip_training.md#_snippet_6

LANGUAGE: python CODE:

def loss_fn(img_embed, text_embed, temperature=0.2):
    """
    https://arxiv.org/abs/2103.00020/
    """
    # Calculate logits, image similarity and text similarity
    logits = (text_embed @ img_embed.T) / temperature
    img_sim = img_embed @ img_embed.T
    text_sim = text_embed @ text_embed.T
    # Calculate targets by taking the softmax of the similarities
    targets = F.softmax(
        (img_sim + text_sim) / 2 * temperature, dim=-1
    )
    img_loss = (-targets.T * nn.LogSoftmax(dim=-1)(logits.T)).sum(1)
    text_loss = (-targets * nn.LogSoftmax(dim=-1)(logits)).sum(1)
    return (img_loss + text_loss) / 2.0

def forward(img, caption):
    # Transfer to device
    img = img.to('cuda')
    for k, v in caption.items():
        caption[k] = v.to('cuda')

    # Get embeddings for both img and caption
    img_embed = img_head(img_encoder(img))
    text_embed = text_head(text_encoder(caption))

    return img_embed, text_embed

TITLE: Read Data from Lance Dataset DESCRIPTION: Shows how to open and read a Lance dataset from a specified URI. It asserts that the returned object is a PyArrow Dataset, confirming seamless integration with the Apache Arrow ecosystem.

SOURCE: https://github.com/lancedb/lance/blob/main/README.md#_snippet_3

LANGUAGE: python CODE:

dataset = lance.dataset("/tmp/test.lance")
assert isinstance(dataset, pa.dataset.Dataset)

TITLE: Globally Set Object Store Timeout (Bash) DESCRIPTION: Demonstrates how to set a global timeout for object store operations using an environment variable. This configuration applies to all subsequent Lance operations that interact with object storage.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/object_store.md#_snippet_0

LANGUAGE: bash CODE:

export TIMEOUT=60s

TITLE: Lance File Format Version Details DESCRIPTION: This table provides a comprehensive overview of the Lance file format versions, including their compatibility, features, and stability status. It details the breaking changes, new functionalities introduced in each version, and aliases for common use cases.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/format/format.md#_snippet_5

LANGUAGE: APIDOC CODE:

Version: 0.1
  Minimal Lance Version: Any
  Maximum Lance Version: Any
  Description: This is the initial Lance format.

Version: 2.0
  Minimal Lance Version: 0.16.0
  Maximum Lance Version: Any
  Description: Rework of the Lance file format that removed row groups and introduced null support for lists, fixed size lists, and primitives

Version: 2.1 (unstable)
  Minimal Lance Version: None
  Maximum Lance Version: Any
  Description: Enhances integer and string compression, adds support for nulls in struct fields, and improves random access performance with nested fields.

Version: legacy
  Minimal Lance Version: N/A
  Maximum Lance Version: N/A
  Description: Alias for 0.1

Version: stable
  Minimal Lance Version: N/A
  Maximum Lance Version: N/A
  Description: Alias for the latest stable version (currently 2.0)

Version: next
  Minimal Lance Version: N/A
  Maximum Lance Version: N/A
  Description: Alias for the latest unstable version (currently 2.1)
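
For example, a format version or alias can be selected when writing a dataset; the data_storage_version parameter name below is assumed from the Python API and may differ between releases:

import lance
import pyarrow as pa

table = pa.table({"id": [1, 2, 3]})

# "stable" currently resolves to 2.0; "next" would select the unstable 2.1 format.
lance.write_dataset(table, "versioned_demo.lance", data_storage_version="stable")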

TITLE: Connect LanceDB to S3-Compatible Stores (e.g., MinIO) DESCRIPTION: Illustrates how to configure LanceDB to connect to S3-compatible storage solutions like MinIO. This requires specifying both the region and endpoint within the storage_options parameter to direct LanceDB to the custom S3 endpoint, enabling connectivity beyond AWS S3.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/object_store.md#_snippet_5

LANGUAGE: python CODE:

import lance
ds = lance.dataset(
    "s3://bucket/path",
    storage_options={
        "region": "us-east-1",
        "endpoint": "http://minio:9000",
    }
)

TITLE: Load and parse Flickr8k token file annotations DESCRIPTION: This code reads the 'Flickr8k.token.txt' file, which contains image annotations. It then processes each line to extract the image file name, a unique caption number, and the caption text itself, storing them as structured tuples for further processing.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/flickr8k_dataset_creation.md#_snippet_1

LANGUAGE: python CODE:

with open(captions, "r") as fl:
    annotations = fl.readlines()

# Converts the annotations where each element of this list is a tuple consisting of image file name, caption number and caption itself
annotations = list(map(lambda x: tuple([*x.split('\t')[0].split('#'), x.split('\t')[1]]), annotations))

TITLE: Lance File Footer and Overall Layout Specification DESCRIPTION: Provides a detailed byte-level specification of the .lance file format, including the arrangement of data pages, column metadata, offset tables, and the final footer. It outlines alignment requirements and the structure of various fields.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/format/format.md#_snippet_4

LANGUAGE: APIDOC CODE:

// Note: the number of buffers (BN) is independent of the number of columns (CN)
//       and pages.
//
//       Buffers often need to be aligned.  64-byte alignment is common when
//       working with SIMD operations.  4096-byte alignment is common when
//       working with direct I/O.  In order to ensure these buffers are aligned
//       writers may need to insert padding before the buffers.
//       
//       If direct I/O is required then most (but not all) fields described
//       below must be sector aligned.  We have marked these fields with an
//       asterisk for clarity.  Readers should assume there will be optional
//       padding inserted before these fields.
//
//       All footer fields are unsigned integers written with  little endian
//       byte order.
//
// (layout diagram omitted: it depicted the arrangement of data pages, column metadata,
//  offset tables, and the final footer; the original ASCII art did not survive extraction)

TITLE: LanceDB Conflict Resolution Process DESCRIPTION: Outlines the commit process in LanceDB, detailing how writers handle concurrent modifications, create transaction files for conflict detection, and retry commits after checking for compatibility with successful writes.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/format/format.md#_snippet_11

LANGUAGE: APIDOC CODE:

Commit Process:
  1. Writer finishes writing all data files.
  2. Writer creates a transaction file in _transactions directory.
     - Purpose: detect conflicts, re-build manifest during retries.
  3. Check for new commits since writer started.
     - If conflicts detected (via transaction files), abort commit.
  4. Build manifest and attempt to commit to next version.
     - If commit fails due to concurrent write, go back to step 3.

Conflict Detection:
  - Conservative approach: assume conflict if transaction file is missing or has unknown operation.
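
A simplified sketch of that commit loop (the store object and its methods are hypothetical stand-ins for the real machinery, not Lance internals):

def commit_with_retries(store, transaction, max_retries=10):
    # Step 2: record intent in the _transactions directory for conflict detection.
    store.write_transaction_file(transaction)
    for _ in range(max_retries):
        # Step 3: look for commits that landed after this writer started.
        for other in store.transactions_since(transaction.read_version):
            if store.conflicts(transaction, other):
                raise RuntimeError("conflicting concurrent write; aborting commit")
        # Step 4: build the manifest and attempt to claim the next version.
        manifest = store.build_manifest(transaction)
        if store.try_commit_next_version(manifest):
            return manifest
        # Another writer won the race; loop back to re-check conflicts and retry.
    raise RuntimeError("commit did not succeed after retries")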

TITLE: Lance Dataset Directory Structure DESCRIPTION: Illustrates the typical organization of a Lance dataset within a directory, detailing the location of data files, version manifests, secondary indices, and deletion files.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/format/format.md#_snippet_0

LANGUAGE: plaintext CODE:

/path/to/dataset:
    data/*.lance  -- Data directory
    _versions/*.manifest -- Manifest file for each dataset version.
    _indices/{UUID-*}/index.idx -- Secondary index, each index per directory.
    _deletions/*.{arrow,bin} -- Deletion files, which contain ids of rows
      that have been deleted.
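
To see this layout on disk, a minimal sketch (the dataset path below is a placeholder) writes a small dataset and walks the resulting directory:

import os

import lance
import pyarrow as pa

lance.write_dataset(pa.table({"x": [1, 2, 3]}), "layout_demo.lance")

for root, _dirs, files in os.walk("layout_demo.lance"):
    for name in files:
        # Expect data/*.lance files and _versions/*.manifest entries.
        print(os.path.join(root, name))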

TITLE: Define PyArrow Schema for Lance Dataset in Python DESCRIPTION: This Python code defines a PyArrow schema for the Lance dataset. It specifies the data types for image_id (string), image (binary), and captions (list of strings), ensuring proper data structure and type consistency for the dataset when written to Lance.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/flickr8k_dataset_creation.md#_snippet_4

LANGUAGE: python CODE:

schema = pa.schema([
    pa.field("image_id", pa.string()),
    pa.field("image", pa.binary()),
    pa.field("captions", pa.list_(pa.string(), -1)),
])

TITLE: Define image augmentations for CLIP model training DESCRIPTION: This snippet defines a torchvision.transforms.Compose object for basic image augmentations applied during CLIP model training. It includes converting images to tensors, resizing them to a consistent shape, and normalizing pixel values to stabilize the training process.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/clip_training.md#_snippet_4

LANGUAGE: python CODE:

train_augments = transforms.Compose(
    [
        transforms.ToTensor(),
        transforms.Resize(Config.img_size),
        transforms.Normalize([0.5], [0.5]),
    ]
)

TITLE: Generate GIST-1M Database Vectors Lance Dataset DESCRIPTION: Uses the datagen.py script to convert GIST-1M base vectors into a Lance dataset. This dataset will serve as the primary data source for indexing and querying in the GIST-1M benchmark.

SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/sift/README.md#_snippet_9

LANGUAGE: sh CODE:

./datagen.py ./gist/gist_base.fvecs ./.lancedb/gist1m.lance -g 1024 -m 50000 -d 960

TITLE: Set Object Store Timeout for a Single Dataset (Python) DESCRIPTION: Shows how to specify storage options, such as a timeout, for a specific Lance dataset using the storage_options parameter in lance.dataset. This allows fine-grained control over individual dataset configurations without affecting global settings.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/object_store.md#_snippet_1

LANGUAGE: python CODE:

import lance
ds = lance.dataset("s3://path", storage_options={"timeout": "60s"})

TITLE: Connect LanceDB to Google Cloud Storage DESCRIPTION: This Python snippet demonstrates how to connect a LanceDB dataset to Google Cloud Storage using storage_options to specify service account credentials. It provides an alternative to setting the GOOGLE_SERVICE_ACCOUNT environment variable.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/object_store.md#_snippet_7

LANGUAGE: python CODE:

import lance
ds = lance.dataset(
    "gs://my-bucket/my-dataset",
    storage_options={
        "service_account": "path/to/service-account.json",
    }
)

TITLE: Read and Write Lance Data with Ray and Pandas DESCRIPTION: This snippet demonstrates how to write data to the Lance format using Ray's data sink (ray.data.Dataset.write_lance) and subsequently read it back using both the native Lance API (lance.dataset) and Ray's data source (ray.data.read_lance). It includes assertions to verify data integrity after read/write operations.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/integrations/ray.md#_snippet_0

LANGUAGE: python CODE:

import lance
import ray
import pandas as pd

ray.init()

data = [
    {"id": 1, "name": "alice"},
    {"id": 2, "name": "bob"},
    {"id": 3, "name": "charlie"}
]
ray.data.from_items(data).write_lance("./alice_bob_and_charlie.lance")

# It can be read via lance directly
df = (
    lance.
    dataset("./alice_bob_and_charlie.lance")
    .to_table()
    .to_pandas()
    .sort_values(by=["id"])
    .reset_index(drop=True)
)
assert df.equals(pd.DataFrame(data)), "{} != {}".format(
    df, pd.DataFrame(data)
)

# Or via Ray.data.read_lance
ray_df = (
    ray.data.read_lance("./alice_bob_and_charlie.lance")
    .to_pandas()
    .sort_values(by=["id"])
    .reset_index(drop=True)
)
assert df.equals(ray_df)

TITLE: Load PyTorch Model State Dictionary from Lance Dataset (Python) DESCRIPTION: This function reads all model weights from a specified Lance dataset file and constructs an OrderedDict suitable as a PyTorch model state dictionary. It iterates through each weight, converting it using _load_weight, and places it on the specified device. This function assumes all weights can fit into memory; large models may cause memory errors.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/artifact_management.md#_snippet_5

LANGUAGE: python CODE:

def _load_state_dict(file_name: str, version: int = 1, map_location=None) -> OrderedDict:
    """Reads the model weights from lance file and returns a model state dict
    If the model weights are too large, this function will fail with a memory error.

    Args:
        file_name (str): Lance model name
        version (int): Version of the model to load
        map_location (str): Device to load the model on

    Returns:
        OrderedDict: Model state dict
    """
    ds = lance.dataset(file_name, version=version)
    weights = ds.take([x for x in range(ds.count_rows())]).to_pylist()
    state_dict = OrderedDict()

    for weight in weights:
        state_dict[weight["name"]] = _load_weight(weight).to(map_location)

    return state_dict

TITLE: Load Data Chunk from Lance Dataset by Indices DESCRIPTION: This utility function, from_indices, efficiently loads specific elements from a Lance dataset based on a list of provided indices. It takes a Lance dataset object and a list of integer indices, then retrieves the corresponding rows. The function processes these rows to extract only the 'input_ids' from each, returning them as a list of token IDs, which is crucial for preparing data chunks.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/llm_training.md#_snippet_1

LANGUAGE: python CODE:

def from_indices(dataset, indices):
    """Load the elements on given indices from the dataset"""
    chunk = dataset.take(indices).to_pylist()
    chunk = list(map(lambda x: x['input_ids'], chunk))
    return chunk

TITLE: Run LanceDB Vector Search Recall Test DESCRIPTION: Defines run_test, a comprehensive function for evaluating LanceDB's vector search recall. It generates ground truth, writes data to a temporary LanceDB dataset, creates an IVF_PQ index, and performs nearest neighbor queries with varying nprobes and refine_factor to calculate recall for both in-sample and out-of-sample queries.

SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/full_report/report.ipynb#_snippet_2

LANGUAGE: python CODE:

def run_test(
    data,
    query,
    metric,
    num_partitions=256,
    num_sub_vectors=8,
    nprobes_list=[1, 2, 5, 10, 16],
    refine_factor_list=[1, 2, 5, 10, 20],
):
    in_sample = data[random.sample(range(data.shape[0]), 1000), :]
    # ground truth
    print("generating gt")

    gt = knn(query, data, metric, 10)
    gt_in_sample = knn(in_sample, data, metric, 10)

    print("generated gt")
    
    with tempfile.TemporaryDirectory() as d:
        write_lance(d, data)
        ds = lance.dataset(d)

        for q, target in zip(tqdm(in_sample, desc="checking brute force"), gt_in_sample):
            res = ds.to_table(nearest={
                "column": "vec",
                "q": q,
                "k": 10,
                "metric": metric,
            }, columns=["id"])
            assert len(np.intersect1d(res["id"].to_numpy(), target)) == 10
    
        ds = ds.create_index("vec", "IVF_PQ", metric=metric, num_partitions=num_partitions, num_sub_vectors=num_sub_vectors)
    
        recall_data = []
        for nprobes in nprobes_list:
            for refine_factor in refine_factor_list:
                hits = 0
                # measure out-of-sample recall against the IVF_PQ index
                for q, target in zip(tqdm(query, desc=f"out of sample, nprobes={nprobes}, refine={refine_factor}"), gt):
                    res = ds.to_table(nearest={
                        "column": "vec",
                        "q": q,
                        "k": 10,
                        "nprobes": nprobes,
                        "refine_factor": refine_factor,
                    }, columns=["id"])["id"].to_numpy()
                    hits += len(np.intersect1d(res, target))
                recall_data.append([
                    "out_of_sample",
                    nprobes,
                    refine_factor,
                    hits / 10 / len(gt),
                ])
                hits = 0
                # measure in-sample recall against the IVF_PQ index
                for q, target in zip(tqdm(in_sample, desc=f"in sample nprobes={nprobes}, refine={refine_factor}"), gt_in_sample):
                    res = ds.to_table(nearest={
                        "column": "vec",
                        "q": q,
                        "k": 10,
                        "nprobes": nprobes,
                        "refine_factor": refine_factor,
                    }, columns=["id"])["id"].to_numpy()
                    hits += len(np.intersect1d(res, target))
                recall_data.append([
                    "in_sample",
                    nprobes,
                    refine_factor,
                    hits / 10 / len(gt_in_sample),
                ])
    return recall_data

TITLE: Stream PyArrow RecordBatches to Lance Dataset DESCRIPTION: Shows how to write a Lance dataset from an iterator of pyarrow.RecordBatch objects. This method is ideal for large datasets that cannot be fully loaded into memory, requiring a pyarrow.Schema to be provided.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/read_and_write.md#_snippet_1

LANGUAGE: python CODE:

import lance
import pyarrow as pa
from typing import Iterator

def producer() -> Iterator[pa.RecordBatch]:
    """An iterator of RecordBatches."""
    yield pa.RecordBatch.from_pylist([{"name": "Alice", "age": 20}])
    yield pa.RecordBatch.from_pylist([{"name": "Bob", "age": 30}])

schema = pa.schema([
    ("name", pa.string()),
    ("age", pa.int32()),
])

ds = lance.write_dataset(producer(),
                         "./alice_and_bob.lance",
                         schema=schema, mode="overwrite")
print(ds.count_rows())  # Output: 2

TITLE: LanceDB External Manifest Store Commit Operations DESCRIPTION: Details the four-step commit process when using an external manifest store for concurrent writes in LanceDB, involving staging manifests, committing paths to the external store, and finalizing the manifest in object storage.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/format/format.md#_snippet_12

LANGUAGE: APIDOC CODE:

External Store Commit Process:
  1. PUT_OBJECT_STORE mydataset.lance/_versions/{version}.manifest-{uuid}
     - Action: Stage new manifest in object store under unique path.
  2. PUT_EXTERNAL_STORE base_uri, version, mydataset.lance/_versions/{version}.manifest-{uuid}
     - Action: Commit staged manifest path to external KV store.
     - Note: Commit is effectively complete after this step.
  3. COPY_OBJECT_STORE mydataset.lance/_versions/{version}.manifest-{uuid} mydataset.lance/_versions/{version}.manifest
     - Action: Copy staged manifest to final path.
  4. PUT_EXTERNAL_STORE base_uri, version, mydataset.lance/_versions/{version}.manifest
     - Action: Update external store to point to final manifest.
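
For illustration only, the four steps above can be sketched in Python against a hypothetical object store and key-value store client; commit_with_external_store, object_store, and kv_store are placeholders and not part of the Lance API:

import uuid

def commit_with_external_store(object_store, kv_store, base_uri, version, manifest_bytes):
    staged = f"{base_uri}/_versions/{version}.manifest-{uuid.uuid4()}"
    final = f"{base_uri}/_versions/{version}.manifest"

    # 1. Stage the new manifest under a unique path in the object store.
    object_store.put(staged, manifest_bytes)

    # 2. Commit the staged path to the external KV store.
    #    The commit is effectively complete once this succeeds.
    kv_store.put((base_uri, version), staged)

    # 3. Copy the staged manifest to its final path.
    object_store.copy(staged, final)

    # 4. Update the external store to point at the finalized manifest path.
    kv_store.put((base_uri, version), final)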

TITLE: Write PyArrow Table to Lance Dataset DESCRIPTION: Demonstrates how to write a pyarrow.Table to a Lance dataset using the lance.write_dataset function. This is suitable for datasets that can be fully loaded into memory.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/read_and_write.md#_snippet_0

LANGUAGE: python CODE:

import lance
import pyarrow as pa

table = pa.Table.from_pylist([{"name": "Alice", "age": 20},
                              {"name": "Bob", "age": 30}])
ds = lance.write_dataset(table, "./alice_and_bob.lance")

TITLE: Lance DataFragment Protobuf Message Reference DESCRIPTION: References the Protobuf message definition for DataFragment, which represents a logical chunk of data within a Lance dataset. It can include one or more DataFiles and an optional DeletionFile.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/format/format.md#_snippet_2

LANGUAGE: APIDOC CODE:

proto.message.DataFragment
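
As a rough illustration of what a DataFragment represents, the fragments of an existing dataset can be inspected from Python; this is a sketch, and the exact attribute names may vary between pylance versions:

import lance

ds = lance.dataset("./alice_and_bob.lance")
for frag in ds.get_fragments():
    # Each fragment is a logical chunk of the dataset backed by one or more
    # data files, plus an optional deletion file.
    print(frag.fragment_id, frag.count_rows())
    print(frag.metadata)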

TITLE: Import Libraries for LanceDB Vector Search Testing DESCRIPTION: Imports necessary Python libraries for numerical operations (numpy), temporary file handling (tempfile), data manipulation (pandas), plotting (seaborn, matplotlib), and LanceDB specific functionalities (lance, _lib). These imports provide the foundational tools for the vector search tests.

SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/full_report/report.ipynb#_snippet_1

LANGUAGE: python CODE:

from _lib import knn, write_lance, _get_nyt_vectors

import numpy as np
import tempfile
import random
import lance
import pandas as pd
import seaborn as sns

from matplotlib import pyplot as plt
from tqdm.auto import tqdm

TITLE: Generate SIFT-1M Database Vectors Lance Dataset DESCRIPTION: Uses the datagen.py script to convert SIFT-1M base vectors into a Lance dataset. This dataset will serve as the primary data source for indexing and querying in the SIFT-1M benchmark.

SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/sift/README.md#_snippet_3

LANGUAGE: sh CODE:

./datagen.py ./sift/sift_base.fvecs ./.lancedb/sift1m.lance -d 128

TITLE: Compact LanceDB Dataset Files with Python DESCRIPTION: This Python code demonstrates how to compact data files within a LanceDB dataset using the compact_files method. It specifies a target_rows_per_fragment to optimize file count and can remove soft-deleted rows, improving query performance. Note that compaction creates a new table version and invalidates old row addresses for indexing.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/read_and_write.md#_snippet_21

LANGUAGE: python CODE:

import lance

dataset = lance.dataset("./alice_and_bob.lance")
dataset.optimize.compact_files(target_rows_per_fragment=1024 * 1024)
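
Because compaction writes a new table version, older versions can be removed afterwards to reclaim space. A minimal follow-on sketch; the one-week threshold is an arbitrary assumption:

from datetime import timedelta

# Remove table versions older than a week that are no longer needed.
dataset.cleanup_old_versions(older_than=timedelta(days=7))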

TITLE: Prepare PyTorch Model State Dict for LanceDB Saving DESCRIPTION: This utility function processes a PyTorch model's state_dict, iterating through each parameter. It flattens the parameter's tensor, extracts its name and original shape, and then packages this information into a PyArrow RecordBatch. This prepares the model weights for efficient storage in a LanceDB dataset.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/artifact_management.md#_snippet_2

LANGUAGE: python CODE:

def _save_model_writer(state_dict):
    """Yields a RecordBatch for each parameter in the model state dict"""
    for param_name, param in state_dict.items():
        param_shape = list(param.size())
        param_value = param.flatten().tolist()
        yield pa.RecordBatch.from_arrays(
            [
                pa.array(
                    [param_name],
                    pa.string(),
                ),
                pa.array(
                    [param_value],
                    pa.list_(pa.float64(), -1),
                ),
                pa.array(
                    [param_shape],
                    pa.list_(pa.int64(), -1),
                ),
            ],
            ["name", "value", "shape"],
        )

TITLE: Create PyTorch DataLoader from LanceDataset (Safe) DESCRIPTION: This snippet demonstrates how to create a multiprocessing-safe PyTorch DataLoader using SafeLanceDataset and get_safe_loader. It explicitly uses the 'spawn' method to avoid fork-safety issues that can arise when LanceDB's internal multithreading interacts with Python's multiprocessing.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/integrations/pytorch.md#_snippet_2

LANGUAGE: python CODE:

from lance.torch.data import SafeLanceDataset, get_safe_loader

dataset = SafeLanceDataset(temp_lance_dataset)
# use spawn method to avoid fork-safe issue
loader = get_safe_loader(
    dataset,
    num_workers=2,
    batch_size=16,
    drop_last=False,
)

total_samples = 0
for batch in loader:
    total_samples += batch["id"].shape[0]

TITLE: Generate SIFT-1M Ground Truth Lance Dataset DESCRIPTION: Generates a ground truth Lance dataset for SIFT-1M using the gt.py script. This dataset is essential for evaluating the recall of the benchmark queries against known correct results.

SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/sift/README.md#_snippet_4

LANGUAGE: sh CODE:

./gt.py ./.lancedb/sift1m.lance -o ./.lancedb/ground_truth.lance

TITLE: Lindera User Dictionary Configuration File (config.yml) DESCRIPTION: YAML configuration for Lindera, defining the segmenter mode and the path to the dictionary. This file, typically named config.yml, can be placed in the model's root directory or specified via the LINDERA_CONFIG_PATH environment variable. The kind field is not supported in LanceDB's context.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/tokenizer.md#_snippet_6

LANGUAGE: yaml CODE:

segmenter:
    mode: "normal"
    dictionary:
        # Note: in lance, the `kind` field is not supported. You need to specify the model path using the `path` field instead.
        path: /path/to/lindera/ipadic/main

TITLE: Test LanceDB Vector Search with Random Data (L2 Metric) DESCRIPTION: Demonstrates running the run_test function with randomly generated data (100,000 vectors, 64 dimensions) and queries, using the L2 (Euclidean) distance metric. It then visualizes the recall results using make_plot.

SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/full_report/report.ipynb#_snippet_4

LANGUAGE: python CODE:

# test randomly generated data
data = np.random.standard_normal((100000, 64))
query = np.random.standard_normal((1000, 64))

recall_data = run_test(
    data,
    query,
    "L2",
)

make_plot(recall_data)

TITLE: Lance ColumnMetadata Protobuf Message Reference DESCRIPTION: References the Protobuf message definition for ColumnMetadata, which is used to describe the encoding and properties of individual columns within a .lance file.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/format/format.md#_snippet_3

LANGUAGE: APIDOC CODE:

proto.message.ColumnMetadata

TITLE: Generate GIST-1M Query Vectors Lance Dataset DESCRIPTION: Converts GIST-1M query vectors into a Lance dataset using datagen.py. These vectors will be used to perform similarity searches against the indexed database during the benchmark.

SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/sift/README.md#_snippet_10

LANGUAGE: sh CODE:

./datagen.py ./gist/gist_query.fvecs ./.lancedb/gist_query.lance -g 1024 -m 50000 -d 960 -n 1000

TITLE: Test LanceDB Vector Search with Random Data (Cosine Metric) DESCRIPTION: Shows how to execute the run_test function using randomly generated data and queries, but this time employing the cosine similarity metric. The recall performance is subsequently plotted using make_plot.

SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/full_report/report.ipynb#_snippet_5

LANGUAGE: python CODE:

# test randomly generated data -- cosine
data = np.random.standard_normal((100000, 64))
query = np.random.standard_normal((1000, 64))

recall_data = run_test(
    data,
    query,
    "cosine",
)

make_plot(recall_data)

TITLE: Load PyTorch Model with Weights from Lance Dataset (Python) DESCRIPTION: This high-level function facilitates loading weights directly into a given PyTorch model from a Lance dataset. It internally calls _load_state_dict to retrieve the complete state dictionary and then applies it to the provided model instance. This simplifies the process of restoring a model's state from a Lance-backed storage.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/artifact_management.md#_snippet_6

LANGUAGE: python CODE:

def load_model(
    model: torch.nn.Module, file_name: str, version: int = 1, map_location=None
):
    """Loads the model weights from lance file and sets them to the model

    Args:
        model (torch.nn.Module): PyTorch model
        file_name (str): Lance model name
        version (int): Version of the model to load
        map_location (str): Device to load the model on
    """
    state_dict = _load_state_dict(file_name, version=version, map_location=map_location)
    model.load_state_dict(state_dict)

TITLE: Connect LanceDB to Azure Blob Storage DESCRIPTION: This Python snippet illustrates how to connect a LanceDB dataset to Azure Blob Storage. It shows how to pass account_name and account_key directly via storage_options, offering an alternative to environment variables.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/object_store.md#_snippet_9

LANGUAGE: python CODE:

import lance
ds = lance.dataset(
    "az://my-container/my-dataset",
    storage_options={
        "account_name": "some-account",
        "account_key": "some-key",
    }
)

TITLE: Default Lance Language Model Home Directory DESCRIPTION: This snippet illustrates the default directory path where LanceDB stores language models if the LANCE_LANGUAGE_MODEL_HOME environment variable is not explicitly set by the user.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/tokenizer.md#_snippet_0

LANGUAGE: bash CODE:

${system data directory}/lance/language_models
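
To override the default, the LANCE_LANGUAGE_MODEL_HOME environment variable can be set before the tokenizer is used; the path below is only an example:

import os

# Example only: point Lance at a custom language-model directory
os.environ["LANCE_LANGUAGE_MODEL_HOME"] = "/opt/lance/language_models"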

TITLE: Perform Random Row Access in Lance Dataset DESCRIPTION: This Python snippet demonstrates Lance's capability for fast random access to individual rows using the take() method. This feature is crucial for workflows like random sampling, shuffling in ML training, and building secondary indices for enhanced query performance.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/read_and_write.md#_snippet_20

LANGUAGE: python CODE:

data = ds.take([1, 100, 500], columns=["image", "label"])

TITLE: Configure AWS Credentials for LanceDB S3 Dataset DESCRIPTION: Demonstrates how to pass AWS access key ID, secret access key, and session token directly to the storage_options parameter when initializing a LanceDB dataset from an S3 path. This method provides explicit credential management for S3 access, overriding environment variables if set.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/object_store.md#_snippet_3

LANGUAGE: python CODE:

import lance
ds = lance.dataset(
    "s3://bucket/path",
    storage_options={
        "access_key_id": "my-access-key",
        "secret_access_key": "my-secret-key",
        "session_token": "my-session-token",
    }
)

TITLE: Create Scalar Index with Jieba Tokenizer in Python DESCRIPTION: Python code demonstrating how to create a scalar index on a 'text' field using the 'INVERTED' index type, specifying 'jieba/default' as the base tokenizer for text processing within LanceDB.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/tokenizer.md#_snippet_2

LANGUAGE: python CODE:

ds.create_scalar_index("text", "INVERTED", base_tokenizer="jieba/default")

TITLE: Add and Populate Columns with SQL Expressions in Lance DESCRIPTION: Illustrates adding and populating new columns in a Lance dataset using SQL expressions. This method allows defining column values based on existing columns or literal values, enabling data backfill within a single operation.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/data_evolution.md#_snippet_1

LANGUAGE: python CODE:

table = pa.table({"name": pa.array(["Alice", "Bob", "Carla"])})
dataset = lance.write_dataset(table, "names")
dataset.add_columns({
    "hash": "sha256(name)",
    "status": "'active'",
})
print(dataset.to_table().to_pandas())

TITLE: Perform Nearest Neighbor Vector Search on Lance Dataset DESCRIPTION: Demonstrates how to perform nearest neighbor searches on a Lance dataset with a vector index. It samples query vectors using DuckDB and then retrieves the top 10 similar vectors for each query using Lance's nearest functionality, showcasing its vector search capabilities.

SOURCE: https://github.com/lancedb/lance/blob/main/README.md#_snippet_9

LANGUAGE: python CODE:

# Get top 10 similar vectors
import duckdb

dataset = lance.dataset(uri)

# Sample 100 query vectors. If this segfaults, make sure you have duckdb v0.7+ installed
sample = duckdb.query("SELECT vector FROM dataset USING SAMPLE 100").to_df()
query_vectors = np.array([np.array(x) for x in sample.vector])

# Get nearest neighbors for all of them
rs = [dataset.to_table(nearest={"column": "vector", "k": 10, "q": q})
      for q in query_vectors]

TITLE: Convert Parquet to Lance Dataset DESCRIPTION: Demonstrates how to convert a Pandas DataFrame to a PyArrow Table, save it as a Parquet file, and then convert the Parquet dataset into a Lance dataset. This showcases Lance's compatibility with existing data formats and its ease of use for data migration.

SOURCE: https://github.com/lancedb/lance/blob/main/README.md#_snippet_2

LANGUAGE: python CODE:

import lance

import pandas as pd
import pyarrow as pa
import pyarrow.dataset

df = pd.DataFrame({"a": [5], "b": [10]})
uri = "/tmp/test.parquet"
tbl = pa.Table.from_pandas(df)
pa.dataset.write_dataset(tbl, uri, format='parquet')

parquet = pa.dataset.dataset(uri, format='parquet')
lance.write_dataset(parquet, "/tmp/test.lance")

TITLE: Define PyArrow Schema with Lance Encoding Metadata DESCRIPTION: This Python snippet demonstrates how to define a PyArrow schema for a LanceDB table, applying column-level encoding configurations. It shows how to use PyArrow field metadata to specify compression algorithms, compression levels, structural encoding strategies, and packed memory layout for string columns.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/format/format.md#_snippet_7

LANGUAGE: python CODE:

import pyarrow as pa

schema = pa.schema([
    pa.field(
        "compressible_strings",
        pa.string(),
        metadata={
            "lance-encoding:compression": "zstd",
            "lance-encoding:compression-level": "3",
            "lance-encoding:structural-encoding": "miniblock",
            "lance-encoding:packed": "true"
        }
    )
])

TITLE: Configure Seaborn Plot Style DESCRIPTION: This snippet imports the seaborn library and sets the default plot style to 'darkgrid'. This improves the visual aesthetics of subsequent plots generated using seaborn.

SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/sift/Results.ipynb#_snippet_1

LANGUAGE: python CODE:

import seaborn as sns
sns.set_style("darkgrid")

TITLE: Generate SIFT-1M Query Vectors Lance Dataset DESCRIPTION: Converts SIFT-1M query vectors into a Lance dataset using datagen.py. These vectors will be used to perform similarity searches against the indexed database during the benchmark.

SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/sift/README.md#_snippet_5

LANGUAGE: sh CODE:

./datagen.py ./sift/sift_query.fvecs ./.lancedb/sift_query.lance -d 128 -n 1000

TITLE: Convert SIFT1M Dataset to Lance for Vector Search DESCRIPTION: Loads the SIFT1M dataset from a binary file, converts its raw vector data into a NumPy array, and then transforms it into a Lance table using vec_to_table. The dataset is then written to a Lance file, optimized for vector search with specific row group and file size settings.

SOURCE: https://github.com/lancedb/lance/blob/main/README.md#_snippet_7

LANGUAGE: python CODE:

import lance
from lance.vector import vec_to_table
import numpy as np
import struct

nvecs = 1000000
ndims = 128
with open("sift/sift_base.fvecs", mode="rb") as fobj:
    buf = fobj.read()
    data = np.array(struct.unpack("<128000000f", buf[4 : 4 + 4 * nvecs * ndims])).reshape((nvecs, ndims))
    dd = dict(zip(range(nvecs), data))

table = vec_to_table(dd)
uri = "vec_data.lance"
sift1m = lance.write_dataset(table, uri, max_rows_per_group=8192, max_rows_per_file=1024*1024)

TITLE: Load Entire Lance Dataset into Memory DESCRIPTION: This Python snippet shows how to load an entire Lance dataset into an in-memory table using the to_table() method. This approach is straightforward and suitable for datasets that can comfortably fit within available memory.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/read_and_write.md#_snippet_13

LANGUAGE: python CODE:

table = ds.to_table()

TITLE: Lance SQL Type to Apache Arrow Type Mapping DESCRIPTION: This table provides a comprehensive mapping between SQL data types supported by Lance and their corresponding Apache Arrow data types. It details the internal storage format for various data representations, crucial for understanding data compatibility and performance.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/read_and_write.md#_snippet_19

LANGUAGE: APIDOC CODE:

| SQL type | Arrow type |
|----------|------------|
| `boolean` | `Boolean` |
| `tinyint` / `tinyint unsigned` | `Int8` / `UInt8` |
| `smallint` / `smallint unsigned` | `Int16` / `UInt16` |
| `int` or `integer` / `int unsigned` or `integer unsigned` | `Int32` / `UInt32` |
| `bigint` / `bigint unsigned` | `Int64` / `UInt64` |
| `float` | `Float32` |
| `double` | `Float64` |
| `decimal(precision, scale)` | `Decimal128` |
| `date` | `Date32` |
| `timestamp` | `Timestamp` (1) |
| `string` | `Utf8` |
| `binary` | `Binary` |
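
These SQL types appear primarily in filter expressions passed to scans. A minimal sketch, assuming a hypothetical dataset with name, age, and signup_date columns:

import lance

ds = lance.dataset("./people.lance")  # hypothetical dataset path
tbl = ds.to_table(
    columns=["name", "age"],
    filter="age > 25 AND name = 'Alice' AND signup_date >= date '2021-01-01'",
)
print(tbl.to_pandas())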

TITLE: Visualize LanceDB Vector Search Recall Heatmap DESCRIPTION: Defines make_plot, a utility function to visualize the recall data generated by run_test. It takes the recall data (a list of lists) and converts it into a pandas DataFrame, then uses seaborn to generate heatmaps showing recall across different nprobes and refine_factor values for various test cases.

SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/full_report/report.ipynb#_snippet_3

LANGUAGE: python CODE:

def make_plot(recall_data):
    df = pd.DataFrame(recall_data, columns=["case", "nprobes", "refine_factor", "recall"])
    
    num_cases = len(df["case"].unique())
    (fig, axs) = plt.subplots(1, 2, figsize=(16, 8))
    
    for case, ax in zip(df["case"].unique(), axs):
        current_case = df[df["case"] == case]
        sns.heatmap(
            current_case.drop(columns=["case"]).set_index(["nprobes", "refine_factor"])["recall"].unstack(),
            annot=True,
            ax=ax,
        ).set(title=f"Recall -- {case}")

TITLE: Count unique video titles in dataset DESCRIPTION: Converts the loaded dataset to a Pandas DataFrame and counts the number of unique video titles. This provides an overview of the diversity and scope of the video content within the dataset.

SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/youtube_transcript_search.ipynb#_snippet_2

LANGUAGE: python CODE:

data.to_pandas().title.nunique()

TITLE: Describe Median Latency by Refine Factor DESCRIPTION: This snippet groups the DataFrame by the 'refine_factor' column and calculates descriptive statistics for the '50%' (median response time) column. This provides an understanding of latency variations across different refinement factors.

SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/sift/Results.ipynb#_snippet_6

LANGUAGE: python CODE:

df.groupby("refine_factor")["50%"].describe()

TITLE: Utility functions to load images and captions from Lance dataset DESCRIPTION: These two Python functions, load_image and load_caption, facilitate loading data from a Lance dataset. load_image converts byte-formatted images to a usable image format using numpy and OpenCV, while load_caption extracts the longest caption associated with an image, assuming it contains the most descriptive information.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/clip_training.md#_snippet_2

LANGUAGE: python CODE:

def load_image(ds, idx):
    # Utility function to load an image at an index and convert it from bytes format to img format
    raw_img = ds.take([idx], columns=['image']).to_pydict()
    raw_img = np.frombuffer(b''.join(raw_img['image']), dtype=np.uint8)
    img = cv2.imdecode(raw_img, cv2.IMREAD_COLOR)
    img = cv2.cvtColor(img, cv2.COLOR_RGB2BGR)
    return img

def load_caption(ds, idx):
    # Utility function to load an image's caption. Currently we return the longest caption of all
    captions = ds.take([idx], columns=['captions']).to_pydict()['captions'][0]
    return max(captions, key=len)

TITLE: Save PyTorch Model Weights to LanceDB with Versioning DESCRIPTION: This function saves a PyTorch model's state_dict to a LanceDB file. It utilizes the _save_model_writer utility to format the data. The function supports both overwriting existing model weights or saving them as a new version within the Lance dataset, providing flexibility for model checkpoint management.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/artifact_management.md#_snippet_3

LANGUAGE: python CODE:

def save_model(state_dict: OrderedDict, file_name: str, version=False):
    """Saves a PyTorch model in lance file format

    Args:
        state_dict (OrderedDict): Model state dict
        file_name (str): Lance model name
        version (bool): Whether to save as a new version or overwrite the existing versions,
            if the lance file already exists
    """
    # Create a reader
    reader = pa.RecordBatchReader.from_batches(
        GLOBAL_SCHEMA, _save_model_writer(state_dict)
    )

    if os.path.exists(file_name):
        if version:
            # If we want versioning, we use the overwrite mode to create a new version
            lance.write_dataset(
                reader, file_name, schema=GLOBAL_SCHEMA, mode="overwrite"
            )
        else:
            # If we don't want versioning, we delete the existing file and write a new one
            shutil.rmtree(file_name)
            lance.write_dataset(reader, file_name, schema=GLOBAL_SCHEMA)
    else:
        # If the file doesn't exist, we write a new one
        lance.write_dataset(reader, file_name, schema=GLOBAL_SCHEMA)

TITLE: Protobuf Definition for Row ID Sequence Storage DESCRIPTION: This protobuf oneof field defines how row ID sequences are stored. Small sequences are stored directly as inline_sequence bytes to avoid I/O overhead, while large sequences are referenced via an external_file path to optimize storage and retrieval.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/format/format.md#_snippet_16

LANGUAGE: Protobuf CODE:

oneof row_id_sequence {
    // Inline sequence
    bytes inline_sequence = 1;
    // External file reference
    string external_file = 2;
} // row_id_sequence

TITLE: Drop Columns in LanceDB Dataset DESCRIPTION: This snippet demonstrates how to drop columns from a LanceDB dataset using the lance.LanceDataset.drop_columns method. This is a metadata-only operation, making it very fast. It also explains that physical data removal requires lance.dataset.DatasetOptimizer.compact_files() followed by lance.LanceDataset.cleanup_old_versions().

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/data_evolution.md#_snippet_4

LANGUAGE: python CODE:

table = pa.table({"id": pa.array([1, 2, 3]),
                 "name": pa.array(["Alice", "Bob", "Carla"])})
dataset = lance.write_dataset(table, "names", mode="overwrite")
dataset.drop_columns(["name"])
print(dataset.schema)
# id: int64

TITLE: Define CLIP Model Components (ImageEncoder, TextEncoder, Head) in PyTorch DESCRIPTION: This snippet defines the core neural network modules for a CLIP-like model. ImageEncoder uses a pre-trained vision model (e.g., ResNet) to convert images to feature vectors. TextEncoder uses a pre-trained language model (e.g., BERT) for text embeddings. The Head module projects these features into a common embedding space using linear layers, GELU activation, dropout, and layer normalization.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/clip_training.md#_snippet_5

LANGUAGE: python CODE:

class ImageEncoder(nn.Module):
    """Encodes the Image"""
    def __init__(self, model_name, pretrained = True):
        super().__init__()
        self.backbone = timm.create_model(
            model_name,
            pretrained=pretrained,
            num_classes=0,
            global_pool="avg"
        )

        for param in self.backbone.parameters():
            param.requires_grad = True

    def forward(self, img):
        return self.backbone(img)

class TextEncoder(nn.Module):
    """Encodes the Caption"""
    def __init__(self, model_name):
        super().__init__()

        self.backbone = AutoModel.from_pretrained(model_name)

        for param in self.backbone.parameters():
            param.requires_grad = True

    def forward(self, captions):
        output = self.backbone(**captions)
        return output.last_hidden_state[:, 0, :]

class Head(nn.Module):
    """Projects both into Embedding space"""
    def __init__(self, embedding_dim, projection_dim):
        super().__init__()
        self.projection = nn.Linear(embedding_dim, projection_dim)
        self.gelu = nn.GELU()
        self.fc = nn.Linear(projection_dim, projection_dim)

        self.dropout = nn.Dropout(0.3)
        self.layer_norm = nn.LayerNorm(projection_dim)

    def forward(self, x):
        projected = self.projection(x)
        x = self.gelu(projected)
        x = self.fc(x)
        x = self.dropout(x)
        x += projected

        return self.layer_norm(x)
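
A small smoke-test sketch of how these components fit together; the model name, dimensions, and dummy tensors are illustrative assumptions, not part of the original example:

import torch

# resnet18 with global average pooling yields 512-d features in timm.
img_encoder = ImageEncoder("resnet18", pretrained=False)
head = Head(embedding_dim=512, projection_dim=256)

dummy_images = torch.randn(2, 3, 224, 224)
features = img_encoder(dummy_images)   # shape: (2, 512)
embeddings = head(features)            # shape: (2, 256)
print(embeddings.shape)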

TITLE: Retrieve Specific Records from a Lance Dataset in Rust DESCRIPTION: Retrieves specific records from a Lance Dataset based on their indices and a projection. The result is a RecordBatch containing the requested data.

SOURCE: https://github.com/lancedb/lance/blob/main/rust/lance/README.md#_snippet_3

LANGUAGE: rust CODE:

let values: Result<RecordBatch> = dataset.take(&[200, 199, 39, 40, 100], &projection).await;

TITLE: Define PyArrow schema for storing PyTorch model weights in Lance DESCRIPTION: This snippet defines a pyarrow.Schema named GLOBAL_SCHEMA specifically designed for storing PyTorch model weights within the Lance file format. The schema includes three fields: 'name' (string) for the weight's identifier, 'value' (list of float64) for the flattened weight tensor, and 'shape' (list of int64) to preserve the original dimensions for reconstruction.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/artifact_management.md#_snippet_1

LANGUAGE: python CODE:

GLOBAL_SCHEMA = pa.schema(
    [
        pa.field("name", pa.string()),
        pa.field("value", pa.list_(pa.float64(), -1)),
        pa.field("shape", pa.list_(pa.int64(), -1)) # Is a list with variable shape because weights can have any number of dims
    ]
)

TITLE: Create Lance ImageURIArray from URI List DESCRIPTION: This snippet demonstrates how to initialize a lance.arrow.ImageURIArray from a list of image URIs. This array type is designed to store references to images in various storage systems (local, file, S3) for lazy loading, without validating or loading the images into memory immediately.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/arrays.md#_snippet_3

LANGUAGE: python CODE:

from lance.arrow import ImageURIArray

ImageURIArray.from_uris([
   "/tmp/image1.jpg",
   "file:///tmp/image2.jpg",
   "s3://example/image3.jpg"
])
# <lance.arrow.ImageURIArray object at 0x...>
# ['/tmp/image1.jpg', 'file:///tmp/image2.jpg', 's3://example/image3.jpg']

TITLE: Lance Execution Node Contract Definition DESCRIPTION: Defines the contract for various execution nodes within Lance's I/O execution plan, detailing their parameters, input schemas, and output schemas.

SOURCE: https://github.com/lancedb/lance/blob/main/wiki/I-O-Execution.md#_snippet_0

LANGUAGE: APIDOC CODE:

Execution Nodes:
  Scan:
    Parameters: dataset, projected columns
    Input Schema: N/A
    Output Schema: projected columns
  Filter:
    Parameters: input node, filter
    Input Schema: any
    Output Schema: input schema + columns in filters
  Take:
    Parameters: input node
    Input Schema: any, must have a "_rowid" column
    Output Schema: input schema minus _rowid
  KNNFlatExec:
    Parameters: input node, query
    Input Schema: any
    Output Schema: input schema + {"scores"}
  KNNIndexExec:
    Parameters: dataset
    Input Schema: N/A
    Output Schema: {"score", "_rowid"}

TITLE: Drop Columns from LanceDB Dataset in Java DESCRIPTION: Shows how to remove specified columns from a LanceDB dataset. This operation simplifies the dataset's schema by eliminating unnecessary fields.

SOURCE: https://github.com/lancedb/lance/blob/main/java/README.md#_snippet_8

LANGUAGE: java CODE:

void dropColumns() {
    String datasetPath = ""; // specify a path point to a dataset
    try (BufferAllocator allocator = new RootAllocator()) {
        try (Dataset dataset = Dataset.open(datasetPath, allocator)) {
            dataset.dropColumns(Collections.singletonList("name"));
        }
    }
}

TITLE: Describe Median Latency by IVF Index DESCRIPTION: This snippet groups the DataFrame by the 'ivf' column and calculates descriptive statistics (count, mean, std, min, max, quartiles) for the '50%' (median response time) column. This helps understand latency distribution across different IVF index configurations.

SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/sift/Results.ipynb#_snippet_3

LANGUAGE: python CODE:

df.groupby("ivf")["50%"].describe()

TITLE: Update Rows with Complex SQL Expressions DESCRIPTION: Shows how to update column values using complex SQL expressions that can reference existing columns, such as incrementing an age column by a fixed value.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/read_and_write.md#_snippet_5

LANGUAGE: python CODE:

import lance

dataset = lance.dataset("./alice_and_bob.lance")
dataset.update({"age": "age + 2"})

TITLE: Add Rows to Lance Dataset DESCRIPTION: Illustrates two methods for adding new rows to an existing Lance dataset: using the LanceDataset.insert method for direct insertion and using lance.write_dataset with mode="append" to append new data.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/read_and_write.md#_snippet_2

LANGUAGE: python CODE:

import lance
import pyarrow as pa

table = pa.Table.from_pylist([{"name": "Alice", "age": 20},
                              {"name": "Bob", "age": 30}])
ds = lance.write_dataset(table, "./insert_example.lance")

new_table = pa.Table.from_pylist([{"name": "Carla", "age": 37}])
ds.insert(new_table)
print(ds.to_table().to_pandas())
#     name  age
# 0  Alice   20
# 1    Bob   30
# 2  Carla   37

new_table2 = pa.Table.from_pylist([{"name": "David", "age": 42}])
ds = lance.write_dataset(new_table2, ds, mode="append")
print(ds.to_table().to_pandas())
#     name  age
# 0  Alice   20
# 1    Bob   30
# 2  Carla   37
# 3  David   42

TITLE: Bulk Update Rows in LanceDB Dataset using Merge Insert DESCRIPTION: Demonstrates how to efficiently replace existing rows in a LanceDB dataset with new data using merge_insert and when_matched_update_all(). This operation uses a key for matching rows, typically a unique identifier. Note that modified rows are re-inserted, changing their position to the end of the table.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/read_and_write.md#_snippet_7

LANGUAGE: python CODE:

import lance

dataset = lance.dataset("./alice_and_bob.lance")
print(dataset.to_table().to_pandas())
#     name  age
# 0  Alice   20
# 1    Bob   30

# Change the ages of both Alice and Bob
new_table = pa.Table.from_pylist([{"name": "Alice", "age": 2},
                                  {"name": "Bob", "age": 3}])
# This will use `name` as the key for matching rows.  Merge insert
# uses a JOIN internally and so you typically want this column to
# be a unique key or id of some kind.
rst = dataset.merge_insert("name") \
       .when_matched_update_all() \
       .execute(new_table)
print(dataset.to_table().to_pandas())
#     name  age
# 0  Alice    2
# 1    Bob    3

TITLE: Load Single Weight Tensor from Lance Dataset (Python) DESCRIPTION: This function converts a single weight entry, retrieved as a dictionary from a Lance dataset, into a PyTorch tensor. It reshapes the flattened 'value' array using the 'shape' information stored within the weight dictionary. The output is a torch.Tensor ready for further processing.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/artifact_management.md#_snippet_4

LANGUAGE: python CODE:

def _load_weight(weight: dict) -> torch.Tensor:
    """Converts a weight dict to a torch tensor"""
    return torch.tensor(weight["value"], dtype=torch.float64).reshape(weight["shape"])

TITLE: Perform Parallel Writes with lance.fragment.write_fragments DESCRIPTION: This code demonstrates how to write new data fragments in parallel across multiple workers using lance.fragment.write_fragments. Each worker generates its own set of fragments, which are then printed for verification. This is the first phase of a distributed write operation, preparing data for a later commit.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/distributed_write.md#_snippet_0

LANGUAGE: python CODE:

import json
import pyarrow as pa
from lance.fragment import write_fragments

# Run on each worker
data_uri = "./dist_write"
schema = pa.schema([
    ("a", pa.int32()),
    ("b", pa.string()),
])

# Run on worker 1
data1 = {
    "a": [1, 2, 3],
    "b": ["x", "y", "z"],
}
fragments_1 = write_fragments(data1, data_uri, schema=schema)
print("Worker 1: ", fragments_1)

# Run on worker 2
data2 = {
    "a": [4, 5, 6],
    "b": ["u", "v", "w"],
}
fragments_2 = write_fragments(data2, data_uri, schema=schema)
print("Worker 2: ", fragments_2)

TITLE: Drop Lance Dataset in Java DESCRIPTION: This Java code illustrates how to permanently delete a Lance dataset from the file system. It takes the dataset's path and uses the static Dataset.drop method to remove all associated files and metadata. This operation is irreversible and should be used with caution.

SOURCE: https://github.com/lancedb/lance/blob/main/java/README.md#_snippet_4

LANGUAGE: Java CODE:

void dropDataset() {
    String datasetPath = tempDir.resolve("drop_stream").toString();
    Dataset.drop(datasetPath, new HashMap<>());
}

TITLE: LanceDB Statistics Storage DESCRIPTION: Describes how statistics (null count, min, max) are stored within Lance files in a columnar format, enabling selective reading for query optimization.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/format/format.md#_snippet_14

LANGUAGE: APIDOC CODE:

Statistics Storage:
  - Location: Stored within Lance files.
  - Purpose: Determine which pages to skip during queries.
  - Data Points: null count, lower bound (min), upper bound (max).
  - Format: Lance's columnar format.
  - Benefit: Allows selective reading of relevant stats columns.

TITLE: Alter Columns in LanceDB Dataset in Java DESCRIPTION: Illustrates how to modify existing columns within a LanceDB dataset. This includes renaming a column, changing its nullability, or casting its data type to a new ArrowType, facilitating schema adjustments.

SOURCE: https://github.com/lancedb/lance/blob/main/java/README.md#_snippet_7

LANGUAGE: java CODE:

void alterColumns() {
    String datasetPath = ""; // specify a path point to a dataset
    try (BufferAllocator allocator = new RootAllocator()) {
        try (Dataset dataset = Dataset.open(datasetPath, allocator)) {
            ColumnAlteration nameColumnAlteration =
                    new ColumnAlteration.Builder("name")
                            .rename("new_name")
                            .nullable(true)
                            .castTo(new ArrowType.Utf8())
                            .build();

            dataset.alterColumns(Collections.singletonList(nameColumnAlteration));
        }
    }
}

TITLE: Group and sort captions by image ID DESCRIPTION: This section iterates through all unique image IDs found in the annotations. For each image, it collects all associated captions and sorts them based on their original annotation number, ensuring the correct order of captions for each image. The result is a list of tuples, each containing an image ID and a tuple of its ordered captions.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/flickr8k_dataset_creation.md#_snippet_2

LANGUAGE: python CODE:

captions = []
image_ids = set(ann[0] for ann in annotations)
for img_id in tqdm(image_ids):
    current_img_captions = []
    for ann_img_id, num, caption in annotations:
        if img_id == ann_img_id:
            current_img_captions.append((num, caption))
            
    # Sort by the annotation number
    current_img_captions.sort(key=lambda x: x[0])
    captions.append((img_id, tuple([x[1] for x in current_img_captions])))

TITLE: Create Scalar Index with Lindera Tokenizer in Python DESCRIPTION: Python code demonstrating how to create a scalar index on a 'text' field using the 'INVERTED' index type, specifying 'lindera/ipadic' as the base tokenizer for text processing within LanceDB.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/tokenizer.md#_snippet_5

LANGUAGE: python CODE:

ds.create_scalar_index("text", "INVERTED", base_tokenizer="lindera/ipadic")

TITLE: Create Pandas Series with Lance BFloat16 Dtype DESCRIPTION: This snippet demonstrates how to create a Pandas Series using the lance.bfloat16 custom dtype. It shows the initialization of a Series with floating-point numbers, which are then converted to the BFloat16 format, suitable for machine learning applications.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/arrays.md#_snippet_0

LANGUAGE: python CODE:

import pandas as pd

import lance.arrow

pd.Series([1.1, 2.1, 3.4], dtype="lance.bfloat16")
# 0    1.1015625
# 1      2.09375
# 2      3.40625
# dtype: lance.bfloat16

TITLE: Define Lance Schema with Blob Column in Python DESCRIPTION: This Python code demonstrates how to define a PyArrow schema for a Lance dataset, marking a large_binary column as a blob column by setting the lance-encoding:blob metadata to true. This configuration enables Lance to efficiently store and retrieve large binary objects.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/blob.md#_snippet_0

LANGUAGE: python CODE:

import pyarrow as pa

schema = pa.schema(
    [
        pa.field("id", pa.int64()),
        pa.field("video",
            pa.large_binary(),
            metadata={"lance-encoding:blob": "true"}
        ),
    ]
)
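
A minimal follow-on sketch that writes two rows with this schema; the payloads and path are placeholders, and depending on your Lance version a recent data storage version may be required for blob encoding:

import lance

table = pa.table(
    {
        "id": [1, 2],
        "video": [b"\x00" * 1024, b"\x01" * 2048],  # placeholder binary payloads
    },
    schema=schema,
)
ds = lance.write_dataset(table, "./videos.lance", schema=schema)
print(ds.count_rows())  # 2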

TITLE: Describe Median Latency by NProbes DESCRIPTION: This snippet groups the DataFrame by the 'nprobes' column and calculates descriptive statistics for the '50%' (median response time) column. This helps analyze how the number of probes affects median query latency.

SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/sift/Results.ipynb#_snippet_5

LANGUAGE: python CODE:

df.groupby("nprobes")["50%"].describe()

TITLE: Test LanceDB Vector Search with NYT TF-IDF Vectors (Cosine Metric) DESCRIPTION: Illustrates testing LanceDB's vector search with real-world data: sparse TF-IDF vectors from the New York Times dataset, projected to 256 dimensions. It uses the cosine similarity metric and custom index parameters (num_partitions=256, num_sub_vectors=32).

SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/full_report/report.ipynb#_snippet_6

LANGUAGE: python CODE:

# test NYT -- TF-IDF sparse vectors projected on to 256D dense -- cosine
data = _get_nyt_vectors()
data = data[np.linalg.norm(data, axis=1) != 0]
data = np.unique(data, axis=0)
query = np.random.standard_normal((100, 256))

recall_data = run_test(
    data,
    query,
    "cosine",
    num_partitions=256,
    num_sub_vectors=32,
)

make_plot(recall_data)

TITLE: Test LanceDB Vector Search with NYT TF-IDF Vectors (Normalized L2 Metric) DESCRIPTION: Presents a test case using the same NYT TF-IDF vectors, but normalized for L2 distance, effectively making it equivalent to cosine similarity on normalized vectors. It uses the L2 metric with specific index parameters (num_partitions=512, num_sub_vectors=32) and visualizes the recall.

SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/full_report/report.ipynb#_snippet_7

LANGUAGE: python CODE:

# test NYT -- TF-IDF sparse vectors projected on to 256D dense -- normalized L2
data = _get_nyt_vectors()
data = data[np.linalg.norm(data, axis=1) != 0]
data = np.unique(data, axis=0)
data /= np.linalg.norm(data, axis=1)[:, None]

# use the same out of sample query


recall_data = run_test(
    data,
    query,
    "L2",
    num_partitions=512,
    num_sub_vectors=32,
)

make_plot(recall_data)

TITLE: Update Rows in Lance Dataset by SQL Expression DESCRIPTION: Demonstrates how to update specific columns of rows in a Lance dataset using the lance.LanceDataset.update method. The update values are SQL expressions, allowing for direct value assignment.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/read_and_write.md#_snippet_4

LANGUAGE: python CODE:

import lance

dataset = lance.dataset("./alice_and_bob.lance")
dataset.update({"name": "'Bob'"}, where="name = 'Blob'")

TITLE: Iteratively Read Large Lance Dataset in Batches DESCRIPTION: This Python snippet demonstrates how to read a Lance dataset in batches, which is ideal for datasets too large to fit into memory. It uses to_batches() with column projection and filter push-down, allowing processing of data chunks iteratively.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/read_and_write.md#_snippet_15

LANGUAGE: python CODE:

for batch in ds.to_batches(columns=["image"], filter="label = 10"):
    # do something with batch
    compute_on_batch(batch)

TITLE: Perform Upsert Operation (Update or Insert) in LanceDB DESCRIPTION: Shows how to combine when_matched_update_all() and when_not_matched_insert_all() within merge_insert to achieve an 'upsert' behavior. This operation updates rows if they exist and inserts them if they do not, providing a flexible way to synchronize data.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/read_and_write.md#_snippet_9

LANGUAGE: python CODE:

import lance
import pyarrow as pa

# Change Carla's age and insert David
new_table = pa.Table.from_pylist([{"name": "Carla", "age": 27},
                                  {"name": "David", "age": 42}])

dataset = lance.dataset("./alice_and_bob.lance")

# This will update Carla and insert David
_ = dataset.merge_insert("name") \
       .when_matched_update_all() \
       .when_not_matched_insert_all() \
       .execute(new_table)
# Verify the results
print(dataset.to_table().to_pandas())
#     name  age
# 0  Alice   20
# 1    Bob   30
# 2  Carla   27
# 3  David   42

TITLE: Configure LanceDB for S3 Express One Zone Buckets DESCRIPTION: Shows how to explicitly configure LanceDB to access S3 Express One Zone (directory) buckets, especially when the bucket name is hidden by an access point or private link. This involves setting the region and s3_express flag in storage_options for direct access.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/object_store.md#_snippet_6

LANGUAGE: python CODE:

import lance
ds = lance.dataset(
    "s3://my-bucket--use1-az4--x-s3/path/imagenet.lance",
    storage_options={
        "region": "us-east-1",
        "s3_express": "true",
    }
)

TITLE: Add Schema-Only Columns to Lance Dataset DESCRIPTION: Demonstrates how to add new columns to a Lance dataset without populating them, using pyarrow.Field or pyarrow.Schema. This operation is metadata-only and very efficient, useful for lazy population.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/data_evolution.md#_snippet_0

LANGUAGE: python CODE:

table = pa.table({"id": pa.array([1, 2, 3])})
dataset = lance.write_dataset(table, "null_columns")

# With pyarrow Field
dataset.add_columns(pa.field("embedding", pa.list_(pa.float32(), 128)))
assert dataset.schema == pa.schema([
    ("id", pa.int64()),
    ("embedding", pa.list_(pa.float32(), 128)),
])

# With pyarrow Schema
dataset.add_columns(pa.schema([
    ("label", pa.string()),
    ("score", pa.float32()),
]))
assert dataset.schema == pa.schema([
    ("id", pa.int64()),
    ("embedding", pa.list_(pa.float32(), 128)),
    ("label", pa.string()),
    ("score", pa.float32()),
])

TITLE: Commit Collected Fragments to a Lance Dataset DESCRIPTION: After parallel writes, this snippet shows how to serialize fragment metadata from all workers, collect them on a single worker, and then commit them to a Lance dataset using lance.LanceOperation.Overwrite. It verifies the commit by reading the dataset and asserting its properties, demonstrating the final step of a distributed write.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/distributed_write.md#_snippet_1

LANGUAGE: python CODE:

import json
import lance
from lance import FragmentMetadata, LanceOperation

# Serialize Fragments into JSON data
fragments_json1 = [json.dumps(fragment.to_json()) for fragment in fragments_1]
fragments_json2 = [json.dumps(fragment.to_json()) for fragment in fragments_2]

# On one worker, collect all fragments
all_fragments = [FragmentMetadata.from_json(f) for f in \
    fragments_json1 + fragments_json2]

# Commit the fragments into a single dataset
# Use LanceOperation.Overwrite to overwrite the dataset or create new dataset.
op = lance.LanceOperation.Overwrite(schema, all_fragments)
read_version = 0 # Because it is empty at the time.
lance.LanceDataset.commit(
    data_uri,
    op,
    read_version=read_version,
)

# We can read the dataset using the Lance API:
dataset = lance.dataset(data_uri)
assert len(dataset.get_fragments()) == 2
assert dataset.version == 1
print(dataset.to_table().to_pandas())

TITLE: Merge Pre-computed Columns into Lance Dataset DESCRIPTION: Explains how to integrate pre-computed columns into an existing Lance dataset using the merge method. This approach avoids rewriting the entire dataset by joining new data based on a specified column, as demonstrated with an 'id' column.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/data_evolution.md#_snippet_3

LANGUAGE: python CODE:

table = pa.table({
   "id": pa.array([1, 2, 3]),
   "embedding": pa.array([np.array([1, 2, 3]), np.array([4, 5, 6]),
                          np.array([7, 8, 9])])
})
dataset = lance.write_dataset(table, "embeddings", mode="overwrite")

new_data = pa.table({
   "id": pa.array([1, 2, 3]),
   "label": pa.array(["horse", "rabbit", "cat"])
})
dataset.merge(new_data, "id")
print(dataset.to_table().to_pandas())

TITLE: SQL Filter Expression with Escaped Column Names DESCRIPTION: This SQL snippet shows how to handle column names that are SQL keywords or contain special characters (like spaces) by escaping them with backticks. It also demonstrates accessing nested fields with escaped names to ensure correct parsing.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/read_and_write.md#_snippet_17

LANGUAGE: sql CODE:

`CUBE` = 10 AND `column name with space` IS NOT NULL
  AND `nested with space`.`inner with space` < 2

TITLE: LanceDB Page-level Statistics Schema Definition DESCRIPTION: This schema defines the structure for storing page-level statistics for each field (column) within a Lance file. It includes the null count, minimum value, and maximum value for each field, typed according to the field's original data type. The schema is flexible, allowing for missing fields and future extensions.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/format/format.md#_snippet_15

LANGUAGE: APIDOC CODE:

<field_id_1>: struct
    null_count: i64
    min_value: <field_1_data_type>
    max_value: <field_1_data_type>
...
<field_id_N>: struct
    null_count: i64
    min_value: <field_N_data_type>
    max_value: <field_N_data_type>
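
For concreteness, the statistics layout for a dataset with an int64 "id" column and a string "name" column would look roughly like the PyArrow schema below; this is an illustration only, and in the real format the top-level names are numeric field ids:

import pyarrow as pa

stats_schema = pa.schema([
    pa.field("0", pa.struct([
        pa.field("null_count", pa.int64()),
        pa.field("min_value", pa.int64()),   # same type as the "id" column
        pa.field("max_value", pa.int64()),
    ])),
    pa.field("1", pa.struct([
        pa.field("null_count", pa.int64()),
        pa.field("min_value", pa.string()),  # same type as the "name" column
        pa.field("max_value", pa.string()),
    ])),
])
print(stats_schema)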

TITLE: Define Custom TensorSpec for Lance TensorFlow Dataset Output DESCRIPTION: This code shows how to explicitly define the tf.TensorSpec for the output signature of a tf.data.Dataset created from Lance. This is crucial for precise type and shape control, especially when automatic inference is insufficient or for complex data structures like ragged tensors.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/integrations/tensorflow.md#_snippet_2

LANGUAGE: python CODE:

batch_size = 256
ds = lance.tf.data.from_lance(
    "s3://my-bucket/my-dataset",
    columns=["image", "labels"],
    batch_size=batch_size,
    output_signature={
        "image": tf.TensorSpec(shape=(), dtype=tf.string),
        "labels": tf.RaggedTensorSpec(
            dtype=tf.int32, shape=(batch_size, None), ragged_rank=1),
    },
)

TITLE: SQL Literals for Date, Timestamp, and Decimal Types DESCRIPTION: This SQL snippet illustrates how to specify literals for date, timestamp, and decimal columns in Lance filter expressions. It shows the syntax for casting string values to specific data types, ensuring correct interpretation during query execution.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/read_and_write.md#_snippet_18

LANGUAGE: sql CODE:

date_col = date '2021-01-01'
and timestamp_col = timestamp '2021-01-01 00:00:00'
and decimal_col = decimal(8,3) '1.000'

TITLE: Add New Columns to a Lance Dataset in a Distributed Manner DESCRIPTION: This snippet demonstrates adding new columns to a Lance dataset efficiently without copying existing data. It shows how to merge columns on individual fragments across workers using frag.merge_columns and then commit the changes using lance.LanceOperation.Merge on a single worker. This leverages Lance's two-dimensional layout for metadata-only operations, making column additions highly efficient.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/distributed_write.md#_snippet_3

LANGUAGE: python CODE:

import lance
import pyarrow as pa
import pyarrow.compute as pc
from pyarrow import RecordBatch

dataset = lance.dataset("./add_columns_example")
assert len(dataset.get_fragments()) == 2
assert dataset.to_table().combine_chunks() == pa.Table.from_pydict({
    "name": ["alice", "bob", "charlie", "craig", "dave", "eve"],
    "age": [25, 33, 44, 55, 66, 77],
}, schema=schema)


def name_len(names: RecordBatch) -> RecordBatch:
    return RecordBatch.from_arrays(
        [pc.utf8_length(names["name"])],
        ["name_len"],
    )

# On Worker 1
frag1 = dataset.get_fragments()[0]
new_fragment1, new_schema = frag1.merge_columns(name_len, ["name"])

# On Worker 2
frag2 = dataset.get_fragments()[1]
new_fragment2, _ = frag2.merge_columns(name_len, ["name"])

# On Worker 3 - Commit
all_fragments = [new_fragment1, new_fragment2]
op = lance.LanceOperation.Merge(all_fragments, schema=new_schema)
lance.LanceDataset.commit(
    "./add_columns_example",
    op,
    read_version=dataset.version,
)

# Verify dataset
dataset = lance.dataset("./add_columns_example")
print(dataset.to_table().to_pandas())

TITLE: Plot Median Query Latency Histogram DESCRIPTION: This snippet generates a histogram of the median query latency using seaborn's displot function. It visualizes the distribution of the '50%' column (median response time) from the DataFrame and sets appropriate x and y axis labels.

SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/sift/Results.ipynb#_snippet_2

LANGUAGE: python CODE:

ax = sns.displot(df, x="50%")
ax.set(xlabel="Median response time seconds", ylabel="Number of configurations")

TITLE: Implement Custom PyTorch Sampler for Non-Overlapping Data DESCRIPTION: The LanceSampler class is a custom PyTorch Sampler designed to prevent overlapping samples during LLM training, which can lead to overfitting. It ensures that the indices returned are block_size apart, guaranteeing that each sample processed by the model is unique and non-redundant. The sampler pre-calculates and shuffles available indices, yielding them during iteration to provide distinct data chunks for each batch.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/llm_training.md#_snippet_3

LANGUAGE: python CODE:

import numpy as np
from torch.utils.data import Sampler


class LanceSampler(Sampler):
    r"""Samples tokens randomly but `block_size` indices apart.

    Args:
        data_source (Dataset): dataset to sample from
        block_size (int): minimum index distance between each random sample
    """

    def __init__(self, data_source, block_size=512):
        self.data_source = data_source
        self.num_samples = len(self.data_source)
        self.available_indices = list(range(0, self.num_samples, block_size))
        np.random.shuffle(self.available_indices)

    def __iter__(self):
        yield from self.available_indices

    def __len__(self) -> int:
        return len(self.available_indices)
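
A minimal usage sketch (not part of the original example; `dataset` and `collate_fn` stand in for whatever the surrounding training code defines) showing how the sampler plugs into a standard PyTorch DataLoader:

from torch.utils.data import DataLoader

# Hypothetical wiring: only the sampler argument is specific to LanceSampler.
sampler = LanceSampler(dataset, block_size=512)
loader = DataLoader(dataset, batch_size=16, sampler=sampler, collate_fn=collate_fn)

for batch in loader:
    ...  # each batch is built from indices at least `block_size` apart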

TITLE: Insert New Rows Only in LanceDB Dataset DESCRIPTION: Illustrates how to use merge_insert with when_not_matched_insert_all() to insert data only if it doesn't already exist in the dataset. This is useful for preventing duplicate entries when processing batches of data where some records might have been added previously.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/read_and_write.md#_snippet_8

LANGUAGE: python CODE:

import lance
import pyarrow as pa

# Bob is already in the table, but Carla is new
new_table = pa.Table.from_pylist([{"name": "Bob", "age": 30},
                                  {"name": "Carla", "age": 37}])

dataset = lance.dataset("./alice_and_bob.lance")

# This will insert Carla but leave Bob unchanged
_ = dataset.merge_insert("name") \
       .when_not_matched_insert_all() \
       .execute(new_table)
# Verify that Carla was added but Bob remains unchanged
print(dataset.to_table().to_pandas())
#     name  age
# 0  Alice   20
# 1    Bob   30
# 2  Carla   37

TITLE: Replace Filtered Data with New Rows in LanceDB DESCRIPTION: Explains a less common but powerful use case of merge_insert to replace a specific region of existing rows (defined by a filter) with new data. This effectively acts as a combined delete and insert operation within a single transaction, using when_not_matched_by_source_delete().

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/read_and_write.md#_snippet_10

LANGUAGE: python CODE:

import lance
import pyarrow as pa

new_table = pa.Table.from_pylist([{"name": "Edgar", "age": 46},
                                  {"name": "Francene", "age": 44}])

dataset = lance.dataset("./alice_and_bob.lance")
print(dataset.to_table().to_pandas())
#       name  age
# 0    Alice   20
# 1      Bob   30
# 2  Charlie   45
# 3    Donna   50

# This will remove anyone aged 40 or older and insert our new data
_ = dataset.merge_insert("name") \
       .when_not_matched_insert_all() \
       .when_not_matched_by_source_delete("age >= 40") \
       .execute(new_table)
# Verify the results - people over 40 replaced with new data
print(dataset.to_table().to_pandas())
#        name  age
# 0     Alice   20
# 1       Bob   30
# 2     Edgar   46
# 3  Francene   44

TITLE: Distributed Training with Shuffled Lance Fragments in TensorFlow DESCRIPTION: This snippet outlines a strategy for distributed training by sharding and shuffling Lance fragments across multiple workers. It uses lance_fragments to manage the distribution of data, ensuring each worker processes a unique subset of the dataset for efficient parallel training.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/integrations/tensorflow.md#_snippet_3

LANGUAGE: python CODE:

import tensorflow as tf
from lance.tf.data import from_lance, lance_fragments

world_size = 32
rank = 10
seed = 123
epoch = 100

dataset_uri = "s3://my-bucket/my-dataset"

# Shuffle fragments and shard them across workers: each worker keeps only
# the fragments whose position matches its rank.
fragments = (
    lance_fragments(dataset_uri)
    .shuffle(32, seed=seed)
    .repeat(epoch)
    .enumerate()
    .filter(lambda i, _: i % world_size == rank)
    .map(lambda _, fid: fid)
)

ds = from_lance(
    dataset_uri,
    columns=["image", "label"],
    fragments=fragments,
    batch_size=32,
)
for batch in ds:
    print(batch)

TITLE: LanceDB Deletion File Naming Convention DESCRIPTION: This snippet specifies the naming convention for deletion files in LanceDB, which are used to mark rows for deletion. It details the components of the filename, including fragment ID, read version, and a random ID, along with the file type suffix (Arrow or Roaring Bitmap).

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/format/format.md#_snippet_9

LANGUAGE: text CODE:

_deletions/{fragment_id}-{read_version}-{random_id}.{arrow|bin}
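
For illustration only (the concrete values are made up), a deletion file for fragment 12 written against read version 7 could be named _deletions/12-7-184e305c.arrow if the deleted row ids are stored as an Arrow file, or _deletions/12-7-184e305c.bin if they are stored as a Roaring Bitmap.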

TITLE: Convert NumPy BFloat16 Array to Lance Extension Arrays DESCRIPTION: This snippet demonstrates how to convert an existing NumPy array of bfloat16 dtype into Lance's PandasBFloat16Array or BFloat16Array. It showcases the interoperability between NumPy's ml_dtypes and Lance's extension arrays, facilitating data integration.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/arrays.md#_snippet_2

LANGUAGE: python CODE:

import numpy as np
from ml_dtypes import bfloat16
from lance.arrow import PandasBFloat16Array, BFloat16Array

np_array = np.array([1.1, 2.1, 3.4], dtype=bfloat16)
PandasBFloat16Array.from_numpy(np_array)
# <PandasBFloat16Array>
# [1.1015625, 2.09375, 3.40625]
# Length: 3, dtype: lance.bfloat16
BFloat16Array.from_numpy(np_array)
# <lance.arrow.BFloat16Array object at 0x...>
# [
#   1.1015625,
#   2.09375,
#   3.40625
# ]

TITLE: Rename Nested Columns in LanceDB Dataset DESCRIPTION: This snippet demonstrates how to rename nested columns within a LanceDB dataset using lance.LanceDataset.alter_columns. It shows how to specify nested paths using dot notation (e.g., 'meta.id') and verifies the renaming by printing the dataset's content.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/data_evolution.md#_snippet_6

LANGUAGE: python CODE:

import lance
import pyarrow as pa

data = [
  {"meta": {"id": 1, "name": "Alice"}},
  {"meta": {"id": 2, "name": "Bob"}}
]
schema = pa.schema([
    ("meta", pa.struct([
        ("id", pa.int32()),
        ("name", pa.string()),
    ]))
])
table = pa.Table.from_pylist(data, schema=schema)
dataset = lance.write_dataset(table, "nested_rename")
dataset.alter_columns({"path": "meta.id", "name": "new_id"})
print(dataset.to_table().to_pandas())
#                                  meta
# 0  {'new_id': 1, 'name': 'Alice'}
# 1    {'new_id': 2, 'name': 'Bob'}

TITLE: Delete Rows from Lance Dataset by SQL Filter DESCRIPTION: Explains how to delete rows from a Lance dataset using a SQL-like filter expression with the LanceDataset.delete method. Note that this operation creates a new version of the dataset, requiring it to be reopened to see changes.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/read_and_write.md#_snippet_3

LANGUAGE: python CODE:

import lance

dataset = lance.dataset("./alice_and_bob.lance")
dataset.delete("name = 'Bob'")
dataset2 = lance.dataset("./alice_and_bob.lance")
print(dataset2.to_table().to_pandas())
#     name  age
# 0  Alice   20