======================== CODE SNIPPETS
TITLE: Run LanceDB documentation examples tests DESCRIPTION: Checks the documentation examples for correctness and consistency, ensuring they function as expected.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_3
LANGUAGE: shell CODE:
make doctest
TITLE: Install documentation website requirements
DESCRIPTION: This command installs the necessary Python packages for building the main documentation website, which is powered by mkdocs-material. It ensures all dependencies are met before serving the docs.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/community/contributing.md#_snippet_7
LANGUAGE: bash CODE:
pip install -r docs/requirements.txt
TITLE: Build and serve documentation website locally
DESCRIPTION: These commands navigate to the docs directory and start a local development server for the documentation website. This allows contributors to preview changes to the documentation in real time.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/community/contributing.md#_snippet_8
LANGUAGE: bash CODE:
cd docs
mkdocs serve
TITLE: Perform Python development installation
DESCRIPTION: These commands navigate into the python directory and perform a development installation of the Lance Python bindings. This allows developers to import and test changes to the Python wrapper directly.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/community/contributing.md#_snippet_1
LANGUAGE: bash CODE:
cd python
maturin develop
TITLE: Example output of git commit with pre-commit hooks DESCRIPTION: Demonstrates the console output when committing changes after pre-commit hooks are installed, showing the execution and status of linters like black, isort, and ruff.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_8
LANGUAGE: shell CODE:
git commit -m"Changed some python files"
black....................................................................Passed
isort (python)...........................................................Passed
ruff.....................................................................Passed
[main daf91ed] Changed some python files
1 file changed, 1 insertion(+), 1 deletion(-)
TITLE: Install LanceDB test dependencies DESCRIPTION: Installs the necessary Python packages for running tests, including optional test dependencies specified in the project's setup.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_1
LANGUAGE: shell CODE:
pip install '.[tests]'
TITLE: Install pre-commit tool for LanceDB
DESCRIPTION: Installs the pre-commit tool, which enables running formatters and linters automatically before each Git commit.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_6
LANGUAGE: shell CODE:
pip install pre-commit
TITLE: Download and Extract SIFT 1M Dataset DESCRIPTION: This snippet provides shell commands to download and extract the SIFT 1M dataset, which is used as a large-scale example for vector search demonstrations. It includes commands to clean up previous downloads and extract the compressed archive.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/quickstart.md#_snippet_11
LANGUAGE: bash CODE:
rm -rf sift* vec_data.lance
wget ftp://ftp.irisa.fr/local/texmex/corpus/sift.tar.gz
tar -xzf sift.tar.gz
TITLE: Create Pandas DataFrame DESCRIPTION: This code demonstrates how to create a simple Pandas DataFrame. This DataFrame serves as a basic example for subsequent operations, such as writing data to a Lance dataset.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/quickstart.md#_snippet_1
LANGUAGE: python CODE:
df = pd.DataFrame({"a": [5]})
df
TITLE: TPCH Benchmark Setup and Execution DESCRIPTION: This snippet outlines the steps to set up the dataset and run the TPCH Q1 benchmark comparing LanceDB and Parquet. It includes navigating to the benchmark directory, creating a dataset folder, downloading and renaming the necessary Parquet file, and executing the benchmark script. Note: generating the Lance file happens inside the benchmark script; no separate command is provided for that step.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/tpch/README.md#_snippet_0
LANGUAGE: Shell CODE:
cd lance/benchmarks/tpch
mkdir dataset && cd dataset
wget https://github.com/cwida/duckdb-data/releases/download/v1.0/lineitemsf1.snappy.parquet -O lineitem_sf1.parquet
cd ..
LANGUAGE: Shell CODE:
python3 benchmark.py q1
TITLE: Install LanceDB pre-commit hooks DESCRIPTION: Installs the pre-commit hooks defined in the project's configuration, activating automatic linting and formatting on commit attempts.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_7
LANGUAGE: shell CODE:
pre-commit install
TITLE: Install Python bindings build tool
DESCRIPTION: This command installs maturin, a tool essential for building Python packages that integrate with Rust code. It's a prerequisite for setting up the Python development environment for Lance.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/community/contributing.md#_snippet_0
LANGUAGE: bash CODE:
pip install maturin
TITLE: Start Local Services for S3 Integration Tests DESCRIPTION: Before running S3 integration tests, you need to start local Minio and DynamoDB services. This command uses Docker Compose to bring up these required services, ensuring the test environment is ready.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_20
LANGUAGE: Shell CODE:
docker compose up
TITLE: Install preview pylance Python SDK via pip DESCRIPTION: Install the preview version of the pylance Python SDK to access the latest features and bug fixes. This uses a specific extra index URL for LanceDB's PyPI.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/install.md#_snippet_1
LANGUAGE: Bash CODE:
pip install --pre --extra-index-url https://pypi.fury.io/lancedb/ pylance
TITLE: Access Specific Lance Dataset Version
DESCRIPTION: This example demonstrates how to load and query a specific historical version of a Lance dataset. By specifying the version parameter, users can access data as it existed at a particular point in time, enabling historical analysis or rollbacks.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/quickstart.md#_snippet_8
LANGUAGE: python CODE:
# Version 1
lance.dataset('/tmp/test.lance', version=1).to_table().to_pandas()
# Version 2
lance.dataset('/tmp/test.lance', version=2).to_table().to_pandas()
TITLE: Install stable pylance Python SDK via pip DESCRIPTION: Install the stable and recommended version of the pylance Python SDK using the pip package manager.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/install.md#_snippet_0
LANGUAGE: Bash CODE:
pip install pylance
TITLE: Run all LanceDB tests
DESCRIPTION: Executes the full test suite for the LanceDB project using the make test command.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_2
LANGUAGE: shell CODE:
make test
TITLE: Install Linux Perf Tools and Configure Kernel Parameters
DESCRIPTION: Installs necessary Linux performance tools (perf) on Ubuntu systems and configures the perf_event_paranoid kernel parameter. This setup is crucial for allowing non-root users to collect performance data using tools like perf and flamegraph.
SOURCE: https://github.com/lancedb/lance/blob/main/wiki/Debug.md#_snippet_4
LANGUAGE: sh CODE:
sudo apt install linux-tools-common linux-tools-generic linux-tools-`uname -r`
sudo sh -c "echo -1 > /proc/sys/kernel/perf_event_paranoid"
TITLE: Load Lance Vector Dataset DESCRIPTION: This snippet shows how to load a previously created Lance vector dataset. This step is essential before performing any vector search queries or other operations on the dataset.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/quickstart.md#_snippet_13
LANGUAGE: python CODE:
uri = "vec_data.lance"
sift1m = lance.dataset(uri)
TITLE: Prepare Parquet File from Pandas DataFrame DESCRIPTION: This code prepares a Parquet file from a Pandas DataFrame using PyArrow. It cleans up any existing Parquet or Lance files to ensure a fresh start, then converts the DataFrame to a PyArrow Table and writes it as a Parquet dataset.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/quickstart.md#_snippet_3
LANGUAGE: python CODE:
shutil.rmtree("/tmp/test.parquet", ignore_errors=True)
shutil.rmtree("/tmp/test.lance", ignore_errors=True)
tbl = pa.Table.from_pandas(df)
pa.dataset.write_dataset(tbl, "/tmp/test.parquet", format='parquet')
parquet = pa.dataset.dataset("/tmp/test.parquet")
parquet.to_table().to_pandas()
TITLE: Install required Python libraries
DESCRIPTION: Installs necessary Python packages for data handling, OpenAI API interaction, rate limiting, and LanceDB. The --quiet flag suppresses verbose output during installation.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/youtube_transcript_search.ipynb#_snippet_0
LANGUAGE: python CODE:
!pip install --quiet openai tqdm ratelimiter retry datasets pylance
TITLE: Run Rust unit tests DESCRIPTION: This command executes the unit tests for the Rust core format. Running these tests verifies the correctness of the Rust implementation.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/community/contributing.md#_snippet_6
LANGUAGE: bash CODE:
cargo test
TITLE: Profile a LanceDB benchmark using flamegraph
DESCRIPTION: Generates a flamegraph for a specific benchmark using cargo-flamegraph, aiding in performance analysis. It's recommended to run benchmarks once beforehand to avoid setup time being captured in the profile.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_14
LANGUAGE: shell CODE:
flamegraph -F 100 --no-inline -- $(which python) \
-m pytest python/benchmarks \
--benchmark-min-time=2 \
-k test_ivf_pq_index_search
TITLE: Install Flamegraph Tool
DESCRIPTION: Installs the flamegraph profiling tool using Cargo, Rust's package manager. This tool is essential for visualizing CPU usage and call stacks as flame graphs for performance analysis.
SOURCE: https://github.com/lancedb/lance/blob/main/wiki/Debug.md#_snippet_3
LANGUAGE: sh CODE:
cargo install flamegraph
TITLE: Set up BigANN Benchmark Environment DESCRIPTION: This snippet provides commands to set up a Python virtual environment, clone the 'big-ann-benchmarks' repository, and install its required dependencies. It prepares the system for running BigANN benchmarks by ensuring all necessary tools and libraries are in place.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/bigann/README.md#_snippet_0
LANGUAGE: bash CODE:
python -m venv venv
. ./venv/bin/activate
git clone https://github.com/harsha-simhadri/big-ann-benchmarks.git
cd big-ann-benchmarks
pip install -r requirements_py3.10.txt
TITLE: List Lance Dataset Versions DESCRIPTION: This code shows how to retrieve a list of all available versions for a Lance dataset. This functionality is crucial for understanding the history of changes and for accessing specific historical states of the data.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/quickstart.md#_snippet_7
LANGUAGE: python CODE:
dataset.versions()
TITLE: Install Lance Build Dependencies on Ubuntu DESCRIPTION: This command installs necessary system-level dependencies for building Lance on Ubuntu 22.04, including protobuf, SSL development libraries, and general build tools.
SOURCE: https://github.com/lancedb/lance/blob/main/wiki/How-to-Build.md#_snippet_0
LANGUAGE: bash CODE:
sudo apt install protobuf-compiler libssl-dev build-essential pkg-config gfortran
TITLE: Build Rust core format (release) DESCRIPTION: This command compiles the Rust core format in release mode. The release build is optimized for performance and is suitable for production deployments or benchmarking.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/community/contributing.md#_snippet_5
LANGUAGE: bash CODE:
cargo build -r
TITLE: Debug Python Script with LLDB DESCRIPTION: Demonstrates how to start an LLDB debugging session for a Python script. It involves launching LLDB with the Python interpreter from a virtual environment and then running the target script within the LLDB prompt.
SOURCE: https://github.com/lancedb/lance/blob/main/wiki/Debug.md#_snippet_2
LANGUAGE: sh CODE:
$ lldb ./venv/bin/python
(lldb) r script.py
TITLE: Install Lance Build Dependencies on Mac DESCRIPTION: This command installs the protobuf compiler using Homebrew, a required dependency for building Lance on macOS.
SOURCE: https://github.com/lancedb/lance/blob/main/wiki/How-to-Build.md#_snippet_1
LANGUAGE: bash CODE:
brew install protobuf
TITLE: Configure LLDB Initialization Settings
DESCRIPTION: Sets up basic LLDB initialization settings in the ~/.lldbinit file. This includes configuring the number of source code lines to display before and after a stop, and enabling the loading of .lldbinit files from the current working directory.
SOURCE: https://github.com/lancedb/lance/blob/main/wiki/Debug.md#_snippet_0
LANGUAGE: lldb CODE:
# ~/.lldbinit
settings set stop-line-count-before 15
settings set stop-line-count-after 15
settings set target.load-cwd-lldbinit true
TITLE: List all versions of a Lance dataset DESCRIPTION: Retrieves and displays the version history of the Lance dataset, showing all previous and current states of the data.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_9
LANGUAGE: Python CODE:
dataset.versions()
TITLE: Load Lance Dataset DESCRIPTION: Initializes a Lance dataset object from a specified URI, preparing it for subsequent operations like nearest neighbor searches.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_20
LANGUAGE: python CODE:
sift1m = lance.dataset(uri)
TITLE: Complete Lance Dataset Write and Read Example in Rust
DESCRIPTION: This Rust main function provides a complete example demonstrating the usage of the write_dataset and read_dataset functions. It sets up the necessary arrow and lance imports, defines a temporary data path, and orchestrates the writing and subsequent reading of a Lance dataset.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/rust/write_read_dataset.md#_snippet_2
LANGUAGE: Rust CODE:
use arrow::array::UInt32Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::{RecordBatch, RecordBatchIterator};
use futures::StreamExt;
use lance::dataset::{WriteMode, WriteParams};
use lance::Dataset;
use std::sync::Arc;
#[tokio::main]
async fn main() {
let data_path: &str = "./temp_data.lance";
write_dataset(data_path).await;
read_dataset(data_path).await;
}
TITLE: Rust: Main Workflow for WikiText to LanceDB Ingestion
DESCRIPTION: This comprehensive example demonstrates the full data ingestion pipeline in Rust. It initializes a Tokio runtime, loads a tokenizer, sets up the Hugging Face API to download WikiText Parquet files, processes them into a WikiTextBatchReader, and finally writes the data to a Lance dataset. It also includes verification of the created dataset.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/rust/llm_dataset_creation.md#_snippet_2
LANGUAGE: Rust CODE:
fn main() -> Result<(), Box<dyn Error + Send + Sync>> {
let rt = tokio::runtime::Runtime::new()?;
rt.block_on(async {
// Load tokenizer
let tokenizer = load_tokenizer("gpt2")?;
// Set up Hugging Face API
// Download from https://huggingface.co/datasets/Salesforce/wikitext/tree/main/wikitext-103-raw-v1
let api = Api::new()?;
let repo = api.repo(Repo::with_revision(
"Salesforce/wikitext".into(),
RepoType::Dataset,
"main".into(),
));
// Define the parquet files we want to download
let train_files = vec![
"wikitext-103-raw-v1/train-00000-of-00002.parquet",
"wikitext-103-raw-v1/train-00001-of-00002.parquet",
];
let mut parquet_readers = Vec::new();
for file in &train_files {
println!("Downloading file: {}", file);
let file_path = repo.get(file)?;
let data = std::fs::read(file_path)?;
// Create a temporary file in the system temp directory and write the downloaded data to it
let mut temp_file = NamedTempFile::new()?;
temp_file.write_all(&data)?;
// Create the parquet reader builder with a larger batch size
let builder = ParquetRecordBatchReaderBuilder::try_new(temp_file.into_file())?
.with_batch_size(8192); // Increase batch size for better performance
parquet_readers.push(builder);
}
if parquet_readers.is_empty() {
println!("No parquet files found to process.");
return Ok(());
}
// Create batch reader
let num_samples: u64 = 500_000;
let batch_reader = WikiTextBatchReader::new(parquet_readers, tokenizer, Some(num_samples))?;
// Save as Lance dataset
println!("Writing to Lance dataset...");
let lance_dataset_path = "rust_wikitext_lance_dataset.lance";
let write_params = WriteParams::default();
lance::Dataset::write(batch_reader, lance_dataset_path, Some(write_params)).await?;
// Verify the dataset
let ds = lance::Dataset::open(lance_dataset_path).await?;
let scanner = ds.scan();
let mut stream = scanner.try_into_stream().await?;
let mut total_rows = 0;
while let Some(batch_result) = stream.next().await {
let batch = batch_result?;
total_rows += batch.num_rows();
}
println!(
"Lance dataset created successfully with {} rows",
total_rows
);
println!("Dataset location: {}", lance_dataset_path);
Ok(())
})
}
TITLE: Build and Test Pylance Python Package DESCRIPTION: These commands set up a Python virtual environment, install maturin for Rust-Python binding, build the Pylance package in debug mode, and then run its associated tests.
SOURCE: https://github.com/lancedb/lance/blob/main/wiki/How-to-Build.md#_snippet_3
LANGUAGE: bash CODE:
cd python
python3 -m venv venv
source venv/bin/activate
pip install maturin
# Build debug build
maturin develop --extras tests
# Run pytest
pytest python/tests/
TITLE: Install Lance using Cargo DESCRIPTION: Installs the Lance Rust library as a command-line tool using the Cargo package manager.
SOURCE: https://github.com/lancedb/lance/blob/main/rust/lance/README.md#_snippet_0
LANGUAGE: shell CODE:
cargo install lance
TITLE: Append Data to Lance Dataset
DESCRIPTION: This example illustrates how to append new data to an existing Lance dataset. It creates a new Pandas DataFrame, converts it to a PyArrow Table, and then uses lance.write_dataset with mode="append" to add the new rows, creating a new version of the dataset.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/quickstart.md#_snippet_5
LANGUAGE: python CODE:
df = pd.DataFrame({"a": [10]})
tbl = pa.Table.from_pandas(df)
dataset = lance.write_dataset(tbl, "/tmp/test.lance", mode="append")
dataset.to_table().to_pandas()
TITLE: Access Lance Dataset by Tag DESCRIPTION: This code demonstrates how to load a Lance dataset using a previously defined tag instead of a numerical version. This allows for more intuitive access to specific, meaningful versions of the data, improving readability and maintainability.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/quickstart.md#_snippet_10
LANGUAGE: python CODE:
lance.dataset('/tmp/test.lance', version="stable").to_table().to_pandas()
TITLE: Build pylance in release mode for benchmarks
DESCRIPTION: Builds the pylance module in release mode with debug symbols, enabling benchmark execution and profiling. It includes benchmark-specific extras and features for data generation.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_10
LANGUAGE: shell CODE:
maturin develop --profile release-with-debug --extras benchmarks --features datagen
TITLE: Query Lance Dataset with Simple SQL in Rust DataFusion
DESCRIPTION: This Rust example demonstrates how to register a Lance dataset as a table in DataFusion using LanceTableProvider and execute a simple SQL SELECT query to retrieve the first 10 rows. It shows the basic setup for integrating Lance with DataFusion.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/integrations/datafusion.md#_snippet_0
LANGUAGE: rust CODE:
use datafusion::prelude::SessionContext;
use crate::datafusion::LanceTableProvider;
let ctx = SessionContext::new();
ctx.register_table("dataset",
Arc::new(LanceTableProvider::new(
Arc::new(dataset.clone()),
/* with_row_id */ false,
/* with_row_addr */ false,
)))?;
let df = ctx.sql("SELECT * FROM dataset LIMIT 10").await?;
let result = df.collect().await?;
TITLE: Install Lance Preview Release
DESCRIPTION: Installs a preview release of the pylance library, which includes the latest features and bug fixes. Preview releases are published more frequently and offer early access to new developments.
SOURCE: https://github.com/lancedb/lance/blob/main/README.md#_snippet_1
LANGUAGE: shell CODE:
pip install --pre --extra-index-url https://pypi.fury.io/lancedb/ pylance
TITLE: Install LanceDB and Python Dependencies DESCRIPTION: Installs specific versions of LanceDB, pandas, and duckdb required for running the benchmarks. This ensures compatibility and reproducibility of the benchmark results.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/sift/README.md#_snippet_0
LANGUAGE: sh CODE:
pip install lancedb==0.3.6
pip install pandas~=2.1.0
pip install duckdb~=0.9.0
TITLE: Prepare HD-Vila Dataset with Python venv
DESCRIPTION: This snippet outlines the steps to set up a Python virtual environment, activate it, and install necessary dependencies from requirements.txt for the HD-Vila dataset. It ensures a clean and isolated environment for project dependencies.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/hd-vila/README.md#_snippet_0
LANGUAGE: python CODE:
python3 -m venv venv
. ./venv/bin/activate
pip install -r requirements.txt
TITLE: Run Python unit and integration tests DESCRIPTION: These commands execute the unit tests and integration tests for the Python components of the Lance project. Running these tests is crucial to ensure code changes do not introduce regressions.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/community/contributing.md#_snippet_2
LANGUAGE: bash CODE:
make test
make integtest
TITLE: Import necessary libraries for LanceDB operations
DESCRIPTION: This snippet imports shutil, lance, numpy, pandas, and pyarrow for file system operations, LanceDB interactions, numerical computing, data manipulation, and Arrow table handling, respectively.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_0
LANGUAGE: Python CODE:
import shutil
import lance
import numpy as np
import pandas as pd
import pyarrow as pa
TITLE: Create a Pandas DataFrame for LanceDB DESCRIPTION: Initializes a simple Pandas DataFrame with a single column 'a' and a value of 5. This DataFrame will be used as input for creating a Lance dataset.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_1
LANGUAGE: Python CODE:
df = pd.DataFrame({"a": [5]})
df
TITLE: Sample Query Vectors from Lance Dataset DESCRIPTION: This code demonstrates how to sample a subset of vectors from the loaded Lance dataset to be used as query vectors for nearest neighbor search. It leverages DuckDB for efficient sampling of the vector column.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/quickstart.md#_snippet_14
LANGUAGE: python CODE:
import duckdb
# Make sure DuckDB v0.7+ is installed
samples = duckdb.query("SELECT vector FROM sift1m USING SAMPLE 100").to_df().vector
TITLE: Execute Tunable Nearest Neighbor Search DESCRIPTION: Demonstrates how to perform a nearest neighbor search with tunable parameters like 'nprobes' and 'refine_factor' to balance latency and recall. The result is converted to a Pandas DataFrame.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_22
LANGUAGE: python CODE:
%%time
sift1m.to_table(
nearest={
"column": "vector",
"q": samples[0],
"k": 10,
"nprobes": 10,
"refine_factor": 5
}
).to_pandas()
TITLE: Load SIFT vector dataset from Lance file
DESCRIPTION: Defines the URI for the Lance vector dataset and then loads it using lance.dataset(), making the SIFT 1M vector data accessible for further operations.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_16
LANGUAGE: Python CODE:
uri = "vec_data.lance"
sift1m = lance.dataset(uri)
TITLE: Import LanceDB Libraries
DESCRIPTION: This snippet imports the necessary Python libraries for working with LanceDB, including shutil for file operations, lance for core LanceDB functionalities, numpy for numerical operations, pandas for data manipulation, and pyarrow for data interchange.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/quickstart.md#_snippet_0
LANGUAGE: python CODE:
import shutil
import lance
import numpy as np
import pandas as pd
import pyarrow as pa
TITLE: Run all LanceDB benchmarks (including slow tests) DESCRIPTION: Executes all performance benchmarks, including those marked as 'slow', which may take a longer time to complete.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_12
LANGUAGE: shell CODE:
pytest python/benchmarks
TITLE: Prepare Python Virtual Environment for Benchmarks
DESCRIPTION: Creates and activates a Python virtual environment, then installs required packages from requirements.txt. This isolates project dependencies and ensures a clean execution environment for the benchmark scripts.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/sift/README.md#_snippet_2
LANGUAGE: sh CODE:
python3 -m venv venv
. ./venv/bin/activate
pip install -r requirements.txt
TITLE: Create Tags for Lance Dataset Versions DESCRIPTION: This snippet illustrates how to create human-readable tags for specific versions of a Lance dataset. Tags provide a convenient way to mark and reference important dataset states, such as 'stable' or 'nightly' builds, simplifying version management.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/quickstart.md#_snippet_9
LANGUAGE: python CODE:
dataset.tags.create("stable", 2)
dataset.tags.create("nightly", 3)
dataset.tags.list()
TITLE: Run LanceDB code formatters
DESCRIPTION: Applies code formatting rules to the entire project. Specific commands like make format-python or cargo fmt can be used for language-specific formatting.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_4
LANGUAGE: shell CODE:
make format
TITLE: Build and Search HNSW Index for Vector Similarity in Rust
DESCRIPTION: This Rust code provides a complete example for vector similarity search. It defines a ground_truth function for L2 distance calculation, create_test_vector_dataset to generate synthetic fixed-size list vectors, and a main function that orchestrates the process. The main function generates or loads a dataset, builds an HNSW index using lance_index::vector::hnsw, and then performs vector searches, measuring construction and search times, and calculating recall against ground truth.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/rust/hnsw.md#_snippet_0
LANGUAGE: Rust CODE:
use std::collections::HashSet;
use std::sync::Arc;
use arrow::array::{types::Float32Type, Array, FixedSizeListArray};
use arrow::array::{AsArray, FixedSizeListBuilder, Float32Builder};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use arrow::record_batch::RecordBatchIterator;
use arrow_select::concat::concat;
use futures::stream::StreamExt;
use lance::Dataset;
use lance_index::vector::v3::subindex::IvfSubIndex;
use lance_index::vector::{
flat::storage::FlatFloatStorage,
hnsw::{builder::HnswBuildParams, HNSW},
};
use lance_linalg::distance::DistanceType;
fn ground_truth(fsl: &FixedSizeListArray, query: &[f32], k: usize) -> HashSet<u32> {
let mut dists = vec![];
for i in 0..fsl.len() {
let dist = lance_linalg::distance::l2_distance(
query,
fsl.value(i).as_primitive::<Float32Type>().values(),
);
dists.push((dist, i as u32));
}
dists.sort_by(|a, b| a.0.partial_cmp(&b.0).unwrap());
dists.truncate(k);
dists.into_iter().map(|(_, i)| i).collect()
}
pub async fn create_test_vector_dataset(output: &str, num_rows: usize, dim: i32) {
let schema = Arc::new(Schema::new(vec![Field::new(
"vector",
DataType::FixedSizeList(Arc::new(Field::new("item", DataType::Float32, true)), dim),
false,
)]));
let mut batches = Vec::new();
// Create a few batches
for _ in 0..2 {
let v_builder = Float32Builder::new();
let mut list_builder = FixedSizeListBuilder::new(v_builder, dim);
for _ in 0..num_rows {
for _ in 0..dim {
list_builder.values().append_value(rand::random::<f32>());
}
list_builder.append(true);
}
let array = Arc::new(list_builder.finish());
let batch = RecordBatch::try_new(schema.clone(), vec![array]).unwrap();
batches.push(batch);
}
let batch_reader = RecordBatchIterator::new(batches.into_iter().map(Ok), schema.clone());
println!("Writing dataset to {}", output);
Dataset::write(batch_reader, output, None).await.unwrap();
}
#[tokio::main]
async fn main() {
let uri: Option<String> = None; // None means generate test data
let column = "vector";
let ef = 100;
let max_edges = 30;
let max_level = 7;
// 1. Generate a synthetic test data of specified dimensions
let dataset = if uri.is_none() {
println!("No uri is provided, generating test dataset...");
let output = "test_vectors.lance";
create_test_vector_dataset(output, 1000, 64).await;
Dataset::open(output).await.expect("Failed to open dataset")
} else {
Dataset::open(uri.as_ref().unwrap())
.await
.expect("Failed to open dataset")
};
println!("Dataset schema: {:#?}", dataset.schema());
let batches = dataset
.scan()
.project(&[column])
.unwrap()
.try_into_stream()
.await
.unwrap()
.then(|batch| async move { batch.unwrap().column_by_name(column).unwrap().clone() })
.collect::<Vec<_>>()
.await;
let arrs = batches.iter().map(|b| b.as_ref()).collect::<Vec<_>>();
let fsl = concat(&arrs).unwrap().as_fixed_size_list().clone();
println!("Loaded {:?} batches", fsl.len());
let vector_store = Arc::new(FlatFloatStorage::new(fsl.clone(), DistanceType::L2));
let q = fsl.value(0);
let k = 10;
let gt = ground_truth(&fsl, q.as_primitive::<Float32Type>().values(), k);
for ef_construction in [15, 30, 50] {
let now = std::time::Instant::now();
// 2. Build a hierarchical graph structure for efficient vector search using Lance API
let hnsw = HNSW::index_vectors(
vector_store.as_ref(),
HnswBuildParams::default()
.max_level(max_level)
.num_edges(max_edges)
.ef_construction(ef_construction),
)
.unwrap();
let construct_time = now.elapsed().as_secs_f32();
let now = std::time::Instant::now();
// 3. Perform vector search with different parameters and compute the ground truth using L2 distance search
let results: HashSet<u32> = hnsw
.search_basic(q.clone(), k, ef, None, vector_store.as_ref())
.unwrap()
.iter()
.map(|node| node.id)
.collect();
let search_time = now.elapsed().as_micros();
println!(
"level={}, ef_construct={}, ef={} recall={}: construct={:.3}s search={:.3} us",
max_level,
ef_construction,
ef,
results.intersection(&gt).count() as f32 / k as f32,
construct_time,
search_time
);
}
}
TITLE: LanceDB Nearest Neighbor Search Parameters DESCRIPTION: This section details the parameters available for tuning nearest neighbor searches in LanceDB, including 'q', 'k', 'nprobes', and 'refine_factor'.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/quickstart.md#_snippet_19
LANGUAGE: APIDOC CODE:
"nearest": {
"column": "string", // Name of the vector column
"q": "vector", // The query vector for nearest neighbor search
"k": "integer", // The number of nearest neighbors to return
"nprobes": "integer", // How many IVF partitions to search
"refine_factor": "integer" // Controls re-ranking: if k=10 and refine_factor=5, retrieves 50 nearest neighbors by ANN and re-sorts using actual distances, then returns top 10. Improves recall without sacrificing performance too much.
}
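As a rough usage sketch of how these parameters fit together (reusing the sift1m dataset and the samples query vectors from the quickstart snippets above, so treat it as illustrative rather than canonical): with k=10 and refine_factor=5, roughly 50 ANN candidates are fetched and re-ranked by exact distance before the top 10 are returned.
LANGUAGE: python CODE:
result = sift1m.to_table(
    nearest={
        "column": "vector",   # name of the vector column to search
        "q": samples[0],      # query vector
        "k": 10,              # number of neighbors to return
        "nprobes": 20,        # IVF partitions to probe
        "refine_factor": 5,   # re-rank 10 * 5 = 50 ANN candidates by exact distance
    }
)
result.to_pandas()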
TITLE: Install Lance Python Library
DESCRIPTION: Installs the stable release of the pylance library using pip, providing access to Lance's functionalities in Python.
SOURCE: https://github.com/lancedb/lance/blob/main/README.md#_snippet_0
LANGUAGE: shell CODE:
pip install pylance
TITLE: Convert Parquet Dataset to Lance
DESCRIPTION: This snippet demonstrates the straightforward conversion of an existing PyArrow Parquet dataset into a Lance dataset. It uses lance.write_dataset to perform the conversion and then verifies the content of the newly created Lance dataset.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/quickstart.md#_snippet_4
LANGUAGE: python CODE:
dataset = lance.write_dataset(parquet, "/tmp/test.lance")
# Make sure it's the same
dataset.to_table().to_pandas()
TITLE: Convert Parquet dataset to Lance dataset DESCRIPTION: Converts an existing PyArrow Parquet dataset directly into a Lance dataset in a single line of code, demonstrating seamless integration.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_4
LANGUAGE: Python CODE:
dataset = lance.write_dataset(parquet, "/tmp/test.lance")
TITLE: Compare LanceDB benchmarks against previous version
DESCRIPTION: Provides a sequence of commands to compare the performance of the current version against the main branch. This involves saving a baseline from main and then comparing the current branch's performance against it.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_15
LANGUAGE: shell CODE:
CURRENT_BRANCH=$(git branch --show-current)
LANGUAGE: shell CODE:
git checkout main
LANGUAGE: shell CODE:
maturin develop --profile release-with-debug --features datagen
LANGUAGE: shell CODE:
pytest --benchmark-save=baseline python/benchmarks -m "not slow"
LANGUAGE: shell CODE:
COMPARE_ID=$(ls .benchmarks/*/ | tail -1 | cut -c1-4)
LANGUAGE: shell CODE:
git checkout $CURRENT_BRANCH
LANGUAGE: shell CODE:
maturin develop --profile release-with-debug --features datagen
LANGUAGE: shell CODE:
pytest --benchmark-compare=$COMPARE_ID python/benchmarks -m "not slow"
TITLE: Build Rust core format (debug) DESCRIPTION: This command compiles the Rust core format in debug mode. The debug build includes debugging information and is suitable for development and testing, though it is not optimized for performance.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/community/contributing.md#_snippet_4
LANGUAGE: bash CODE:
cargo build
TITLE: Download and extract SIFT 1M dataset for vector operations
DESCRIPTION: Removes any existing SIFT files and then downloads the sift.tar.gz archive from the specified FTP server. Finally, it extracts the contents of the tarball, preparing the SIFT 1M dataset for vector processing.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_14
LANGUAGE: Bash CODE:
!rm -rf sift* vec_data.lance
!wget ftp://ftp.irisa.fr/local/texmex/corpus/sift.tar.gz
!tar -xzf sift.tar.gz
TITLE: Format and lint Rust code
DESCRIPTION: These commands are used to automatically format Rust code according to community standards (cargo fmt) and to perform static analysis for potential issues (cargo clippy). This ensures code quality and consistency.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/community/contributing.md#_snippet_3
LANGUAGE: bash CODE:
cargo fmt --all
cargo clippy --all-features --tests --benches
TITLE: Run a specific LanceDB benchmark by name
DESCRIPTION: Filters and runs a particular benchmark using pytest's -k flag, allowing substring matching for the benchmark name.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_13
LANGUAGE: shell CODE:
pytest python/benchmarks -k test_ivf_pq_index_search
TITLE: Run LanceDB code linters
DESCRIPTION: Executes code linters to check for style violations and potential issues. Language-specific linting can be performed with make lint-python or make lint-rust.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_5
LANGUAGE: shell CODE:
make lint
TITLE: Verify converted Lance dataset content DESCRIPTION: Reads the newly created Lance dataset and converts it back to a Pandas DataFrame to confirm that the data was correctly written and matches the original content.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_5
LANGUAGE: Python CODE:
# make sure it's the same
dataset.to_table().to_pandas()
TITLE: Prepare Dbpedia-entities-openai Dataset DESCRIPTION: This snippet provides shell commands to set up a Python virtual environment, install necessary dependencies from 'requirements.txt', and then generate the Dbpedia-entities-openai dataset in Lance format using 'datagen.py'. It requires Python 3.10 or newer.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/dbpedia-openai/README.md#_snippet_0
LANGUAGE: sh CODE:
# Python 3.10+
python3 -m venv venv
. ./venv/bin/activate
# install dependencies
pip install -r requirements.txt
# Generate dataset in lance format.
./datagen.py
TITLE: Clean LanceDB build artifacts DESCRIPTION: Removes all generated build artifacts and temporary files from the project directory, useful for a clean rebuild.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_9
LANGUAGE: shell CODE:
make clean
TITLE: Query Nearest Neighbors with Specific Features DESCRIPTION: Performs a nearest neighbor search while simultaneously retrieving specific feature columns ('revenue') alongside the vector results. This demonstrates fetching combined data in a single call.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_25
LANGUAGE: python CODE:
sift1m.to_table(columns=["revenue"], nearest={"column": "vector", "q": samples[0], "k": 10}).to_pandas()
TITLE: Create named tags for Lance dataset versions DESCRIPTION: Assigns human-readable tags ('stable', 'nightly') to specific versions (2 and 3) of the Lance dataset. Then, it lists all defined tags, providing aliases for version numbers.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_12
LANGUAGE: Python CODE:
dataset.tags.create("stable", 2)
dataset.tags.create("nightly", 3)
dataset.tags.list()
TITLE: Access Lance dataset using a named tag DESCRIPTION: Loads the Lance dataset by referencing a previously created tag ('stable') instead of a version number, and converts it to a Pandas DataFrame, showcasing tag-based version access.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_13
LANGUAGE: Python CODE:
lance.dataset('/tmp/test.lance', version="stable").to_table().to_pandas()
TITLE: Run LanceDB benchmarks (excluding slow tests)
DESCRIPTION: Executes the performance benchmarks located in python/benchmarks, skipping tests explicitly marked as 'slow'. These benchmarks are designed for quick iteration and catching regressions.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_11
LANGUAGE: shell CODE:
pytest python/benchmarks -m "not slow"
TITLE: Verify overwritten Lance dataset content DESCRIPTION: Reads the current state of the Lance dataset and converts it to a Pandas DataFrame to confirm that the overwrite operation was successful and the dataset now contains the new data.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_8
LANGUAGE: Python CODE:
dataset.to_table().to_pandas()
TITLE: Rust: Load Tokenizer from Hugging Face Hub
DESCRIPTION: This function provides a utility to load a tokenizer from the Hugging Face Hub. It takes a model name, creates an API client, retrieves the tokenizer file from the specified repository, and constructs a Tokenizer object from it. This is a common pattern for integrating Hugging Face models.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/rust/llm_dataset_creation.md#_snippet_3
LANGUAGE: Rust CODE:
fn load_tokenizer(model_name: &str) -> Result<Tokenizer, Box<dyn Error + Send + Sync>> {
let api = Api::new()?;
let repo = api.repo(Repo::with_revision(
model_name.into(),
RepoType::Model,
"main".into(),
));
let tokenizer_path = repo.get("tokenizer.json")?;
let tokenizer = Tokenizer::from_file(tokenizer_path)?;
Ok(tokenizer)
}
TITLE: Sample query vectors from Lance dataset using DuckDB
DESCRIPTION: Imports duckdb and queries the sift1m Lance dataset to sample 100 vectors from the 'vector' column. The sampled vectors are converted to a Pandas DataFrame column, to be used as query inputs for KNN search.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_17
LANGUAGE: Python CODE:
import duckdb
# if this segfaults make sure duckdb v0.7+ is installed
samples = duckdb.query("SELECT vector FROM sift1m USING SAMPLE 100").to_df().vector
samples
TITLE: Prepare Parquet file for conversion to Lance
DESCRIPTION: Cleans up previous test files. Converts the Pandas DataFrame df to a PyArrow Table, then writes it to a Parquet file. Finally, it reads the Parquet file back into a PyArrow dataset and converts it to a Pandas DataFrame for display.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_3
LANGUAGE: Python CODE:
shutil.rmtree("/tmp/test.parquet", ignore_errors=True)
shutil.rmtree("/tmp/test.lance", ignore_errors=True)
tbl = pa.Table.from_pandas(df)
pa.dataset.write_dataset(tbl, "/tmp/test.parquet", format='parquet')
parquet = pa.dataset.dataset("/tmp/test.parquet")
parquet.to_table().to_pandas()
TITLE: Access a specific historical version of Lance dataset (Version 2) DESCRIPTION: Loads another specific historical version (version 2) of the Lance dataset and converts it to a Pandas DataFrame, further illustrating the versioning capabilities.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_11
LANGUAGE: Python CODE:
lance.dataset('/tmp/test.lance', version=2).to_table().to_pandas()
TITLE: Lance I/O Trace Events DESCRIPTION: Describes events emitted during significant I/O operations, particularly those related to indices, useful for debugging cache utilization.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/performance.md#_snippet_1
LANGUAGE: APIDOC CODE:
Event: lance::io_events
Parameter: type
Description: The type of I/O operation (open_scalar_index, open_vector_index, load_vector_part, load_scalar_part)
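A minimal sketch of observing these events from Python, assuming the trace_to_chrome helper in lance.tracing (referenced in the Python development docs) is available in your installed pylance version; the exact module path and signature may differ:
LANGUAGE: python CODE:
# Assumption: lance.tracing.trace_to_chrome exists and redirects emitted trace
# events (including lance::io_events) to a Chrome trace file viewable in
# chrome://tracing or Perfetto.
from lance.tracing import trace_to_chrome
import lance

trace_to_chrome(file="/tmp/lance_trace.json", level="debug")

ds = lance.dataset("vec_data.lance")
ds.to_table(limit=10)  # any scan; index-related I/O shows up as io_events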
TITLE: Import libraries and define dataset paths for Flickr8k
DESCRIPTION: This snippet imports essential Python libraries such as os, cv2, lance, pyarrow, matplotlib, and tqdm. It also defines the file paths for the Flickr8k captions file and the image dataset folder, which are crucial for subsequent data processing. It assumes the dataset and required libraries like pyarrow, pylance, opencv, and tqdm are already installed and present.
LANGUAGE: python CODE:
import os
import cv2
import random
import lance
import pyarrow as pa
import matplotlib.pyplot as plt
from tqdm.auto import tqdm
captions = "Flickr8k.token.txt"
image_folder = "Flicker8k_Dataset/"
TITLE: Build IVF_PQ index on Lance vector dataset
DESCRIPTION: Builds an IVF_PQ (Inverted File Index with Product Quantization) index on the 'vector' column of the sift1m dataset. It configures the index with 256 partitions and 16 sub-vectors for efficient approximate nearest neighbor search, significantly speeding up vector queries.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_19
LANGUAGE: Python CODE:
%%time
sift1m.create_index(
"vector",
index_type="IVF_PQ", # IVF_PQ, IVF_HNSW_PQ and IVF_HNSW_SQ are supported
num_partitions=256, # IVF
num_sub_vectors=16 # PQ
)
TITLE: Python Environment Setup for LanceDB Testing DESCRIPTION: Sets up the Python environment by ensuring the project's root directory is added to sys.path and preventing bytecode generation. This is crucial for module imports within the project structure.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/full_report/report.ipynb#_snippet_0
LANGUAGE: python CODE:
import sys
sys.dont_write_bytecode = True
import os
module_path = os.path.abspath(os.path.join('.'))
if module_path not in sys.path:
    sys.path.append(module_path)
TITLE: Add Metadata Columns to Lance Table DESCRIPTION: Appends new feature columns, 'item_id' and 'revenue', to an existing Lance table. This illustrates how to enrich dataset entries with additional metadata before writing them back.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_23
LANGUAGE: python CODE:
tbl = sift1m.to_table()
tbl = tbl.append_column("item_id", pa.array(range(len(tbl))))
tbl = tbl.append_column("revenue", pa.array((np.random.randn(len(tbl))+5)*1000))
tbl.to_pandas()
TITLE: Build MacOS x86_64 Wheels
DESCRIPTION: This command builds release-mode wheels specifically for x86_64 MacOS. It uses maturin to compile the project for the x86_64-apple-darwin target, storing the resulting wheels in the 'wheels' directory.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_26
LANGUAGE: Shell CODE:
maturin build --release \
--target x86_64-apple-darwin \
--out wheels
TITLE: Overwrite Lance dataset to create new version
DESCRIPTION: Creates a new Pandas DataFrame with different data. Converts it to a PyArrow Table and overwrites the existing Lance dataset at /tmp/test.lance using mode="overwrite", effectively creating a new version of the dataset.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_7
LANGUAGE: Python CODE:
df = pd.DataFrame({"a": [50, 100]})
tbl = pa.Table.from_pandas(df)
dataset = lance.write_dataset(tbl, "/tmp/test.lance", mode="overwrite")
TITLE: Run Dbpedia-entities-openai Benchmark DESCRIPTION: This command executes the 'benchmarks.py' script to run top-k vector queries. The script tests various combinations of IVF and PQ values, as well as 'refine_factor', to evaluate performance. The example specifies a top-k value of 20.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/dbpedia-openai/README.md#_snippet_1
LANGUAGE: sh CODE:
./benchmarks.py -k 20
TITLE: Build and Test Lance Rust Package DESCRIPTION: These commands clone the Lance repository, navigate to the Rust directory, and then build, test, and benchmark the core Rust components of Lance.
SOURCE: https://github.com/lancedb/lance/blob/main/wiki/How-to-Build.md#_snippet_2
LANGUAGE: bash CODE:
git clone https://github.com/lancedb/lance.git
# Build rust package
cd rust
cargo build
# Run test
cargo test
# Run benchmarks
cargo bench
TITLE: Query Lance Dataset with Simple SQL in Python DataFusion
DESCRIPTION: This Python example shows how to integrate Lance datasets with DataFusion using FFILanceTableProvider from pylance. It demonstrates registering a Lance dataset as a table and executing a basic SQL SELECT query to fetch the first 10 rows, highlighting the Python FFI integration.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/integrations/datafusion.md#_snippet_2
LANGUAGE: python CODE:
from datafusion import SessionContext # pip install datafusion
from lance import FFILanceTableProvider
ctx = SessionContext()
table1 = FFILanceTableProvider(
my_lance_dataset, with_row_id=True, with_row_addr=True
)
ctx.register_table_provider("table1", table1)
ctx.table("table1")
ctx.sql("SELECT * FROM table1 LIMIT 10")
TITLE: Open a LanceDB Dataset
DESCRIPTION: Provides a basic example of how to open an existing Lance dataset using the lance.dataset function. This function can be used to access datasets stored locally or in cloud storage like S3.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/read_and_write.md#_snippet_11
LANGUAGE: python CODE:
import lance
ds = lance.dataset("s3://bucket/path/imagenet.lance")
TITLE: Build LanceDB in development mode
DESCRIPTION: Builds the Rust native module in place using maturin. This command needs to be re-run whenever Rust code changes, but is not required for Python code modifications.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_0
LANGUAGE: shell CODE:
maturin develop
TITLE: Lance File Audit Trace Events DESCRIPTION: Details the events emitted when significant files are created or deleted in Lance, including the mode of I/O operation and the type of file affected.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/performance.md#_snippet_0
LANGUAGE: APIDOC CODE:
Event: lance::file_audit
Parameter: mode
Description: The mode of I/O operation (create, delete, delete_unverified)
Parameter: type
Description: The type of file affected (manifest, data file, index file, deletion file)
TITLE: Download Lindera Language Model
DESCRIPTION: Command-line instruction to download a specific Lindera language model (e.g., ipadic, ko-dic, unidic) for LanceDB. Note that lindera-cli must be installed beforehand, as Lindera models require compilation.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/tokenizer.md#_snippet_4
LANGUAGE: bash CODE:
python -m lance.download lindera -l [ipadic|ko-dic|unidic]
TITLE: Access a specific historical version of Lance dataset (Version 1) DESCRIPTION: Loads a specific historical version (version 1) of the Lance dataset and converts it to a Pandas DataFrame, demonstrating the ability to revert to or inspect past states of the data.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_10
LANGUAGE: Python CODE:
lance.dataset('/tmp/test.lance', version=1).to_table().to_pandas()
TITLE: Decorate Rust Unit Test for Tracing
DESCRIPTION: To enable tracing for a Rust unit test, decorate it with the #[lance_test_macros::test] attribute. This macro wraps any existing test attributes, allowing tracing information to be collected during test execution.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_16
LANGUAGE: Rust CODE:
#[lance_test_macros::test(tokio::test)]
async fn test() {
...
}
TITLE: Add Rust Toolchain Targets for Cross-Compilation DESCRIPTION: To build manylinux wheels for different Linux architectures, you must first add the corresponding Rust toolchain targets. These commands add the x86_64 and aarch64 GNU targets, enabling cross-compilation.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_22
LANGUAGE: Shell CODE:
rustup target add x86_64-unknown-linux-gnu
rustup target add aarch64-unknown-linux-gnu
TITLE: Query Vectors and Metadata Together in LanceDB DESCRIPTION: This code demonstrates how to perform a nearest neighbor search in LanceDB while simultaneously retrieving specified metadata columns. It allows users to fetch both vector embeddings and associated feature data ('item_id', 'revenue') in a single query, streamlining data retrieval for applications requiring both.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/quickstart.md#_snippet_21
LANGUAGE: python CODE:
result = sift1m.to_table(
columns=["item_id", "revenue"],
nearest={"column": "vector", "q": samples[0], "k": 10}
)
print(result.to_pandas())
TITLE: Build MacOS ARM64 Wheels
DESCRIPTION: This command builds release-mode wheels specifically for ARM64 (aarch64) MacOS. It uses maturin to compile the project for the aarch64-apple-darwin target, storing the resulting wheels in the 'wheels' directory.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_25
LANGUAGE: Shell CODE:
maturin build --release \
--target aarch64-apple-darwin \
--out wheels
TITLE: Rust: WikiTextBatchReader Next Batch Logic
DESCRIPTION: This snippet shows the core logic for the next method of the WikiTextBatchReader. It attempts to build and retrieve the next Parquet reader from a list of available readers. If a reader is successfully built, it's used; otherwise, it handles errors or indicates that no more readers are available.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/rust/llm_dataset_creation.md#_snippet_1
LANGUAGE: Rust CODE:
if let Some(builder) = self.parquet_readers[self.current_reader_idx].take() {
match builder.build() {
Ok(reader) => {
self.current_reader = Some(Box::new(reader));
self.current_reader_idx += 1;
continue;
}
Err(e) => {
return Some(Err(arrow::error::ArrowError::ExternalError(Box::new(e))))
}
}
}
}
// No more readers available
return None;
}
TITLE: Download and Extract SIFT1M Dataset DESCRIPTION: Downloads the SIFT1M dataset, a common benchmark for vector search, and extracts its contents. This is a prerequisite step for running the subsequent vector search examples.
SOURCE: https://github.com/lancedb/lance/blob/main/README.md#_snippet_6
LANGUAGE: shell CODE:
wget ftp://ftp.irisa.fr/local/texmex/corpus/sift.tar.gz
tar -xzf sift.tar.gz
TITLE: Measure Nearest Neighbor Query Performance DESCRIPTION: Performs multiple nearest neighbor queries on the Lance dataset using a list of sample vectors and measures the average query time. It also prints the resulting table for the last query.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_21
LANGUAGE: python CODE:
import time
tot = 0
for q in samples:
    start = time.time()
    tbl = sift1m.to_table(nearest={"column": "vector", "q": q, "k": 10})
    end = time.time()
    tot += (end - start)
print(f"Avg(sec): {tot / len(samples)}")
print(tbl.to_pandas())
TITLE: Run Rust Unit Test with Tracing Verbosity
DESCRIPTION: Execute a Rust unit test with tracing enabled by setting the LANCE_TESTING environment variable to a desired verbosity level (e.g., 'debug', 'info'). This command will generate a JSON trace file in your working directory, which can be viewed in Chrome or Perfetto.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_17
LANGUAGE: Bash CODE:
LANCE_TESTING=debug cargo test dataset::tests::test_create_dataset
TITLE: Build Linux x86_64 Manylinux Wheels
DESCRIPTION: This command builds release-mode manylinux wheels for x86_64 Linux. It utilizes maturin with zig for cross-compilation, targeting manylinux2014 compatibility, and outputs the generated wheels to the 'wheels' directory.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_23
LANGUAGE: Shell CODE:
maturin build --release --zig \
--target x86_64-unknown-linux-gnu \
--compatibility manylinux2014 \
--out wheels
TITLE: Append new rows to an existing Lance dataset
DESCRIPTION: Creates a new Pandas DataFrame with a single row. Converts it to a PyArrow Table and appends it to the existing Lance dataset at /tmp/test.lance using mode="append".
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_6
LANGUAGE: Python CODE:
df = pd.DataFrame({"a": [10]})
tbl = pa.Table.from_pandas(df)
dataset = lance.write_dataset(tbl, "/tmp/test.lance", mode="append")
dataset.to_table().to_pandas()
TITLE: Overwrite Lance Dataset
DESCRIPTION: This snippet demonstrates how to completely overwrite the data in a Lance dataset, effectively creating a new version. A new Pandas DataFrame is prepared and written to the dataset using mode="overwrite", replacing the previous content while preserving the old version for historical access.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/quickstart.md#_snippet_6
LANGUAGE: python CODE:
df = pd.DataFrame({"a": [50, 100]})
tbl = pa.Table.from_pandas(df)
dataset = lance.write_dataset(tbl, "/tmp/test.lance", mode="overwrite")
dataset.to_table().to_pandas()
TITLE: Lance Execution Trace Events DESCRIPTION: Outlines events emitted when an execution plan is run, providing insights into query performance, including output rows, I/O operations, bytes read, and index statistics.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/performance.md#_snippet_2
LANGUAGE: APIDOC CODE:
Event: lance::execution
Parameter: type
Description: The type of execution event (plan_run is the only type today)
Parameter: output_rows
Description: The number of rows in the output of the plan
Parameter: iops
Description: The number of I/O operations performed by the plan
Parameter: bytes_read
Description: The number of bytes read by the plan
Parameter: indices_loaded
Description: The number of indices loaded by the plan
Parameter: parts_loaded
Description: The number of index partitions loaded by the plan
Parameter: index_comparisons
Description: The number of comparisons performed inside the various indices
TITLE: Example Console Output of CLIP Model Training Progress DESCRIPTION: This snippet shows a typical console output during the training of the CLIP model. It displays the epoch number, the progress bar indicating batch processing, and the reported loss value for each epoch, demonstrating the training's progression and convergence.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/clip_training.md#_snippet_10
LANGUAGE: console CODE:
==================== Epoch: 1 / 2 ====================
loss: 2.0799: 100%|██████████| 253/253 [02:14<00:00, 1.88it/s]
==================== Epoch: 2 / 2 ====================
loss: 1.3064: 100%|██████████| 253/253 [02:10<00:00, 1.94it/s]
TITLE: Convert SIFT Data to Lance Vector Dataset
DESCRIPTION: This code demonstrates how to convert the raw SIFT 1M dataset, stored in a binary format, into a Lance vector dataset. It involves reading the binary data, reshaping it into a NumPy array, and then using vec_to_table and lance.write_dataset to store it efficiently for vector search.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/quickstart.md#_snippet_12
LANGUAGE: python CODE:
from lance.vector import vec_to_table
import struct
uri = "vec_data.lance"
with open("sift/sift_base.fvecs", mode="rb") as fobj:
buf = fobj.read()
data = np.array(struct.unpack("<128000000f", buf[4 : 4 + 4 * 1000000 * 128])).reshape((1000000, 128))
dd = dict(zip(range(1000000), data))
table = vec_to_table(dd)
lance.write_dataset(table, uri, max_rows_per_group=8192, max_rows_per_file=1024*1024)
TITLE: Perform KNN Search on Lance Dataset (No Index) DESCRIPTION: This snippet illustrates how to perform a K-Nearest Neighbors (KNN) search on a Lance dataset without utilizing an index. It measures the execution time to highlight the performance implications of a full dataset scan, demonstrating the need for ANN indexes in real-time scenarios.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/quickstart.md#_snippet_15
LANGUAGE: python CODE:
import time
start = time.time()
tbl = sift1m.to_table(columns=["id"], nearest={"column": "vector", "q": samples[0], "k": 10})
end = time.time()
print(f"Time(sec): {end-start}")
print(tbl.to_pandas())
TITLE: Build Linux ARM64 Manylinux Wheels
DESCRIPTION: This command builds release-mode manylinux wheels for ARM64 (aarch64) Linux. It uses maturin with zig for cross-compilation, targeting manylinux2014 compatibility, and places the output wheels in the 'wheels' directory.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_24
LANGUAGE: Shell CODE:
maturin build --release --zig \
--target aarch64-unknown-linux-gnu \
--compatibility manylinux2014 \
--out wheels
TITLE: Overwrite Lance Dataset with New Features DESCRIPTION: Writes the modified table, including newly added feature columns, back to the Lance dataset URI, overwriting the existing dataset. This updates the dataset with enriched data.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_24
LANGUAGE: python CODE:
sift1m = lance.write_dataset(tbl, uri, mode="overwrite")
TITLE: Append Metadata Columns to LanceDB Dataset DESCRIPTION: This Python snippet illustrates how to append additional metadata columns, such as 'item_id' and 'revenue', to an existing LanceDB dataset. This allows for storing and managing feature data alongside vector embeddings within the same dataset, simplifying data management.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/quickstart.md#_snippet_20
LANGUAGE: python CODE:
tbl = sift1m.to_table()
tbl = tbl.append_column("item_id", pa.array(range(len(tbl))))
tbl = tbl.append_column("revenue", pa.array((np.random.randn(len(tbl))+5)*1000))
TITLE: Create Vector Index in LanceDB (IVF_PQ) DESCRIPTION: This code demonstrates how to create a vector index on a LanceDB dataset. It specifies the vector column, index type (IVF_PQ, IVF_HNSW_PQ, IVF_HNSW_SQ are supported), number of partitions for IVF, and number of sub-vectors for PQ. This improves the efficiency of Approximate Nearest Neighbor (ANN) searches.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/quickstart.md#_snippet_16
LANGUAGE: python CODE:
sift1m.create_index(
"vector",
index_type="IVF_PQ", # IVF_PQ, IVF_HNSW_PQ and IVF_HNSW_SQ are supported
num_partitions=256, # IVF
num_sub_vectors=16, # PQ
)
TITLE: Convert SIFT FVECS data to Lance vector dataset
DESCRIPTION: Imports vec_to_table from lance.vector and struct. Reads the SIFT base vectors from sift_base.fvecs, unpacks the binary data into a NumPy array, and converts it into a PyArrow Table using vec_to_table. Finally, it writes this table to a Lance dataset named vec_data.lance, optimizing for vector storage.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_15
LANGUAGE: Python CODE:
from lance.vector import vec_to_table
import struct
uri = "vec_data.lance"
with open("sift/sift_base.fvecs", mode="rb") as fobj:
buf = fobj.read()
data = np.array(struct.unpack("<128000000f", buf[4 : 4 + 4 * 1000000 * 128])).reshape((1000000, 128))
dd = dict(zip(range(1000000), data))
table = vec_to_table(dd)
lance.write_dataset(table, uri, max_rows_per_group=8192, max_rows_per_file=1024*1024)
TITLE: Read Lance Dataset in Java
DESCRIPTION: This Java snippet demonstrates how to open and access an existing Lance dataset. It uses Dataset.open with the dataset's path and a BufferAllocator to load the dataset. Once opened, it shows how to retrieve basic information such as row count, schema, and version details, providing a starting point for data querying and manipulation.
SOURCE: https://github.com/lancedb/lance/blob/main/java/README.md#_snippet_3
LANGUAGE: Java CODE:
void readDataset() {
String datasetPath = ""; // specify a path pointing to a dataset
try (BufferAllocator allocator = new RootAllocator()) {
try (Dataset dataset = Dataset.open(datasetPath, allocator)) {
dataset.countRows();
dataset.getSchema();
dataset.version();
dataset.latestVersion();
// access more information
}
}
}
TITLE: Execute Python S3 Integration Tests
DESCRIPTION: Once local S3 services are running, this command executes the Python S3 integration tests using pytest. The --run-integration flag ensures that tests requiring external services are included, specifically targeting the test_s3_ddb.py file.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_21
LANGUAGE: Shell CODE:
pytest --run-integration python/tests/test_s3_ddb.py
TITLE: Perform Random Access on Lance Dataset in Java
DESCRIPTION: This Java example demonstrates how to perform random access queries on a Lance dataset, retrieving specific rows and columns. It opens an existing dataset, specifies a list of row indices and desired column names, and then uses dataset.take to fetch the corresponding data. The results are processed using an ArrowReader to iterate through batches and access individual field values.
SOURCE: https://github.com/lancedb/lance/blob/main/java/README.md#_snippet_5
LANGUAGE: Java CODE:
void randomAccess() {
String datasetPath = ""; // specify a path pointing to a dataset
try (BufferAllocator allocator = new RootAllocator()) {
try (Dataset dataset = Dataset.open(datasetPath, allocator)) {
List<Long> indices = Arrays.asList(1L, 4L);
List<String> columns = Arrays.asList("id", "name");
try (ArrowReader reader = dataset.take(indices, columns)) {
while (reader.loadNextBatch()) {
VectorSchemaRoot result = reader.getVectorSchemaRoot();
result.getRowCount();
for (int i = 0; i < indices.size(); i++) {
result.getVector("id").getObject(i);
result.getVector("name").getObject(i);
}
}
}
}
}
}
TITLE: Load Subset of Lance Dataset with Projection and Predicates
DESCRIPTION: This Python example illustrates how to efficiently load a subset of a Lance dataset into memory. It utilizes column projection (columns), filter push-down (filter), and pagination (limit, offset) to optimize data retrieval for large datasets by reducing I/O.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/read_and_write.md#_snippet_14
LANGUAGE: python CODE:
table = ds.to_table(
columns=["image", "label"],
filter="label = 2 AND text IS NOT NULL",
limit=1000,
offset=3000)
TITLE: Create PyTorch DataLoader from LanceDataset (Unsafe)
DESCRIPTION: This example shows how to load a Lance dataset into a PyTorch IterableDataset using lance.torch.data.LanceDataset and then create a standard PyTorch DataLoader. It highlights an inference loop, but notes that this approach is not fork-safe for multiprocessing.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/integrations/pytorch.md#_snippet_1
LANGUAGE: python CODE:
import torch
import lance.torch.data
# Load lance dataset into a PyTorch IterableDataset.
# with only columns "image" and "prompt".
dataset = lance.torch.data.LanceDataset(
"diffusiondb_train.lance",
columns=["image", "prompt"],
batch_size=128,
batch_readahead=8, # Control multi-threading reads.
)
# Create a PyTorch DataLoader
dataloader = torch.utils.data.DataLoader(dataset)
# Inference loop
for batch in dataloader:
inputs, targets = batch["prompt"], batch["image"]
outputs = model(inputs)
...
TITLE: Manage LanceDB Dataset Tags (Create, Update, Delete, List)
DESCRIPTION: This Python example demonstrates how to interact with LanceDataset.tags to manage dataset versions. It covers creating a tag for a specific version, updating its associated version, listing all tags, and finally deleting a tag. It also shows how list_ordered() can be used to retrieve tags in the order they were created or last updated.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/tags.md#_snippet_0
LANGUAGE: python CODE:
import lance
ds = lance.dataset("./tags.lance")
print(len(ds.versions()))
# 2
print(ds.tags.list())
# {}
ds.tags.create("v1-prod", 1)
print(ds.tags.list())
# {'v1-prod': {'version': 1, 'manifest_size': ...}}
ds.tags.update("v1-prod", 2)
print(ds.tags.list())
# {'v1-prod': {'version': 2, 'manifest_size': ...}}
ds.tags.delete("v1-prod")
print(ds.tags.list())
# {}
print(ds.tags.list_ordered())
# []
ds.tags.create("v1-prod", 1)
print(ds.tags.list_ordered())
# [('v1-prod', {'version': 1, 'manifest_size': ...})]
ds.tags.update("v1-prod", 2)
print(ds.tags.list_ordered())
# [('v1-prod', {'version': 2, 'manifest_size': ...})]
ds.tags.delete("v1-prod")
print(ds.tags.list_ordered())
# []
TITLE: Write Pandas DataFrame to Lance Dataset
DESCRIPTION: Removes any existing Lance dataset at /tmp/test.lance to ensure a clean write. Then, it writes the Pandas DataFrame df to a new Lance dataset and converts the resulting dataset back to a Pandas DataFrame for verification.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_2
LANGUAGE: Python CODE:
shutil.rmtree("/tmp/test.lance", ignore_errors=True)
dataset = lance.write_dataset(df, "/tmp/test.lance")
dataset.to_table().to_pandas()
TITLE: Perform K-Nearest Neighbors search without an index
DESCRIPTION: Measures the time taken to perform a K-Nearest Neighbors (KNN) search on the sift1m dataset. It queries for the 10 nearest neighbors to the first sampled vector (samples[0]) based on the 'vector' column, demonstrating a full scan approach.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/quickstart.ipynb#_snippet_18
LANGUAGE: Python CODE:
import time
start = time.time()
tbl = sift1m.to_table(columns=["id"], nearest={"column": "vector", "q": samples[0], "k": 10})
end = time.time()
print(f"Time(sec): {end-start}")
print(tbl.to_pandas())
TITLE: Write Pandas DataFrame to Lance Dataset
DESCRIPTION: This snippet shows how to persist a Pandas DataFrame into a Lance dataset. It first ensures a clean state by removing any existing file and then uses lance.write_dataset to save the DataFrame, followed by reading it back to confirm the write operation.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/quickstart.md#_snippet_2
LANGUAGE: python CODE:
shutil.rmtree("/tmp/test.lance", ignore_errors=True)
dataset = lance.write_dataset(df, "/tmp/test.lance")
dataset.to_table().to_pandas()
TITLE: Join Multiple Lance Datasets with SQL in Rust DataFusion
DESCRIPTION: This Rust example illustrates how to register multiple Lance datasets (e.g., 'orders' and 'customers') as separate tables in DataFusion. It then performs a SQL JOIN operation between these tables to combine data based on a common key, demonstrating more complex query capabilities.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/integrations/datafusion.md#_snippet_1
LANGUAGE: rust CODE:
use datafusion::prelude::SessionContext;
use crate::datafusion::LanceTableProvider;
let ctx = SessionContext::new();
ctx.register_table("orders",
Arc::new(LanceTableProvider::new(
Arc::new(orders_dataset.clone()),
/* with_row_id */ false,
/* with_row_addr */ false,
)))?;
ctx.register_table("customers",
Arc::new(LanceTableProvider::new(
Arc::new(customers_dataset.clone()),
/* with_row_id */ false,
/* with_row_addr */ false,
)))?;
let df = ctx.sql("
SELECT o.order_id, o.amount, c.customer_name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
LIMIT 10
").await?;
let result = df.collect().await?;
TITLE: Read ImageURIs into Lance EncodedImageArray
DESCRIPTION: This example shows how to use ImageURIArray.read_uris() to load images referenced by URIs into memory. The method returns an EncodedImageArray containing the binary data of the images, enabling direct processing of image content.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/arrays.md#_snippet_4
LANGUAGE: python CODE:
import os
from lance.arrow import ImageURIArray
relative_path = "images/1.png"
uris = [os.path.join(os.path.dirname(__file__), relative_path)]
ImageURIArray.from_uris(uris).read_uris()
# <lance.arrow.EncodedImageArray object at 0x...>
# [b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x00...']
TITLE: Create and Write Lance Dataset from Arrow Stream in Java
DESCRIPTION: This Java example illustrates how to create a Lance dataset and populate it with data from an existing Arrow file. It reads bytes from a source path, converts them into an ArrowArrayStream, and then uses Dataset.create with WriteParams to configure writing options like maxRowsPerFile, maxRowsPerGroup, and WriteMode. This method is suitable for ingesting data from Arrow-formatted sources.
SOURCE: https://github.com/lancedb/lance/blob/main/java/README.md#_snippet_2
LANGUAGE: Java CODE:
void createAndWriteDataset() throws IOException, URISyntaxException {
Path path = Paths.get(""); // the original source path
String datasetPath = ""; // specify a path pointing to a dataset
try (BufferAllocator allocator = new RootAllocator();
ArrowFileReader reader =
new ArrowFileReader(
new SeekableReadChannel(
new ByteArrayReadableSeekableByteChannel(Files.readAllBytes(path))), allocator);
ArrowArrayStream arrowStream = ArrowArrayStream.allocateNew(allocator)) {
Data.exportArrayStream(allocator, reader, arrowStream);
try (Dataset dataset =
Dataset.create(
allocator,
arrowStream,
datasetPath,
new WriteParams.Builder()
.withMaxRowsPerFile(10)
.withMaxRowsPerGroup(20)
.withMode(WriteParams.WriteMode.CREATE)
.withStorageOptions(new HashMap<>())
.build())) {
// access dataset
}
}
}
TITLE: Generate Flame Graph from Process ID DESCRIPTION: Generates a flame graph for a running process using its Process ID (PID). This command is used to capture and visualize CPU profiles, helping to identify performance bottlenecks in an application.
SOURCE: https://github.com/lancedb/lance/blob/main/wiki/Debug.md#_snippet_5
LANGUAGE: sh CODE:
flamegraph -p <PID>
TITLE: Create Lance BFloat16 Arrow Array
DESCRIPTION: This example illustrates how to construct a BFloat16Array directly using the lance.arrow.bfloat16_array function. It takes a list of floating-point numbers and converts them into an Arrow array with BFloat16 precision, suitable for Arrow-based data processing.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/arrays.md#_snippet_1
LANGUAGE: python CODE:
from lance.arrow import bfloat16_array
bfloat16_array([1.1, 2.1, 3.4])
# <lance.arrow.BFloat16Array object at 0x000000016feb94e0>
# [
# 1.1015625,
# 2.09375,
# 3.40625
# ]
TITLE: Clone LanceDB GitHub Repository DESCRIPTION: Instructions to clone the LanceDB project repository from GitHub to a local machine. This is the first step for setting up the development environment.
SOURCE: https://github.com/lancedb/lance/blob/main/java/README.md#_snippet_11
LANGUAGE: shell CODE:
git clone https://github.com/lancedb/lance.git
TITLE: Rust Implementation of WikiTextBatchReader
DESCRIPTION: This Rust code defines WikiTextBatchReader, a custom implementation of arrow::record_batch::RecordBatchReader. It's designed to read text data from Parquet files, tokenize it using a Tokenizer from the tokenizers crate, and transform it into Arrow RecordBatches. The process_batch method handles tokenization, limits the number of samples, and shuffles the tokenized IDs before creating the final RecordBatch.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/rust/llm_dataset_creation.md#_snippet_0
LANGUAGE: rust CODE:
use arrow::array::{Array, Int64Builder, ListBuilder, UInt32Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use arrow::record_batch::RecordBatchReader;
use futures::StreamExt;
use hf_hub::{api::sync::Api, Repo, RepoType};
use lance::dataset::WriteParams;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use rand::seq::SliceRandom;
use rand::SeedableRng;
use std::error::Error;
use std::fs::File;
use std::io::Write;
use std::sync::Arc;
use tempfile::NamedTempFile;
use tokenizers::Tokenizer;
// Implement a custom stream batch reader
struct WikiTextBatchReader {
schema: Arc<Schema>,
parquet_readers: Vec<Option<ParquetRecordBatchReaderBuilder<File>>>,
current_reader_idx: usize,
current_reader: Option<Box<dyn RecordBatchReader + Send>>,
tokenizer: Tokenizer,
num_samples: u64,
cur_samples_cnt: u64,
}
impl WikiTextBatchReader {
fn new(
parquet_readers: Vec<ParquetRecordBatchReaderBuilder<File>>,
tokenizer: Tokenizer,
num_samples: Option<u64>,
) -> Result<Self, Box<dyn Error + Send + Sync>> {
let schema = Arc::new(Schema::new(vec![Field::new(
"input_ids",
DataType::List(Arc::new(Field::new("item", DataType::Int64, true))),
false,
)]));
Ok(Self {
schema,
parquet_readers: parquet_readers.into_iter().map(Some).collect(),
current_reader_idx: 0,
current_reader: None,
tokenizer,
num_samples: num_samples.unwrap_or(100_000),
cur_samples_cnt: 0,
})
}
fn process_batch(
&mut self,
input_batch: &RecordBatch,
) -> Result<RecordBatch, arrow::error::ArrowError> {
let num_rows = input_batch.num_rows();
let mut token_builder = ListBuilder::new(Int64Builder::with_capacity(num_rows * 1024)); // Pre-allocate space
let mut should_break = false;
let column = input_batch.column_by_name("text").unwrap();
let string_array = column
.as_any()
.downcast_ref::<arrow::array::StringArray>()
.unwrap();
for i in 0..num_rows {
if self.cur_samples_cnt >= self.num_samples {
should_break = true;
break;
}
if !Array::is_null(string_array, i) {
let text = string_array.value(i);
// Split paragraph into lines
for line in text.split('\n') {
if let Ok(encoding) = self.tokenizer.encode(line, true) {
let tb_values = token_builder.values();
for &id in encoding.get_ids() {
tb_values.append_value(id as i64);
}
token_builder.append(true);
self.cur_samples_cnt += 1;
if self.cur_samples_cnt % 5000 == 0 {
println!("Processed {} rows", self.cur_samples_cnt);
}
if self.cur_samples_cnt >= self.num_samples {
should_break = true;
break;
}
}
}
}
}
// Create array and shuffle it
let input_ids_array = token_builder.finish();
// Create shuffled array by randomly sampling indices
let mut rng = rand::rngs::StdRng::seed_from_u64(1337);
let len = input_ids_array.len();
let mut indices: Vec<u32> = (0..len as u32).collect();
indices.shuffle(&mut rng);
// Take values in shuffled order
let indices_array = UInt32Array::from(indices);
let shuffled = arrow::compute::take(&input_ids_array, &indices_array, None)?;
let batch = RecordBatch::try_new(self.schema.clone(), vec![Arc::new(shuffled)]);
if should_break {
println!("Stop at {} rows", self.cur_samples_cnt);
self.parquet_readers.clear();
self.current_reader = None;
}
batch
}
}
impl RecordBatchReader for WikiTextBatchReader {
fn schema(&self) -> Arc<Schema> {
self.schema.clone()
}
}
impl Iterator for WikiTextBatchReader {
type Item = Result<RecordBatch, arrow::error::ArrowError>;
fn next(&mut self) -> Option<Self::Item> {
loop {
// If we have a current reader, try to get next batch
if let Some(reader) = &mut self.current_reader {
if let Some(batch_result) = reader.next() {
return Some(batch_result.and_then(|batch| self.process_batch(&batch)));
}
}
// If no current reader or current reader is exhausted, try to get next reader
if self.current_reader_idx < self.parquet_readers.len() {
// Assumed completion (the source snippet is truncated here): build the next
// parquet reader and keep looping until a batch is produced.
if let Some(builder) = self.parquet_readers[self.current_reader_idx].take() {
match builder.build() {
Ok(reader) => self.current_reader = Some(Box::new(reader)),
Err(e) => return Some(Err(arrow::error::ArrowError::ExternalError(Box::new(e)))),
}
}
self.current_reader_idx += 1;
} else {
// No readers left: iteration is finished
return None;
}
}
}
}
TITLE: Inefficient Row Update by Iteration
DESCRIPTION: Provides an example of an inefficient way to update multiple individual rows by iterating through a table and calling update for each row. It notes that a merge insert operation is generally more efficient for bulk updates; a sketch of that approach follows the code below.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/read_and_write.md#_snippet_6
LANGUAGE: python CODE:
import lance
import pyarrow as pa
# Change the ages of both Alice and Bob
new_table = pa.Table.from_pylist([{"name": "Alice", "age": 30},
{"name": "Bob", "age": 20}])
# This works, but is inefficient, see below for a better approach
dataset = lance.dataset("./alice_and_bob.lance")
for idx in range(new_table.num_rows):
name = new_table[0][idx].as_py()
new_age = new_table[1][idx].as_py()
dataset.update({"age": new_age}, where=f"name='{name}'")
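The merge insert approach referenced in the description is sketched below using lance's merge_insert builder on the same alice_and_bob.lance dataset; treat it as an illustration of the idea rather than a drop-in replacement.
LANGUAGE: python CODE:
import lance
import pyarrow as pa
new_table = pa.Table.from_pylist([{"name": "Alice", "age": 30},
                                  {"name": "Bob", "age": 20}])
dataset = lance.dataset("./alice_and_bob.lance")
# A single merge insert keyed on "name" updates all matching rows in one pass
(
    dataset.merge_insert("name")
    .when_matched_update_all()
    .execute(new_table)
)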
TITLE: Generate and Merge Columns in Parallel with Ray and Lance
DESCRIPTION: This example illustrates how to generate new columns in parallel using Ray and Lance. It defines an Arrow schema, creates an initial dataset with 'id', 'height', and 'weight' columns, and then uses a custom Python function (generate_labels) to add a new 'size_labels' column based on existing 'height' data, demonstrating Lance's add_columns functionality for parallel processing.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/integrations/ray.md#_snippet_1
LANGUAGE: python CODE:
import pyarrow as pa
from pathlib import Path
import ray
import lance
# Define schema
schema = pa.schema([
pa.field("id", pa.int64()),
pa.field("height", pa.int64()),
pa.field("weight", pa.int64()),
])
# Generate initial dataset
ds = (
ray.data.range(10) # Create 0-9 IDs
.map(lambda x: {
"id": x["id"],
"height": x["id"] + 5, # height = id + 5
"weight": x["id"] * 2 # weight = id * 2
})
.write_lance(str(output_path), schema=schema)
)
# Define label generation logic
def generate_labels(batch: pa.RecordBatch) -> pa.RecordBatch:
heights = batch.column("height").to_pylist()
size_labels = ["tall" if h > 8 else "medium" if h > 6 else "short" for h in heights]
return pa.RecordBatch.from_arrays([
pa.array(size_labels)
], names=["size_labels"])
# Add new columns in parallel
lance_ds = lance.dataset(output_path)
add_columns(
lance_ds,
generate_labels,
source_columns=["height"], # Input columns needed
)
# Display final results
final_df = lance_ds.to_table().to_pandas()
print("\nEnhanced dataset with size labels:\n")
print(final_df.sort_values("id").to_string(index=False))
TITLE: Configure Python Benchmark for Single Iteration Tracing
DESCRIPTION: When tracing Python benchmarks, it's often useful to force them to run only once for sensible results. This snippet demonstrates how to use the pedantic API to limit a benchmark to a single iteration and round, ensuring a focused trace.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_19
LANGUAGE: Python CODE:
def run():
"Put code to benchmark here"
...
benchmark.pedantic(run, iterations=1, rounds=1)
TITLE: Enable Tracing for Python Script
DESCRIPTION: To trace a Python script, import the trace_to_chrome function from lance.tracing and call it at the beginning of your script, specifying the desired tracing level. A single JSON trace file will be generated upon the script's exit, suitable for Chrome's trace viewer.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_18
LANGUAGE: Python CODE:
from lance.tracing import trace_to_chrome
trace_to_chrome(level="debug")
# rest of script
TITLE: LanceDB Encoding Metadata Key Specifications DESCRIPTION: This section provides a detailed specification of the metadata keys used in LanceDB for column-level encoding. It describes each key's type, purpose, example values, and how it's used in Python to configure data storage and optimization.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/format/format.md#_snippet_8
LANGUAGE: APIDOC CODE:
Metadata Key Specifications:
- lance-encoding:compression
Type: Compression
Description: Specifies compression algorithm
Example Values: zstd
Example Usage (Python): metadata={"lance-encoding:compression": "zstd"}
- lance-encoding:compression-level
Type: Compression
Description: Zstd compression level (1-22)
Example Values: 3
Example Usage (Python): metadata={"lance-encoding:compression-level": "3"}
- lance-encoding:blob
Type: Storage
Description: Marks binary data (>4MB) for chunked storage
Example Values: true/false
Example Usage (Python): metadata={"lance-encoding:blob": "true"}
- lance-encoding:packed
Type: Optimization
Description: Struct memory layout optimization
Example Values: true/false
Example Usage (Python): metadata={"lance-encoding:packed": "true"}
- lance-encoding:structural-encoding
Type: Nested Data
Description: Encoding strategy for nested structures
Example Values: miniblock/fullzip
Example Usage (Python): metadata={"lance-encoding:structural-encoding": "miniblock"}
TITLE: Initialize Tokenizer and Load Wikitext Dataset (Python) DESCRIPTION: This snippet initializes a Hugging Face tokenizer (gpt2) and loads the wikitext-103-raw-v1 dataset in streaming mode. The 'streaming=True' argument is crucial for processing large datasets without downloading the entire dataset upfront, allowing samples to be downloaded as needed.
LANGUAGE: python CODE:
import lance
import pyarrow as pa
from datasets import load_dataset
from transformers import AutoTokenizer
from tqdm.auto import tqdm # optional for progress tracking
tokenizer = AutoTokenizer.from_pretrained('gpt2')
dataset = load_dataset('wikitext', 'wikitext-103-raw-v1', streaming=True)['train']
dataset = dataset.shuffle(seed=1337)
TITLE: Example of Hierarchical Schema Definition DESCRIPTION: This snippet demonstrates a sample schema definition within the LanceDB data format, showcasing primitive types, nested structs, and lists. It illustrates how complex data structures are defined before being flattened into a field list for metadata representation.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/format/format.md#_snippet_6
LANGUAGE: APIDOC CODE:
a: i32
b: struct {
c: list<i32>
d: i32
}
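For orientation, the same hierarchy can be expressed as a PyArrow schema before it is flattened into the field list; this is a minimal sketch reusing the field names above.
LANGUAGE: python CODE:
import pyarrow as pa
schema = pa.schema([
    pa.field("a", pa.int32()),
    pa.field("b", pa.struct([
        pa.field("c", pa.list_(pa.int32())),
        pa.field("d", pa.int32()),
    ])),
])
# Flattening walks the tree depth-first, e.g. a, b, b.c, b.d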
TITLE: Define Custom PyTorch Dataset for Lance Data
DESCRIPTION: The LanceDataset class extends PyTorch's Dataset to provide an interface for loading data from a Lance dataset. It initializes by loading the specified Lance dataset and setting a block_size for token windows. The __len__ method calculates the total number of possible starting indices, while __getitem__ generates a window of indices and uses the from_indices utility to load and return corresponding 'input_ids' and 'labels' as PyTorch tensors, forming a causal sample for LLM training.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/llm_training.md#_snippet_2
LANGUAGE: python CODE:
class LanceDataset(Dataset):
def __init__(
self,
dataset_path,
block_size,
):
# Load the lance dataset from the saved path
self.ds = lance.dataset(dataset_path)
self.block_size = block_size
# Doing this so the sampler never asks for an index at the end of text
self.length = self.ds.count_rows() - block_size
def __len__(self):
return self.length
def __getitem__(self, idx):
"""
Generate a window of indices starting from the current idx to idx+block_size
and return the tokens at those indices
"""
window = np.arange(idx, idx + self.block_size)
sample = from_indices(self.ds, window)
return {"input_ids": torch.tensor(sample), "labels": torch.tensor(sample)}
TITLE: Complex SQL Filter Expression for Lance Dataset
DESCRIPTION: This SQL snippet provides an example of a complex filter expression that can be pushed down to the Lance storage system. It demonstrates the use of IN, AND, OR, NOT, and nested field access for filtering data efficiently at the storage layer.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/read_and_write.md#_snippet_16
LANGUAGE: sql CODE:
((label IN [10, 20]) AND (note['email'] IS NOT NULL))
OR NOT note['created']
TITLE: Tune ANN Search Parameters in LanceDB (nprobes, refine_factor) DESCRIPTION: This code demonstrates how to tune the performance of an Approximate Nearest Neighbor (ANN) search in LanceDB by adjusting 'nprobes' and 'refine_factor'. 'nprobes' controls the number of IVF partitions to search, while 'refine_factor' determines how many vectors are retrieved for re-ranking, balancing latency and recall.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/quickstart.md#_snippet_18
LANGUAGE: python CODE:
%%time
sift1m.to_table(
nearest={
"column": "vector",
"q": samples[0],
"k": 10,
"nprobes": 10,
"refine_factor": 5,
}
).to_pandas()
TITLE: Querying Lance Datasets with DuckDB in Python DESCRIPTION: This snippet demonstrates how to perform SQL queries on a Lance dataset using DuckDB in Python. It shows examples of selecting all data and calculating the mean of a column, illustrating DuckDB's direct access to Lance datasets via Arrow compatibility.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/integrations/duckdb.md#_snippet_0
LANGUAGE: Python CODE:
import duckdb # pip install duckdb
duckdb.query("SELECT * FROM my_lance_dataset")
# ┌─────────────┬─────────┬────────┐
# │ vector │ item │ price │
# │ float[] │ varchar │ double │
# ├─────────────┼─────────┼────────┤
# │ [3.1, 4.1] │ foo │ 10.0 │
# │ [5.9, 26.5] │ bar │ 20.0 │
# └─────────────┴─────────┴────────┘
duckdb.query("SELECT mean(price) FROM my_lance_dataset")
# ┌─────────────┐
# │ mean(price) │
# │ double │
# ├─────────────┤
# │ 15.0 │
# └─────────────┘
TITLE: Use Sharded Sampler with LanceDataset for Distributed Training
DESCRIPTION: This example illustrates how to integrate lance.sampler.ShardedFragmentSampler with LanceDataset to control the data sampling strategy for distributed training environments. It shows how to configure the sampler with the current process's rank and the total number of processes (world size) for sharded data access.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/integrations/pytorch.md#_snippet_3
LANGUAGE: python CODE:
from lance.sampler import ShardedFragmentSampler
from lance.torch.data import LanceDataset
# Load lance dataset into a PyTorch IterableDataset.
# with only columns "image" and "prompt".
dataset = LanceDataset(
"diffusiondb_train.lance",
columns=["image", "prompt"],
batch_size=128,
batch_readahead=8, # Control multi-threading reads.
sampler=ShardedFragmentSampler(
rank=1, # Rank of the current process
world_size=8, # Total number of processes
),
)
TITLE: Filter and Select Columns from Lance Dataset in TensorFlow DESCRIPTION: This example illustrates efficient data loading from Lance into TensorFlow by specifying desired columns and applying filter conditions. It leverages Lance's columnar format for optimized data retrieval, reducing memory and processing overhead.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/integrations/tensorflow.md#_snippet_1
LANGUAGE: python CODE:
ds = lance.tf.data.from_lance(
"s3://my-bucket/my-dataset",
columns=["image", "label"],
filter="split = 'train' AND collected_time > timestamp '2020-01-01'",
batch_size=256)
TITLE: Python: Decode EncodedImageArray to FixedShapeImageTensorArray
DESCRIPTION: This Python example demonstrates how to load images from URIs into an ImageURIArray, read them into an EncodedImageArray, and then decode them into a FixedShapeImageTensorArray. It also illustrates how to provide a custom TensorFlow-based decoder function for the to_tensor method, allowing for flexible image processing.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/arrays.md#_snippet_5
LANGUAGE: python CODE:
import os
from lance.arrow import ImageURIArray
uris = [os.path.join(os.path.dirname(__file__), "images/1.png")]
encoded_images = ImageURIArray.from_uris(uris).read_uris()
print(encoded_images.to_tensor())
def tensorflow_decoder(images):
import tensorflow as tf
import numpy as np
return np.stack([tf.io.decode_png(img.as_py(), channels=3) for img in images.storage])
print(encoded_images.to_tensor(tensorflow_decoder))
TITLE: Add and Populate Columns with Python UDF in Lance
DESCRIPTION: Shows how to add and populate new columns in a Lance dataset using a custom Python function (UDF). The UDF processes data in batches, and the example includes using lance.batch_udf with checkpointing for robust, expensive computations.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/data_evolution.md#_snippet_2
LANGUAGE: python CODE:
import lance
import pyarrow as pa
import numpy as np
import pandas as pd
table = pa.table({"id": pa.array([1, 2, 3])})
dataset = lance.write_dataset(table, "ids")
@lance.batch_udf(checkpoint_file="embedding_checkpoint.sqlite")
def add_random_vector(batch):
embeddings = np.random.rand(batch.num_rows, 128).astype("float32")
return pd.DataFrame({"embedding": embeddings})
dataset.add_columns(add_random_vector)
TITLE: Construct OpenAI prompt with context
DESCRIPTION: Defines a function create_prompt that takes a query and contextual information to build a structured prompt for a large language model. It dynamically appends context, ensuring the total prompt length stays within a specified token limit for the LLM.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/youtube_transcript_search.ipynb#_snippet_10
LANGUAGE: python CODE:
def create_prompt(query, context):
limit = 3750
prompt_start = (
"Answer the question based on the context below.\n\n"+
"Context:\n"
)
prompt_end = (
f"\n\nQuestion: {query}\nAnswer:"
)
# append contexts until hitting limit
for i in range(1, len(context)):
if len("\n\n---\n\n".join(context.text[:i])) >= limit:
prompt = (
prompt_start +
"\n\n---\n\n".join(context.text[:i-1]) +
prompt_end
)
break
elif i == len(context)-1:
prompt = (
prompt_start +
"\n\n---\n\n".join(context.text) +
prompt_end
)
return prompt
TITLE: Set DYLD_LIBRARY_PATH for Lance Python Debugging in LLDB
DESCRIPTION: Configures the DYLD_LIBRARY_PATH environment variable specifically for debugging Lance Python projects within LLDB. This ensures that the dynamic linker can find necessary shared libraries located in the third-party distribution directory.
SOURCE: https://github.com/lancedb/lance/blob/main/wiki/Debug.md#_snippet_1
LANGUAGE: lldb CODE:
# /path/to/lance/python/.lldbinit
env DYLD_LIBRARY_PATH=/path/to/thirdparty/dist/lib:${DYLD_LIBRARY_PATH}
TITLE: Rename Top-Level Columns in LanceDB Dataset
DESCRIPTION: This snippet illustrates how to rename top-level columns in a LanceDB dataset using the lance.LanceDataset.alter_columns method. It shows a simple example of changing a column name and verifying the change by printing the dataset as a Pandas DataFrame.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/data_evolution.md#_snippet_5
LANGUAGE: python CODE:
table = pa.table({"id": pa.array([1, 2, 3])})
dataset = lance.write_dataset(table, "ids")
dataset.alter_columns({"path": "id", "name": "new_id"})
print(dataset.to_table().to_pandas())
# new_id
# 0 1
# 1 2
# 2 3
TITLE: Python: Encode FixedShapeImageTensorArray to EncodedImageArray
DESCRIPTION: This Python example shows how to convert a FixedShapeImageTensorArray back into an EncodedImageArray. It first obtains a tensor array by decoding an EncodedImageArray (which was read from URIs) and then calls the to_encoded() method. This process is useful for saving processed images back into a compressed format.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/arrays.md#_snippet_6
LANGUAGE: python CODE:
from lance.arrow import ImageURIArray
uris = [image_uri]
tensor_images = ImageURIArray.from_uris(uris).read_uris().to_tensor()
tensor_images.to_encoded()
TITLE: Initialize LLM Training Environment with GPT2 and Lance DESCRIPTION: This snippet imports essential libraries for LLM training, including Lance, PyTorch, and Hugging Face Transformers. It initializes the GPT2 tokenizer and model from pre-trained weights. Key hyperparameters such as learning rate, epochs, block size, batch size, device, and the Lance dataset path are defined, preparing the environment for subsequent data loading and model training.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/llm_training.md#_snippet_0
LANGUAGE: python CODE:
import numpy as np
import lance
import torch
from torch.utils.data import Dataset, DataLoader, Sampler
from transformers import AutoTokenizer, AutoModelForCausalLM
from tqdm.auto import tqdm
# We'll be training the pre-trained GPT2 model in this example
model_name = 'gpt2'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
# Also define some hyperparameters
lr = 3e-4
nb_epochs = 10
block_size = 1024
batch_size = 8
device = 'cuda:0'
dataset_path = 'wikitext_500K.lance'
TITLE: Define context window and stride parameters
DESCRIPTION: Initializes window and stride variables for creating rolling contextual windows from text data. These parameters define the size of each context (number of sentences) and the step size for generating subsequent contexts, respectively.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/youtube_transcript_search.ipynb#_snippet_3
LANGUAGE: python CODE:
import numpy as np
import pandas as pd
window = 20
stride = 4
TITLE: Append New Fragments to an Existing Lance Dataset
DESCRIPTION: This example illustrates how to append new data to an existing Lance dataset. It retrieves the current dataset version, uses lance.LanceOperation.Append with the collected fragments, and commits them, ensuring the read_version is correctly set to maintain data consistency during the append operation.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/distributed_write.md#_snippet_2
LANGUAGE: python CODE:
import lance
ds = lance.dataset(data_uri)
read_version = ds.version # record the read version
op = lance.LanceOperation.Append(schema, all_fragments)
lance.LanceDataset.commit(
data_uri,
op,
read_version=read_version,
)
TITLE: Extract Video Frames from Lance Blob Data in Python
DESCRIPTION: This Python example illustrates how to fetch and process large binary video data stored as blobs in a Lance dataset. It uses lance.dataset.LanceDataset.take_blobs to retrieve a BlobFile object, then leverages the av library to open the video and extract frames within a specified time range without loading the entire video into memory.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/blob.md#_snippet_1
LANGUAGE: python CODE:
import av # pip install av
import lance
ds = lance.dataset("./youtube.lance")
start_time, end_time = 500, 1000
blobs = ds.take_blobs([5], "video")
with av.open(blobs[0]) as container:
stream = container.streams.video[0]
stream.codec_context.skip_frame = "NONKEY"
start_time = start_time / stream.time_base
start_time = start_time.as_integer_ratio()[0]
end_time = end_time / stream.time_base
container.seek(start_time, stream=stream)
for frame in container.decode(stream):
if frame.time > end_time:
break
display(frame.to_image())
clear_output(wait=True)
TITLE: Perform Approximate Nearest Neighbor (ANN) Search in LanceDB DESCRIPTION: This Python snippet shows how to perform an Approximate Nearest Neighbor (ANN) search on a LanceDB dataset with an existing index. It queries a specified vector column for the 'k' nearest neighbors to a given query vector 'q', measuring the average query time. The result is converted to a Pandas DataFrame for display.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/start/quickstart.md#_snippet_17
LANGUAGE: python CODE:
sift1m = lance.dataset(uri)
import time
tot = 0
for q in samples:
start = time.time()
tbl = sift1m.to_table(nearest={"column": "vector", "q": q, "k": 10})
end = time.time()
tot += (end - start)
print(f"Avg(sec): {tot / len(samples)}")
print(tbl.to_pandas())
TITLE: Cast Column Data Types in LanceDB Dataset
DESCRIPTION: This snippet explains how to change the data type of a column in a LanceDB dataset using lance.LanceDataset.alter_columns. It notes that this operation rewrites only the affected column's data files and that any existing index on the column will be dropped. An example is provided for converting a float32 embedding column to float16 to save disk space.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/data_evolution.md#_snippet_7
LANGUAGE: python CODE:
table = pa.table({
"id": pa.array([1, 2, 3]),
"embedding": pa.FixedShapeTensorArray.from_numpy_ndarray(
np.random.rand(3, 128).astype("float32"))
})
dataset = lance.write_dataset(table, "embeddings")
dataset.alter_columns({"path": "embedding",
"data_type": pa.list_(pa.float16(), 128)})
print(dataset.schema)
# id: int64
# embedding: fixed_size_list<item: halffloat>[128]
# child 0, item: halffloat
TITLE: Call OpenAI Completion API for text generation
DESCRIPTION: Defines the complete function to interact with OpenAI's text-davinci-003 model. It sends a given prompt and retrieves the generated text completion, configuring parameters like temperature, max tokens, and presence/frequency penalties for desired output characteristics.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/youtube_transcript_search.ipynb#_snippet_11
LANGUAGE: python CODE:
def complete(prompt):
# query text-davinci-003
res = openai.Completion.create(
engine='text-davinci-003',
prompt=prompt,
temperature=0,
max_tokens=400,
top_p=1,
frequency_penalty=0,
presence_penalty=0,
stop=None
)
return res['choices'][0]['text'].strip()
# check that it works
query = "who was the 12th person on the moon and when did they land?"
complete(query)
TITLE: Build LanceDB Java Project with Maven DESCRIPTION: Provides the Maven command to clean and package the entire LanceDB Java project, including its dependencies and sub-modules. This command compiles the Java code and prepares it for deployment.
SOURCE: https://github.com/lancedb/lance/blob/main/java/README.md#_snippet_9
LANGUAGE: shell CODE:
mvn clean package
TITLE: Import IPython.display for multimedia output
DESCRIPTION: Imports the YouTubeVideo class from IPython.display. This class is essential for embedding and displaying YouTube videos directly within an IPython or Jupyter environment, allowing for rich multimedia output.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/youtube_transcript_search.ipynb#_snippet_13
LANGUAGE: python CODE:
from IPython.display import YouTubeVideo
TITLE: Initialize CLIP Model Instances, Tokenizer, and PyTorch Optimizer
DESCRIPTION: This snippet initializes instances of the ImageEncoder, TextEncoder, and Head modules, along with a Hugging Face AutoTokenizer. It then sets up a PyTorch Adam optimizer, explicitly defining separate learning rates for the image encoder, text encoder, and the combined head modules, preparing the model for training.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/clip_training.md#_snippet_7
LANGUAGE: python CODE:
# Define image encoder, image head, text encoder, text head and a tokenizer for tokenizing the caption
img_encoder = ImageEncoder(model_name=Config.img_encoder_model).to('cuda')
img_head = Head(Config.img_embed_dim, Config.projection_dim).to('cuda')
tokenizer = AutoTokenizer.from_pretrained(Config.text_encoder_model)
text_encoder = TextEncoder(model_name=Config.text_encoder_model).to('cuda')
text_head = Head(Config.text_embed_dim, Config.projection_dim).to('cuda')
# Since we are optimizing two different models together, we will define parameters manually
parameters = [
{"params": img_encoder.parameters(), "lr": Config.img_enc_lr},
{"params": text_encoder.parameters(), "lr": Config.text_enc_lr},
{
"params": itertools.chain(
img_head.parameters(),
text_head.parameters(),
),
"lr": Config.head_lr,
},
]
optimizer = torch.optim.Adam(parameters)
TITLE: Build vector index for LanceDB dataset DESCRIPTION: Creates an IVF_PQ (Inverted File Index with Product Quantization) index on the 'vector' column of the LanceDB dataset. This indexing significantly speeds up similarity search queries, making the retrieval of relevant contexts much faster.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/youtube_transcript_search.ipynb#_snippet_9
LANGUAGE: python CODE:
ds = ds.create_index("vector",
index_type="IVF_PQ",
num_partitions=64, # IVF
num_sub_vectors=96) # PQ
TITLE: Import necessary modules for Lance and PyTorch deep learning artifact management
DESCRIPTION: This snippet imports essential Python libraries required for deep learning artifact management using Lance. It includes os and shutil for file system operations, lance for data storage, pyarrow for schema definition, torch for PyTorch model handling, and collections.OrderedDict for managing model state dictionaries.
LANGUAGE: python CODE:
import os
import shutil
import lance
import pyarrow as pa
import torch
from collections import OrderedDict
TITLE: Download and extract MeCab Ipadic model DESCRIPTION: This snippet downloads the gzipped tarball of the MeCab Ipadic model from GitHub and then extracts its contents using tar. This is the first step in preparing the dictionary for building.
SOURCE: https://github.com/lancedb/lance/blob/main/python/python/tests/models/lindera/README.md#_snippet_0
LANGUAGE: bash CODE:
curl -L -o mecab-ipadic-2.7.0-20070801.tar.gz "https://github.com/lindera-morphology/mecab-ipadic/archive/refs/tags/2.7.0-20070801.tar.gz"
tar xvf mecab-ipadic-2.7.0-20070801.tar.gz
TITLE: Process Image Captions and Images for Lance Dataset in Python
DESCRIPTION: This Python function process takes a list of image captions, reads corresponding image files, converts them to binary, and yields PyArrow RecordBatches. Each batch contains image_id, binary image data, and a list of captions, preparing data for a Lance dataset. It handles FileNotFoundError for missing images and uses tqdm for progress indication.
LANGUAGE: python CODE:
def process(captions):
for img_id, img_captions in tqdm(captions):
try:
with open(os.path.join(image_folder, img_id), 'rb') as im:
binary_im = im.read()
except FileNotFoundError:
print(f"img_id '{img_id}' not found in the folder, skipping.")
continue
img_id = pa.array([img_id], type=pa.string())
img = pa.array([binary_im], type=pa.binary())
capt = pa.array([img_captions], pa.list_(pa.string(), -1))
yield pa.RecordBatch.from_arrays(
[img_id, img, capt],
["image_id", "image", "captions"]
)
TITLE: Create Empty Lance Dataset in Java
DESCRIPTION: This Java code demonstrates how to create a new, empty Lance dataset at a specified path. It defines the dataset's schema with 'id' (Int32) and 'name' (Utf8) fields, initializes a BufferAllocator, and uses Dataset.create to persist the schema. The snippet also shows how to access dataset version information immediately after creation.
SOURCE: https://github.com/lancedb/lance/blob/main/java/README.md#_snippet_1
LANGUAGE: Java CODE:
void createDataset() throws IOException, URISyntaxException {
String datasetPath = tempDir.resolve("write_stream").toString();
Schema schema =
new Schema(
Arrays.asList(
Field.nullable("id", new ArrowType.Int(32, true)),
Field.nullable("name", new ArrowType.Utf8())),
null);
try (BufferAllocator allocator = new RootAllocator();) {
Dataset.create(allocator, datasetPath, schema, new WriteParams.Builder().build());
try (Dataset dataset = Dataset.create(allocator, datasetPath, schema, new WriteParams.Builder().build());) {
dataset.version();
dataset.latestVersion();
}
}
}
TITLE: Generate contextual text windows from video transcripts
DESCRIPTION: Defines the contextualize function to create overlapping text contexts from video transcripts. It processes each video, combining sentences into windows based on window and stride parameters, and returns a new DataFrame with these generated contexts.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/youtube_transcript_search.ipynb#_snippet_4
LANGUAGE: python CODE:
def contextualize(raw_df, window, stride):
def process_video(vid):
# For each video, create the text rolling window
text = vid.text.values
time_end = vid["end"].values
contexts = vid.iloc[:-window:stride, :].copy()
contexts["text"] = [' '.join(text[start_i:start_i+window])
for start_i in range(0, len(vid)-window, stride)]
contexts["end"] = [time_end[start_i+window-1]
for start_i in range(0, len(vid)-window, stride)]
return contexts
# concat result from all videos
return pd.concat([process_video(vid) for _, vid in raw_df.groupby("title")])
df = contextualize(data.to_pandas(), 20, 4)
TITLE: Display answer and relevant YouTube video segment
DESCRIPTION: Executes the full Q&A pipeline: poses a query, retrieves the answer and relevant context, prints the generated answer, and then displays the most relevant YouTube video segment using YouTubeVideo at the precise timestamp where the context was found.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/youtube_transcript_search.ipynb#_snippet_14
LANGUAGE: python CODE:
query = ("Which training method should I use for sentence transformers "
"when I only have pairs of related sentences?")
completion, context = answer(query)
print(completion)
top_match = context.iloc[0]
YouTubeVideo(top_match["url"].split("/")[-1], start=top_match["start"])
TITLE: Create LanceDB dataset from embeddings and contexts DESCRIPTION: Converts the generated embeddings into a LanceDB vector table and combines it with the original contextualized DataFrame. This process creates a new LanceDB dataset named 'chatbot.lance' on disk, ready for efficient vector search operations.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/youtube_transcript_search.ipynb#_snippet_8
LANGUAGE: python CODE:
import lance
import pyarrow as pa
from lance.vector import vec_to_table
table = vec_to_table(np.array(embeds))
combined = pa.Table.from_pandas(df).append_column("vector", table["vector"])
ds = lance.write_dataset(combined, "chatbot.lance")
TITLE: Create LanceDB Index for GIST-1M Dataset
DESCRIPTION: Builds an index on the GIST-1M Lance dataset using index.py. The specified parameters for IVF partitions (-i) and PQ subvectors (-p) are crucial for optimizing query performance.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/sift/README.md#_snippet_11
LANGUAGE: sh CODE:
./index.py ./.lancedb/gist1m.lance -i 256 -p 120
TITLE: Generate Lance Dataset
DESCRIPTION: This command executes the datagen.py script to create the Lance dataset required for the Cohere wiki text embedding benchmark.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/wiki/README.md#_snippet_0
LANGUAGE: bash CODE:
python datagen.py
TITLE: Generate answer using vector search and LLM DESCRIPTION: Combines embedding generation, LanceDB vector search, and prompt creation to answer a question. It first embeds the query, then finds the most relevant contexts using vector similarity in LanceDB, and finally uses an LLM to formulate an answer based on those retrieved contexts.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/youtube_transcript_search.ipynb#_snippet_12
LANGUAGE: python CODE:
def answer(question):
emb = embed_func(question)[0]
context = ds.to_table(
nearest={
"column": "vector",
"k": 3,
"q": emb,
"nprobes": 20,
"refine_factor": 100
}).to_pandas()
prompt = create_prompt(question, context)
return complete(prompt), context.reset_index()
TITLE: Create LanceDB Index for SIFT-1M Dataset
DESCRIPTION: Builds an index on the SIFT-1M Lance dataset using index.py. The specified parameters for IVF partitions (-i) and PQ subvectors (-p) are crucial for optimizing query performance.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/sift/README.md#_snippet_6
LANGUAGE: sh CODE:
./index.py ./.lancedb/sift1m.lance -i 256 -p 16
TITLE: LanceDB Manifest Naming Schemes DESCRIPTION: Describes the V1 (legacy) and V2 (new) naming conventions for manifest files in LanceDB, emphasizing the V2 scheme's zero-padded, descending-sortable versioning for efficient latest manifest retrieval.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/format/format.md#_snippet_10
LANGUAGE: APIDOC CODE:
Manifest Naming Schemes:
V1: _versions/{version}.manifest
V2: _versions/{u64::MAX - version:020}.manifest
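A small Python sketch of the V2 mapping shows why the newest manifest sorts first lexicographically: the version is subtracted from u64::MAX and zero-padded to 20 digits.
LANGUAGE: python CODE:
U64_MAX = 2**64 - 1
def v2_manifest_name(version: int) -> str:
    # Zero-padded, descending-sortable file name
    return f"_versions/{U64_MAX - version:020d}.manifest"
print(v2_manifest_name(1))  # _versions/18446744073709551614.manifest
print(v2_manifest_name(2))  # _versions/18446744073709551613.manifest (sorts before version 1)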
TITLE: Initialize LanceDB Dataset and PyTorch DataLoader DESCRIPTION: This snippet demonstrates how to initialize a CLIPLanceDataset using a LanceDB file (flickr8k.lance) and then wrap it with a PyTorch DataLoader. It configures the dataset with tokenization and augmentations, and the dataloader for efficient batch processing during training.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/clip_training.md#_snippet_8
LANGUAGE: python CODE:
dataset = CLIPLanceDataset(
lance_path="flickr8k.lance",
max_len=Config.max_len,
tokenizer=tokenizer,
transforms=train_augments
)
dataloader = DataLoader(
dataset,
shuffle=False,
batch_size=Config.bs,
pin_memory=True
)
TITLE: Run GIST-1M Benchmark and Store Results
DESCRIPTION: Executes the benchmark for GIST-1M using metrics.py, querying the indexed dataset with specified parameters like number of results to fetch (-k) and query vectors (-q). The results, including mean query time and recall@1, are saved to a CSV file.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/sift/README.md#_snippet_12
LANGUAGE: sh CODE:
./metrics.py ./.lancedb/gist1m.lance results-gist.csv -i 256 -p 120 -q ./.lancedb/gist_query.lance -k 1
TITLE: Run SIFT-1M Benchmark and Store Results
DESCRIPTION: Executes the benchmark for SIFT-1M using metrics.py, querying the indexed dataset with specified parameters like number of results to fetch (-k) and query vectors (-q). The results, including mean query time and recall@1, are saved to a CSV file.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/sift/README.md#_snippet_7
LANGUAGE: sh CODE:
./metrics.py ./.lancedb/sift1m.lance results-sift.csv -i 256 -p 16 -q ./.lancedb/sift_query.lance -k 1
TITLE: Object Store General Configuration Options DESCRIPTION: Details configuration parameters applicable to all object stores, including network, security, and retry settings. These options control connection behavior, certificate validation, timeouts, user agents, proxy usage, and client-side retry logic.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/object_store.md#_snippet_2
LANGUAGE: APIDOC CODE:
Key: allow_http
Description: Allow non-TLS, i.e. non-HTTPS connections. Default, False.
Key: download_retry_count
Description: Number of times to retry a download. Default, 3. This limit is applied when the HTTP request succeeds but the response is not fully downloaded, typically due to a violation of request_timeout.
Key: allow_invalid_certificates
Description: Skip certificate validation on https connections. Default, False. Warning: This is insecure and should only be used for testing.
Key: connect_timeout
Description: Timeout for only the connect phase of a Client. Default, 5s.
Key: request_timeout
Description: Timeout for the entire request, from connection until the response body has finished. Default, 30s.
Key: user_agent
Description: User agent string to use in requests.
Key: proxy_url
Description: URL of a proxy server to use for requests. Default, None.
Key: proxy_ca_certificate
Description: PEM-formatted CA certificate for proxy connections
Key: proxy_excludes
Description: List of hosts that bypass proxy. This is a comma separated list of domains and IP masks. Any subdomain of the provided domain will be bypassed. For example, example.com, 192.168.1.0/24 would bypass https://api.example.com, https://www.example.com, and any IP in the range 192.168.1.0/24.
Key: client_max_retries
Description: Number of times for a s3 client to retry the request. Default, 10.
Key: client_retry_timeout
Description: Timeout for a s3 client to retry the request in seconds. Default, 180.
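These keys can be supplied per call through the storage_options argument; a minimal sketch (the URI and values are placeholders) looks like this.
LANGUAGE: python CODE:
import lance
ds = lance.dataset(
    "s3://my-bucket/my-dataset.lance",  # placeholder URI
    storage_options={
        "connect_timeout": "10s",
        "request_timeout": "60s",
        "client_max_retries": "5",
    },
)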
TITLE: Import necessary libraries for CLIP model training DESCRIPTION: This snippet imports essential Python libraries like cv2, lance, numpy, torch, timm, and transformers, which are required for building and training a multi-modal CLIP model. It also includes utility libraries such as itertools and tqdm, and a warning filter.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/clip_training.md#_snippet_0
LANGUAGE: python CODE:
import cv2
import lance
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
import timm
from transformers import AutoModel, AutoTokenizer
import itertools
from tqdm import tqdm
import warnings
warnings.simplefilter('ignore')
TITLE: Build user dictionary with Lindera DESCRIPTION: This command demonstrates how to build a custom user dictionary using 'lindera build'. It takes a CSV file as input and creates a new user dictionary, which can be used to extend the base language model.
SOURCE: https://github.com/lancedb/lance/blob/main/python/python/tests/models/lindera/README.md#_snippet_2
LANGUAGE: bash CODE:
lindera build --build-user-dictionary --dictionary-kind=ipadic user_dict/userdict.csv user_dict2
TITLE: Google Cloud Storage Configuration Keys
DESCRIPTION: Reference for configuration keys available for Google Cloud Storage when used with LanceDB. These keys can be set as environment variables or within the `storage_options` parameter.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/object_store.md#_snippet_8
LANGUAGE: APIDOC CODE:
Key / Environment Variable | Description
--------------------------|------------
google_service_account / service_account | Path to the service account JSON file.
google_service_account_key / service_account_key | The serialized service account key.
google_application_credentials / application_credentials | Path to the application credentials.
TITLE: Load YouTube transcription dataset DESCRIPTION: Downloads and loads the 'jamescalam/youtube-transcriptions' dataset from Hugging Face datasets. The 'train' split is specified to retrieve the main training portion of the data.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/youtube_transcript_search.ipynb#_snippet_1
LANGUAGE: python CODE:
from datasets import load_dataset
data = load_dataset('jamescalam/youtube-transcriptions', split='train')
data
TITLE: Index Lance Data for Benchmarking
DESCRIPTION: This command runs the `index.py` script to build an index on the generated Lance dataset. It configures the index with an L2 metric, 2048 partitions, and 96 sub-vectors for optimized benchmarking.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/wiki/README.md#_snippet_1
LANGUAGE: bash CODE:
python index.py --metric L2 --num-partitions 2048 --num-sub-vectors 96
TITLE: Jieba User Dictionary Configuration File (config.json)
DESCRIPTION: JSON configuration for Jieba user dictionaries. This file, named `config.json`, specifies an optional 'main' dictionary and an array of paths to additional 'users' dictionary files. It should be placed in the model's root directory.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/tokenizer.md#_snippet_3
LANGUAGE: json CODE:
{
"main": "dict.txt",
"users": ["path/to/user/dict.txt"]
}
TITLE: Batch and generate embeddings using OpenAI API
DESCRIPTION: Configures the OpenAI API key and defines a `to_batches` helper function for processing data in chunks. It then uses this to generate embeddings for the contextualized text in batches, improving efficiency and adhering to API best practices by reducing individual API calls.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/youtube_transcript_search.ipynb#_snippet_7
LANGUAGE: python CODE:
from tqdm.auto import tqdm
import math
openai.api_key = "sk-..."
# We request in batches rather than 1 embedding at a time
def to_batches(arr, batch_size):
length = len(arr)
def _chunker(arr):
for start_i in range(0, len(df), batch_size):
yield arr[start_i:start_i+batch_size]
# add progress meter
yield from tqdm(_chunker(arr), total=math.ceil(length / batch_size))
batch_size = 1000
batches = to_batches(df.text.values.tolist(), batch_size)
embeds = [emb for c in batches for emb in rate_limited(c)]
TITLE: Download Jieba Language Model DESCRIPTION: Command-line instruction to download the Jieba language model for use with LanceDB. The model will be automatically stored in the default Jieba model directory within the configured language model home.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/tokenizer.md#_snippet_1
LANGUAGE: bash CODE:
python -m lance.download jieba
TITLE: Read and Inspect Lance Dataset in Rust
DESCRIPTION: This Rust function `read_dataset` shows how to open an existing Lance dataset from a given path. It uses a `scanner` to create a `batch_stream` and then iterates through each `RecordBatch`, printing its number of rows, columns, schema, and the entire batch content.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/rust/write_read_dataset.md#_snippet_1
LANGUAGE: Rust CODE:
// Reads dataset from the given path and prints batch size, schema for all record batches. Also extracts and prints a slice from the first batch
async fn read_dataset(data_path: &str) {
let dataset = Dataset::open(data_path).await.unwrap();
let scanner = dataset.scan();
let mut batch_stream = scanner.try_into_stream().await.unwrap().map(|b| b.unwrap());
while let Some(batch) = batch_stream.next().await {
println!("Batch size: {}, {}", batch.num_rows(), batch.num_columns()); // print size of batch
println!("Schema: {:?}", batch.schema()); // print schema of recordbatch
println!("Batch: {:?}", batch); // print the entire recordbatch (schema and data)
}
} // End read dataset
TITLE: Define configuration class for CLIP model hyperparameters
DESCRIPTION: This Python class, `Config`, centralizes all hyperparameters necessary for training the CLIP model. It includes image and text dimensions, learning rates for different components, batch size, maximum sequence length, projection dimensions, temperature, number of epochs, and the names of the image and text encoder models.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/clip_training.md#_snippet_1
LANGUAGE: python CODE:
class Config:
img_size = (128, 128)
bs = 32
head_lr = 1e-3
img_enc_lr = 1e-4
text_enc_lr = 1e-5
max_len = 18
img_embed_dim = 2048
text_embed_dim = 768
projection_dim = 256
temperature = 1.0
num_epochs = 2
img_encoder_model = 'resnet50'
text_encoder_model = 'bert-base-cased'
TITLE: LanceDB External Manifest Store Reader Operations DESCRIPTION: Explains the reader's load process when an external manifest store is in use, including retrieving the manifest path, reattempting synchronization if needed, and ensuring the dataset remains portable.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/format/format.md#_snippet_13
LANGUAGE: APIDOC CODE:
External Store Reader Load Process:
1. GET_EXTERNAL_STORE base_uri, version, path
- Action: Retrieve manifest path from external store.
- Condition: If path does not end in UUID, return path.
2. COPY_OBJECT_STORE mydataset.lance/_versions/{version}.manifest-{uuid} mydataset.lance/_versions/{version}.manifest
- Action: Reattempt synchronization (copy staged to final).
3. PUT_EXTERNAL_STORE base_uri, version, mydataset.lance/_versions/{version}.manifest
- Action: Update external store to point to final manifest.
4. RETURN mydataset.lance/_versions/{version}.manifest
- Action: Always return the finalized path.
- Error: Return error if synchronization fails.
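TITLE: Illustrative Sketch of the External Store Reader Load Path (pseudocode)
DESCRIPTION: A hedged, non-normative sketch of the reader load process described above, written against a hypothetical key-value store client (`external_store`), object store client (`object_store`), and helper `ends_with_uuid`; none of these are real Lance APIs. It only mirrors the four steps listed in the reference.
LANGUAGE: python CODE:
def resolve_manifest(external_store, object_store, base_uri, version):
    # 1. Ask the external store where the manifest for this version lives.
    path = external_store.get(base_uri, version)
    if not ends_with_uuid(path):  # hypothetical check: path is already the finalized manifest
        return path
    # 2. Reattempt synchronization: copy the staged manifest to its final path.
    final_path = f"{base_uri}/_versions/{version}.manifest"
    object_store.copy(path, final_path)
    # 3. Point the external store at the finalized manifest.
    external_store.put(base_uri, version, final_path)
    # 4. Always return the finalized path; an error is surfaced if the copy failed.
    return final_path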
TITLE: Generate Text-to-Image 10M Dataset in Lance Format DESCRIPTION: This snippet demonstrates how to create the 'text2image-10m' dataset in Lance format using scripts from the 'big-ann-benchmarks' repository. Upon execution, it generates two Lance datasets: a base dataset and a corresponding queries/ground truth dataset, essential for benchmarking.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/bigann/README.md#_snippet_1
LANGUAGE: bash CODE:
python ./big-ann-benchmarks/create_dataset.py --dataset yfcc-10M
./dataset.py -t text2image-10m data/text2image1B
TITLE: Run Flat Index Search Benchmark
DESCRIPTION: Executes the benchmark script to measure performance of flat index search. This command generates `benchmark.csv` for raw data and `benchmark.html` for latency plots.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/flat/README.md#_snippet_0
LANGUAGE: Shell CODE:
./benchmark.py
TITLE: PyTorch Model Training Loop with LanceDB DataLoader
DESCRIPTION: This snippet illustrates a complete PyTorch training loop. It begins by defining a `LanceDataset` and `LanceSampler` to efficiently load data, then sets up a `DataLoader`. The code proceeds to initialize a PyTorch model and an AdamW optimizer. The core of the snippet is the epoch-based training loop, which includes iterating through batches, performing forward and backward passes, calculating loss, updating model parameters, and reporting training perplexity.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/llm_training.md#_snippet_4
LANGUAGE: python CODE:
dataset = LanceDataset(dataset_path, block_size)
sampler = LanceSampler(dataset, block_size)
dataloader = DataLoader(
dataset,
shuffle=False,
batch_size=batch_size,
sampler=sampler,
pin_memory=True
)
# Define the optimizer, training loop and train the model!
model = model.to(device)
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
for epoch in range(nb_epochs):
print(f"========= Epoch: {epoch+1} / {nb_epochs} ========")
epoch_loss = []
prog_bar = tqdm(dataloader, total=len(dataloader))
for batch in prog_bar:
optimizer.zero_grad(set_to_none=True)
# Put both input_ids and labels to the device
for k, v in batch.items():
batch[k] = v.to(device)
# Perform one forward pass and get the loss
outputs = model(**batch)
loss = outputs.loss
# Perform backward pass
loss.backward()
optimizer.step()
prog_bar.set_description(f"loss: {loss.item():.4f}")
epoch_loss.append(loss.item())
# Calculate training perplexity for this epoch
try:
perplexity = np.exp(np.mean(epoch_loss))
except OverflowError:
perplexity = float("-inf")
print(f"train_perplexity: {perplexity}")
TITLE: Create PyArrow RecordBatchReader from Processed Samples (Python) DESCRIPTION: This code creates a PyArrow RecordBatchReader, which acts as an iterator over the data generated by the 'process_samples' function. It uses the defined schema to ensure data consistency and prepares the stream of record batches for writing to a Lance dataset.
LANGUAGE: python CODE:
reader = pa.RecordBatchReader.from_batches(
schema,
process_samples(dataset, num_samples=500_000, field='text') # For 500K samples
)
TITLE: Download and Extract GIST-1M Dataset DESCRIPTION: Downloads the GIST-1M dataset archive from the specified FTP server and extracts its contents. This is a prerequisite for generating Lance datasets for GIST-1M.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/sift/README.md#_snippet_8
LANGUAGE: sh CODE:
wget ftp://ftp.irisa.fr/local/texmex/corpus/gist.tar.gz
tar -xzf gist.tar.gz
TITLE: Create a Lance Dataset from Arrow RecordBatches in Rust DESCRIPTION: Demonstrates how to write a collection of Arrow RecordBatches and an Arrow Schema into a new Lance Dataset. It uses default write parameters and an iterator for the batches.
SOURCE: https://github.com/lancedb/lance/blob/main/rust/lance/README.md#_snippet_1
LANGUAGE: rust CODE:
use lance::{dataset::WriteParams, Dataset};
let write_params = WriteParams::default();
let mut reader = RecordBatchIterator::new(
batches.into_iter().map(Ok),
schema
);
Dataset::write(reader, &uri, Some(write_params)).await.unwrap();
TITLE: Create TensorFlow Dataset from Lance URI
DESCRIPTION: This snippet demonstrates how to initialize a `tf.data.Dataset` directly from a Lance dataset URI using `lance.tf.data.from_lance`. It also shows how to chain standard TensorFlow dataset operations like shuffling and mapping for data preprocessing.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/integrations/tensorflow.md#_snippet_0
LANGUAGE: python CODE:
import tensorflow as tf
import lance
# Create tf dataset
ds = lance.tf.data.from_lance("s3://my-bucket/my-dataset")
# Chain tf dataset with other tf primitives
for batch in ds.shuffle(32).map(lambda x: tf.io.decode_png(x["image"])):
print(batch)
TITLE: Write PyArrow Record Batches to Lance Dataset in Python
DESCRIPTION: This Python code demonstrates how to write PyArrow Record Batches to a Lance dataset. It creates a `RecordBatchReader` from the defined schema and the output of the `process` function, then uses `lance.write_dataset` to efficiently persist the data to a file named 'flickr8k.lance' on disk.
LANGUAGE: python CODE:
reader = pa.RecordBatchReader.from_batches(schema, process(captions))
lance.write_dataset(reader, "flickr8k.lance", schema)
TITLE: Implement PyTorch CLIP Model Training Loop DESCRIPTION: This code defines the core training loop for a CLIP model. It sets all model components to training mode, iterates through epochs and batches from the DataLoader, performs forward and backward passes, calculates loss, and updates model weights using an optimizer. A progress bar provides real-time feedback on the training process.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/clip_training.md#_snippet_9
LANGUAGE: python CODE:
img_encoder.train()
img_head.train()
text_encoder.train()
text_head.train()
for epoch in range(Config.num_epochs):
print(f"{'='*20} Epoch: {epoch+1} / {Config.num_epochs} {'='*20}")
prog_bar = tqdm(dataloader)
for img, caption in prog_bar:
optimizer.zero_grad(set_to_none=True);
img_embed, text_embed = forward(img, caption)
loss = loss_fn(img_embed, text_embed, temperature=Config.temperature).mean()
loss.backward()
optimizer.step()
prog_bar.set_description(f"loss: {loss.item():.4f}")
print()
TITLE: Build Ipadic language model with Lindera DESCRIPTION: This command uses the 'lindera build' tool to compile the Ipadic dictionary. It specifies the dictionary kind as 'ipadic' and points to the extracted model directory to create the main dictionary.
SOURCE: https://github.com/lancedb/lance/blob/main/python/python/tests/models/lindera/README.md#_snippet_1
LANGUAGE: bash CODE:
lindera build --dictionary-kind=ipadic mecab-ipadic-2.7.0-20070801 main
TITLE: Write Lance Dataset in Rust
DESCRIPTION: This Rust function `write_dataset` demonstrates how to create and write a Lance dataset to a specified path. It defines a schema with `UInt32` fields, creates a `RecordBatch` with sample data, and uses `WriteParams` to set the write mode to `Overwrite` before writing the dataset to disk.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/rust/write_read_dataset.md#_snippet_0
LANGUAGE: Rust CODE:
// Writes sample dataset to the given path
async fn write_dataset(data_path: &str) {
// Define new schema
let schema = Arc::new(Schema::new(vec![
Field::new("key", DataType::UInt32, false),
Field::new("value", DataType::UInt32, false),
]));
// Create new record batches
let batch = RecordBatch::try_new(
schema.clone(),
vec![
Arc::new(UInt32Array::from(vec![1, 2, 3, 4, 5, 6])),
Arc::new(UInt32Array::from(vec![6, 7, 8, 9, 10, 11])),
],
)
.unwrap();
let batches = RecordBatchIterator::new([Ok(batch)], schema.clone());
// Define write parameters (e.g. overwrite dataset)
let write_params = WriteParams {
mode: WriteMode::Overwrite,
..Default::default()
};
Dataset::write(batches, data_path, Some(write_params))
.await
.unwrap();
} // End write dataset
TITLE: Download and Extract SIFT-1M Dataset DESCRIPTION: Downloads the SIFT-1M dataset archive from the specified FTP server and extracts its contents. This is a prerequisite for generating Lance datasets for SIFT-1M.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/sift/README.md#_snippet_1
LANGUAGE: sh CODE:
wget ftp://ftp.irisa.fr/local/texmex/corpus/sift.tar.gz
tar -xzf sift.tar.gz
TITLE: Query Lance Dataset with DuckDB DESCRIPTION: Demonstrates querying a Lance dataset directly using DuckDB. It highlights the integration with DuckDB for SQL-based data exploration and retrieval, enabling powerful analytical queries.
SOURCE: https://github.com/lancedb/lance/blob/main/README.md#_snippet_5
LANGUAGE: python CODE:
import duckdb
# If this segfaults, make sure you have duckdb v0.7+ installed
duckdb.query("SELECT * FROM dataset LIMIT 10").to_df()
TITLE: Build LanceDB Rust JNI Module DESCRIPTION: Specifies the command to build only the Rust-based JNI (Java Native Interface) module of LanceDB. This is useful for developers focusing on the native components without rebuilding the entire Java project.
SOURCE: https://github.com/lancedb/lance/blob/main/java/README.md#_snippet_10
LANGUAGE: shell CODE:
cargo build
TITLE: Initialize Lance Dataset from Local Path DESCRIPTION: This Python snippet demonstrates how to initialize a Lance dataset object from a local file path. It sets up the dataset for subsequent read operations, enabling access to the data stored in the specified Lance file.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/read_and_write.md#_snippet_12
LANGUAGE: python CODE:
ds = lance.dataset("./imagenet.lance")
TITLE: Implement custom PyTorch Dataset for Lance-based CLIP training
DESCRIPTION: This `CLIPLanceDataset` class extends PyTorch's `Dataset` to handle Lance datasets for CLIP model training. It initializes with a Lance dataset path, an optional tokenizer, and image transformations, providing methods to retrieve pre-processed images and tokenized captions for use in a DataLoader.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/clip_training.md#_snippet_3
LANGUAGE: python CODE:
class CLIPLanceDataset(Dataset):
"""Custom Dataset to load images and their corresponding captions"""
def __init__(self, lance_path, max_len=18, tokenizer=None, transforms=None):
self.ds = lance.dataset(lance_path)
self.max_len = max_len
# Init a new tokenizer if not specified already
self.tokenizer = AutoTokenizer.from_pretrained('bert-base-cased') if not tokenizer else tokenizer
self.transforms = transforms
def __len__(self):
return self.ds.count_rows()
def __getitem__(self, idx):
# Load the image and caption
img = load_image(self.ds, idx)
caption = load_caption(self.ds, idx)
# Apply transformations to the images
if self.transforms:
img = self.transforms(img)
# Tokenize the caption
caption = self.tokenizer(
caption,
truncation=True,
padding='max_length',
max_length=self.max_len,
return_tensors='pt'
)
# Flatten each component of tokenized caption otherwise they will cause size mismatch errors during training
caption = {k: v.flatten() for k, v in caption.items()}
return img, caption
TITLE: Azure Blob Storage Configuration Keys
DESCRIPTION: Reference for configuration keys available for Azure Blob Storage when used with LanceDB. These keys can be set as environment variables or within the `storage_options` parameter.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/object_store.md#_snippet_10
LANGUAGE: APIDOC CODE:
Key / Environment Variable | Description
--------------------------|------------
azure_storage_account_name / account_name | The name of the azure storage account.
azure_storage_account_key / account_key | The serialized service account key.
azure_client_id / client_id | Service principal client id for authorizing requests.
azure_client_secret / client_secret | Service principal client secret for authorizing requests.
azure_tenant_id / tenant_id | Tenant id used in oauth flows.
azure_storage_sas_key / azure_storage_sas_token / sas_key / sas_token | Shared access signature. The signature is expected to be percent-encoded, much like they are provided in the azure storage explorer or azure portal.
azure_storage_token / bearer_token / token | Bearer token.
azure_storage_use_emulator / object_store_use_emulator / use_emulator | Use object store with azurite storage emulator.
azure_endpoint / endpoint | Override the endpoint used to communicate with blob storage.
azure_use_fabric_endpoint / use_fabric_endpoint | Use object store with url scheme account.dfs.fabric.microsoft.com.
azure_msi_endpoint / azure_identity_endpoint / identity_endpoint / msi_endpoint | Endpoint to request a imds managed identity token.
azure_object_id / object_id | Object id for use with managed identity authentication.
azure_msi_resource_id / msi_resource_id | Msi resource id for use with managed identity authentication.
azure_federated_token_file / federated_token_file | File containing token for Azure AD workload identity federation.
azure_use_azure_cli / use_azure_cli | Use azure cli for acquiring access token.
azure_disable_tagging / disable_tagging | Disables tagging objects. This can be desirable if not supported by the backing store.
TITLE: Define Function to Process and Tokenize Samples for Lance (Python) DESCRIPTION: This function iterates over a dataset, tokenizes individual samples using the 'tokenize' function, and yields PyArrow RecordBatches. It processes a specified number of samples, skipping empty ones, and is designed to efficiently prepare data for writing to a Lance dataset.
LANGUAGE: python CODE:
def process_samples(dataset, num_samples=100_000, field='text'):
current_sample = 0
for sample in tqdm(dataset, total=num_samples):
# If we have added all 5M samples, stop
if current_sample == num_samples:
break
if not sample[field]:
continue
# Tokenize the current sample
tokenized_sample = tokenize(sample, field)
# Increment the counter
current_sample += 1
# Yield a PyArrow RecordBatch
yield pa.RecordBatch.from_arrays(
[tokenized_sample],
names=["input_ids"]
)
TITLE: Read a Lance Dataset and Collect RecordBatches in Rust DESCRIPTION: Opens an existing Lance Dataset from a specified path, scans its content, and collects all resulting RecordBatches into a vector. Error handling is included.
SOURCE: https://github.com/lancedb/lance/blob/main/rust/lance/README.md#_snippet_2
LANGUAGE: rust CODE:
let dataset = Dataset::open(path).await.unwrap();
let mut scanner = dataset.scan();
let batches: Vec<RecordBatch> = scanner
.try_into_stream()
.await
.unwrap()
.map(|b| b.unwrap())
.collect::<Vec<RecordBatch>>()
.await;
TITLE: Visualize Latency vs. NProbes with IVF and PQ DESCRIPTION: This snippet generates a scatter plot using seaborn to visualize the relationship between 'nprobes' and '50%' (median response time). It uses 'ivf' for color encoding and 'pq' for marker style, allowing for a multi-dimensional analysis of performance.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/sift/Results.ipynb#_snippet_7
LANGUAGE: python CODE:
sns.scatterplot(data=df, x="nprobes", y="50%", hue="ivf", style="pq")
TITLE: Write HuggingFace Dataset to Lance Format
DESCRIPTION: This Python code snippet demonstrates how to load a HuggingFace dataset and write it into the Lance format. It uses the `datasets` library to load a specific split of a dataset and then `lance.write_dataset` to save it as a Lance file. Dependencies include `datasets` and `lance`.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/integrations/huggingface.md#_snippet_0
LANGUAGE: python CODE:
import datasets # pip install datasets
import lance
lance.write_dataset(datasets.load_dataset(
"poloclub/diffusiondb", split="train[:10]"
), "diffusiondb_train.lance")
TITLE: Describe Median Latency by PQ Configuration DESCRIPTION: This snippet groups the DataFrame by the 'pq' column and calculates descriptive statistics for the '50%' (median response time) column. This provides insights into latency performance based on different Product Quantization (PQ) configurations.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/sift/Results.ipynb#_snippet_4
LANGUAGE: python CODE:
df.groupby("pq")["50%"].describe()
TITLE: Check number of generated contexts
DESCRIPTION: Prints the total number of contextualized entries created after processing the dataset. This helps verify the output of the `contextualize` function and understand the volume of data prepared for embedding.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/youtube_transcript_search.ipynb#_snippet_5
LANGUAGE: python CODE:
len(df)
TITLE: Convert HuggingFace Dataset to LanceDB
DESCRIPTION: This snippet demonstrates how to load a dataset from HuggingFace and convert it into a Lance dataset using `lance.write_dataset`. This is a foundational step for preparing data to be used with LanceDB's PyTorch integration.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/integrations/pytorch.md#_snippet_0
LANGUAGE: python CODE:
import datasets # pip install datasets
import lance
hf_ds = datasets.load_dataset(
"poloclub/diffusiondb",
split="train",
# name="2m_first_1k", # for a smaller subset of the dataset
)
lance.write_dataset(hf_ds, "diffusiondb_train.lance")
TITLE: Build IVF_PQ Vector Index on Lance Dataset DESCRIPTION: Creates an IVF_PQ (Inverted File Index with Product Quantization) index on the 'vector' column of the Lance dataset. This index significantly speeds up nearest neighbor searches by efficiently partitioning and quantizing the vector space.
SOURCE: https://github.com/lancedb/lance/blob/main/README.md#_snippet_8
LANGUAGE: python CODE:
sift1m.create_index("vector",
index_type="IVF_PQ",
num_partitions=256, # IVF
num_sub_vectors=16) # PQ
TITLE: LanceDB S3 Storage Options Reference
DESCRIPTION: Reference for available keys in the `storage_options` parameter for S3 and S3-compatible storage configurations in LanceDB. These options can be set via environment variables or directly in the `storage_options` dictionary, controlling aspects like region, endpoint, and encryption.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/object_store.md#_snippet_4
LANGUAGE: APIDOC CODE:
S3 Storage Options:
- aws_region / region: The AWS region the bucket is in. This can be automatically detected when using AWS S3, but must be specified for S3-compatible stores.
- aws_access_key_id / access_key_id: The AWS access key ID to use.
- aws_secret_access_key / secret_access_key: The AWS secret access key to use.
- aws_session_token / session_token: The AWS session token to use.
- aws_endpoint / endpoint: The endpoint to use for S3-compatible stores.
- aws_virtual_hosted_style_request / virtual_hosted_style_request: Whether to use virtual hosted-style requests, where bucket name is part of the endpoint. Meant to be used with `aws_endpoint`. Default, `False`.
- aws_s3_express / s3_express: Whether to use S3 Express One Zone endpoints. Default, `False`. See more details below.
- aws_server_side_encryption: The server-side encryption algorithm to use. Must be one of `"AES256"`, `"aws:kms"`, or `"aws:kms:dsse"`. Default, `None`.
- aws_sse_kms_key_id: The KMS key ID to use for server-side encryption. If set, `aws_server_side_encryption` must be `"aws:kms"` or `"aws:kms:dsse"`.
- aws_sse_bucket_key_enabled: Whether to use bucket keys for server-side encryption.
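TITLE: Example: S3 Credentials and Server-Side Encryption via storage_options (illustrative)
DESCRIPTION: A minimal sketch, not from the Lance docs, showing how a few of the S3 keys above could be combined in a single `storage_options` dictionary. The bucket, credentials, and KMS key ID are placeholders.
LANGUAGE: python CODE:
import lance

ds = lance.dataset(
    "s3://example-bucket/my-dataset.lance",  # placeholder URI
    storage_options={
        "aws_region": "us-west-2",
        "aws_access_key_id": "AKIA...",        # placeholder credential
        "aws_secret_access_key": "...",        # placeholder credential
        "aws_server_side_encryption": "aws:kms",
        "aws_sse_kms_key_id": "alias/my-key",  # placeholder KMS key
    },
)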
TITLE: Define OpenAI embedding function with rate limiting and retry
DESCRIPTION: Sets up an embedding function using OpenAI's `text-embedding-ada-002` model. It incorporates `ratelimiter` to respect API rate limits and `retry` for robust API calls, ensuring successful embedding generation even with transient network issues.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/youtube_transcript_search.ipynb#_snippet_6
LANGUAGE: python CODE:
import functools
import openai
import ratelimiter
from retry import retry
embed_model = "text-embedding-ada-002"
# API limit at 60/min == 1/sec
limiter = ratelimiter.RateLimiter(max_calls=0.9, period=1.0)
# Get the embedding with retry
@retry(tries=10, delay=1, max_delay=30, backoff=3, jitter=1)
def embed_func(c):
rs = openai.Embedding.create(input=c, engine=embed_model)
return [record["embedding"] for record in rs["data"]]
rate_limited = limiter(embed_func)
TITLE: Add Lance SDK Java Maven Dependency
DESCRIPTION: This snippet provides the Maven XML configuration required to include the LanceDB Java SDK as a dependency in your project. It specifies the `groupId`, `artifactId`, and `version` for the `lance-core` library, enabling access to LanceDB functionalities.
SOURCE: https://github.com/lancedb/lance/blob/main/java/README.md#_snippet_0
LANGUAGE: xml CODE:
<dependency>
<groupId>com.lancedb</groupId>
<artifactId>lance-core</artifactId>
<version>0.18.0</version>
</dependency>
TITLE: Define PyArrow Schema for Lance Dataset (Python) DESCRIPTION: This snippet defines a PyArrow schema required by Lance to understand the structure of the data being written. It specifies that the dataset will contain a single field named 'input_ids', which will store tokenized data as 64-bit integers.
LANGUAGE: python CODE:
schema = pa.schema([
pa.field("input_ids", pa.int64())
])
TITLE: Add Columns to LanceDB Dataset in Java DESCRIPTION: Demonstrates how to add new columns to a LanceDB dataset. This can be done either by providing SQL expressions to derive new column values or by defining a new Arrow Schema for the dataset, allowing for flexible schema evolution.
SOURCE: https://github.com/lancedb/lance/blob/main/java/README.md#_snippet_6
LANGUAGE: java CODE:
void addColumnsByExpressions() {
String datasetPath = ""; // specify a path point to a dataset
try (BufferAllocator allocator = new RootAllocator()) {
try (Dataset dataset = Dataset.open(datasetPath, allocator)) {
SqlExpressions sqlExpressions = new SqlExpressions.Builder().withExpression("double_id", "id * 2").build();
dataset.addColumns(sqlExpressions, Optional.empty());
}
}
}
LANGUAGE: java CODE:
void addColumnsBySchema() {
String datasetPath = ""; // specify a path point to a dataset
try (BufferAllocator allocator = new RootAllocator()) {
try (Dataset dataset = Dataset.open(datasetPath, allocator)) {
dataset.addColumns(new Schema(
Arrays.asList(
Field.nullable("id", new ArrowType.Int(32, true)),
Field.nullable("name", new ArrowType.Utf8()),
Field.nullable("age", new ArrowType.Int(32, true)))), Optional.empty());
}
}
}
TITLE: Write Processed Data to Lance Dataset (Python) DESCRIPTION: This final step uses the 'lance.write_dataset' function to persist the processed and tokenized data to disk as a Lance dataset. It takes the RecordBatchReader, the desired output file path, and the defined schema as arguments, completing the dataset creation process.
LANGUAGE: python CODE:
# Write the dataset to disk
lance.write_dataset(
reader,
"wikitext_500K.lance",
schema
)
TITLE: Create a Vector Index on a Lance Dataset in Rust DESCRIPTION: Demonstrates how to create a vector index on a specified column (e.g., 'embeddings') within a Lance Dataset. It configures vector index parameters like the number of partitions and sub-vectors, noting potential alignment requirements.
SOURCE: https://github.com/lancedb/lance/blob/main/rust/lance/README.md#_snippet_4
LANGUAGE: rust CODE:
use ::lance::index::vector::VectorIndexParams;
let mut params = VectorIndexParams::default();
params.num_partitions = 256;
params.num_sub_vectors = 16;
// this will Err if list_size(embeddings) / num_sub_vectors does not meet simd alignment
dataset.create_index(&["embeddings"], IndexType::Vector, None, &params, true).await;
TITLE: Load Query Data with Pandas DESCRIPTION: This snippet imports the pandas library and loads query performance data from a CSV file named 'query.csv' into a DataFrame. This DataFrame will be used for subsequent analysis and visualization.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/sift/Results.ipynb#_snippet_0
LANGUAGE: python CODE:
import pandas as pd
df = pd.read_csv("query.csv")
TITLE: Query Lance Dataset with Pandas DESCRIPTION: Illustrates how to convert a Lance dataset to a PyArrow Table and then to a Pandas DataFrame for easy data manipulation and analysis using familiar Pandas operations.
SOURCE: https://github.com/lancedb/lance/blob/main/README.md#_snippet_4
LANGUAGE: python CODE:
df = dataset.to_table().to_pandas()
df
TITLE: Lance Manifest Protobuf Message Reference DESCRIPTION: References the Protobuf message definition for the Manifest file, which encapsulates the metadata for a specific version of a Lance dataset.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/format/format.md#_snippet_1
LANGUAGE: APIDOC CODE:
proto.message.Manifest
TITLE: Define Tokenization Function (Python) DESCRIPTION: This function takes a single sample from a Hugging Face dataset and a specified field name (e.g., 'text'). It uses the pre-initialized tokenizer to convert the text content of that field into 'input_ids', which are numerical representations of tokens.
LANGUAGE: python CODE:
def tokenize(sample, field='text'):
return tokenizer(sample[field])['input_ids']
TITLE: Implement CLIP Loss Function and Forward Pass Utilities
DESCRIPTION: This snippet provides utility functions for training a CLIP model. The `loss_fn` calculates the contrastive loss between image and text embeddings based on the CLIP paper, using logits, image similarity, and text similarity. The `forward` function performs a single forward pass, moving inputs to the GPU, and obtaining image and text embeddings using the defined encoder and head modules.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/clip_training.md#_snippet_6
LANGUAGE: python CODE:
def loss_fn(img_embed, text_embed, temperature=0.2):
"""
https://arxiv.org/abs/2103.00020/
"""
# Calculate logits, image similarity and text similarity
logits = (text_embed @ img_embed.T) / temperature
img_sim = img_embed @ img_embed.T
text_sim = text_embed @ text_embed.T
# Calculate targets by taking the softmax of the similarities
targets = F.softmax(
(img_sim + text_sim) / 2 * temperature, dim=-1
)
img_loss = (-targets.T * nn.LogSoftmax(dim=-1)(logits.T)).sum(1)
text_loss = (-targets * nn.LogSoftmax(dim=-1)(logits)).sum(1)
return (img_loss + text_loss) / 2.0
def forward(img, caption):
# Transfer to device
img = img.to('cuda')
for k, v in caption.items():
caption[k] = v.to('cuda')
# Get embeddings for both img and caption
img_embed = img_head(img_encoder(img))
text_embed = text_head(text_encoder(caption))
return img_embed, text_embed
TITLE: Read Data from Lance Dataset DESCRIPTION: Shows how to open and read a Lance dataset from a specified URI. It asserts that the returned object is a PyArrow Dataset, confirming seamless integration with the Apache Arrow ecosystem.
SOURCE: https://github.com/lancedb/lance/blob/main/README.md#_snippet_3
LANGUAGE: python CODE:
dataset = lance.dataset("/tmp/test.lance")
assert isinstance(dataset, pa.dataset.Dataset)
TITLE: Globally Set Object Store Timeout (Bash) DESCRIPTION: Demonstrates how to set a global timeout for object store operations using an environment variable. This configuration applies to all subsequent Lance operations that interact with object storage.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/object_store.md#_snippet_0
LANGUAGE: bash CODE:
export TIMEOUT=60s
TITLE: Lance File Format Version Details DESCRIPTION: This table provides a comprehensive overview of the Lance file format versions, including their compatibility, features, and stability status. It details the breaking changes, new functionalities introduced in each version, and aliases for common use cases.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/format/format.md#_snippet_5
LANGUAGE: APIDOC CODE:
Version: 0.1
Minimal Lance Version: Any
Maximum Lance Version: Any
Description: This is the initial Lance format.
Version: 2.0
Minimal Lance Version: 0.16.0
Maximum Lance Version: Any
Description: Rework of the Lance file format that removed row groups and introduced null support for lists, fixed size lists, and primitives
Version: 2.1 (unstable)
Minimal Lance Version: None
Maximum Lance Version: Any
Description: Enhances integer and string compression, adds support for nulls in struct fields, and improves random access performance with nested fields.
Version: legacy
Minimal Lance Version: N/A
Maximum Lance Version: N/A
Description: Alias for 0.1
Version: stable
Minimal Lance Version: N/A
Maximum Lance Version: N/A
Description: Alias for the latest stable version (currently 2.0)
Version: next
Minimal Lance Version: N/A
Maximum Lance Version: N/A
Description: Alias for the latest unstable version (currently 2.1)
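TITLE: Example: Select a File Format Version When Writing (illustrative)
DESCRIPTION: A minimal sketch showing how the version aliases above ("legacy", "stable", "next") could be used when writing a dataset. It assumes a `data_storage_version` parameter on `lance.write_dataset`, which may differ across pylance releases; check the API reference of your installed version.
LANGUAGE: python CODE:
import lance
import pyarrow as pa

table = pa.table({"id": [1, 2, 3]})
# "stable" is an alias for the latest stable format (currently 2.0)
ds = lance.write_dataset(table, "./versioned_example.lance", data_storage_version="stable")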
TITLE: Connect LanceDB to S3-Compatible Stores (e.g., MinIO)
DESCRIPTION: Illustrates how to configure LanceDB to connect to S3-compatible storage solutions like MinIO. This requires specifying both the `region` and `endpoint` within the `storage_options` parameter to direct LanceDB to the custom S3 endpoint, enabling connectivity beyond AWS S3.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/object_store.md#_snippet_5
LANGUAGE: python CODE:
import lance
ds = lance.dataset(
"s3://bucket/path",
storage_options={
"region": "us-east-1",
"endpoint": "http://minio:9000",
}
)
TITLE: Load and parse Flickr8k token file annotations DESCRIPTION: This code reads the 'Flickr8k.token.txt' file, which contains image annotations. It then processes each line to extract the image file name, a unique caption number, and the caption text itself, storing them as structured tuples for further processing.
LANGUAGE: python CODE:
with open(captions, "r") as fl:
annotations = fl.readlines()
# Converts the annotations where each element of this list is a tuple consisting of image file name, caption number and caption itself
annotations = list(map(lambda x: tuple([*x.split('\t')[0].split('#'), x.split('\t')[1]]), annotations))
TITLE: Lance File Footer and Overall Layout Specification DESCRIPTION: Provides a detailed byte-level specification of the .lance file format, including the arrangement of data pages, column metadata, offset tables, and the final footer. It outlines alignment requirements and the structure of various fields.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/format/format.md#_snippet_4
LANGUAGE: APIDOC CODE:
// Note: the number of buffers (BN) is independent of the number of columns (CN)
// and pages.
//
// Buffers often need to be aligned. 64-byte alignment is common when
// working with SIMD operations. 4096-byte alignment is common when
// working with direct I/O. In order to ensure these buffers are aligned
// writers may need to insert padding before the buffers.
//
// If direct I/O is required then most (but not all) fields described
// below must be sector aligned. We have marked these fields with an
// asterisk for clarity. Readers should assume there will be optional
// padding inserted before these fields.
//
// All footer fields are unsigned integers written with little endian
// byte order.
//
// (The original ASCII layout diagram of the .lance file -- data pages, column metadata blocks, the offset tables, and the footer fields -- is not reproduced here; see format.md for the full diagram.)
TITLE: LanceDB Conflict Resolution Process DESCRIPTION: Outlines the commit process in LanceDB, detailing how writers handle concurrent modifications, create transaction files for conflict detection, and retry commits after checking for compatibility with successful writes.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/format/format.md#_snippet_11
LANGUAGE: APIDOC CODE:
Commit Process:
1. Writer finishes writing all data files.
2. Writer creates a transaction file in _transactions directory.
- Purpose: detect conflicts, re-build manifest during retries.
3. Check for new commits since writer started.
- If conflicts detected (via transaction files), abort commit.
4. Build manifest and attempt to commit to next version.
- If commit fails due to concurrent write, go back to step 3.
Conflict Detection:
- Conservative approach: assume conflict if transaction file is missing or has unknown operation.
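TITLE: Illustrative Sketch of the Commit-Retry Loop (pseudocode)
DESCRIPTION: A non-normative sketch of the commit process described above. The helper functions (`write_transaction_file`, `read_transactions_since`, `conflicts_with`, `build_manifest`, `try_commit`) and the `transaction.read_version` attribute are hypothetical and only mirror the numbered steps; they are not part of the Lance API.
LANGUAGE: python CODE:
def commit_with_retries(dataset, transaction, max_attempts=10):
    # Step 2: record the transaction so other writers can detect conflicts
    # and so the manifest can be rebuilt during retries.
    write_transaction_file(dataset, transaction)
    for _ in range(max_attempts):
        # Step 3: inspect commits that landed after this writer started.
        for other in read_transactions_since(dataset, transaction.read_version):
            if conflicts_with(transaction, other):
                raise RuntimeError("conflicting concurrent write; aborting commit")
        # Step 4: build the manifest and try to claim the next version.
        manifest = build_manifest(dataset, transaction)
        if try_commit(dataset, manifest):
            return manifest
        # Another writer won the race; loop back to conflict checking.
    raise RuntimeError("too many concurrent commits; giving up")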
TITLE: Lance Dataset Directory Structure DESCRIPTION: Illustrates the typical organization of a Lance dataset within a directory, detailing the location of data files, version manifests, secondary indices, and deletion files.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/format/format.md#_snippet_0
LANGUAGE: plaintext CODE:
/path/to/dataset:
data/*.lance -- Data directory
_versions/*.manifest -- Manifest file for each dataset version.
_indices/{UUID-*}/index.idx -- Secondary index, each index per directory.
_deletions/*.{arrow,bin} -- Deletion files, which contain ids of rows
that have been deleted.
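TITLE: Example: Inspect Dataset Versions from Python (illustrative)
DESCRIPTION: A small sketch relating the on-disk layout above to the Python API: each manifest under _versions corresponds to one dataset version. It assumes the `version` and `versions()` members of `lance.LanceDataset`; consult your pylance release if the names differ. The dataset path is reused from the write examples elsewhere in this document.
LANGUAGE: python CODE:
import lance

ds = lance.dataset("./alice_and_bob.lance")
print(ds.version)     # current version number
print(ds.versions())  # one entry per manifest file in _versions/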
TITLE: Define PyArrow Schema for Lance Dataset in Python
DESCRIPTION: This Python code defines a PyArrow schema for the Lance dataset. It specifies the data types for `image_id` (string), `image` (binary), and `captions` (list of strings), ensuring proper data structure and type consistency for the dataset when written to Lance.
LANGUAGE: python CODE:
schema = pa.schema([
pa.field("image_id", pa.string()),
pa.field("image", pa.binary()),
pa.field("captions", pa.list_(pa.string(), -1)),
])
TITLE: Define image augmentations for CLIP model training
DESCRIPTION: This snippet defines a `torchvision.transforms.Compose` object for basic image augmentations applied during CLIP model training. It includes converting images to tensors, resizing them to a consistent shape, and normalizing pixel values to stabilize the training process.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/clip_training.md#_snippet_4
LANGUAGE: python CODE:
train_augments = transforms.Compose(
[
transforms.ToTensor(),
transforms.Resize(Config.img_size),
transforms.Normalize([0.5], [0.5]),
]
)
TITLE: Generate GIST-1M Database Vectors Lance Dataset
DESCRIPTION: Uses the `datagen.py` script to convert GIST-1M base vectors into a Lance dataset. This dataset will serve as the primary data source for indexing and querying in the GIST-1M benchmark.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/sift/README.md#_snippet_9
LANGUAGE: sh CODE:
./datagen.py ./gist/gist_base.fvecs ./.lancedb/gist1m.lance -g 1024 -m 50000 -d 960
TITLE: Set Object Store Timeout for a Single Dataset (Python)
DESCRIPTION: Shows how to specify storage options, such as a timeout, for a specific Lance dataset using the `storage_options` parameter in `lance.dataset`. This allows fine-grained control over individual dataset configurations without affecting global settings.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/object_store.md#_snippet_1
LANGUAGE: python CODE:
import lance
ds = lance.dataset("s3://path", storage_options={"timeout": "60s"})
TITLE: Connect LanceDB to Google Cloud Storage
DESCRIPTION: This Python snippet demonstrates how to connect a LanceDB dataset to Google Cloud Storage using `storage_options` to specify service account credentials. It provides an alternative to setting the `GOOGLE_SERVICE_ACCOUNT` environment variable.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/object_store.md#_snippet_7
LANGUAGE: python CODE:
import lance
ds = lance.dataset(
"gs://my-bucket/my-dataset",
storage_options={
"service_account": "path/to/service-account.json",
}
)
TITLE: Read and Write Lance Data with Ray and Pandas
DESCRIPTION: This snippet demonstrates how to write data to the Lance format using Ray's data sink (`ray.data.Dataset.write_lance`) and subsequently read it back using both the native Lance API (`lance.dataset`) and Ray's data source (`ray.data.read_lance`). It includes assertions to verify data integrity after read/write operations.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/integrations/ray.md#_snippet_0
LANGUAGE: python CODE:
import ray
import pandas as pd
import lance
ray.init()
data = [
{"id": 1, "name": "alice"},
{"id": 2, "name": "bob"},
{"id": 3, "name": "charlie"}
]
ray.data.from_items(data).write_lance("./alice_bob_and_charlie.lance")
# It can be read via lance directly
df = (
lance.
dataset("./alice_bob_and_charlie.lance")
.to_table()
.to_pandas()
.sort_values(by=["id"])
.reset_index(drop=True)
)
assert df.equals(pd.DataFrame(data)), "{} != {}".format(
df, pd.DataFrame(data)
)
# Or via Ray.data.read_lance
ray_df = (
ray.data.read_lance("./alice_bob_and_charlie.lance")
.to_pandas()
.sort_values(by=["id"])
.reset_index(drop=True)
)
assert df.equals(ray_df)
TITLE: Load PyTorch Model State Dictionary from Lance Dataset (Python) DESCRIPTION: This function reads all model weights from a specified Lance dataset file and constructs an OrderedDict suitable as a PyTorch model state dictionary. It iterates through each weight, converting it using _load_weight, and places it on the specified device. This function assumes all weights can fit into memory; large models may cause memory errors.
LANGUAGE: python CODE:
def _load_state_dict(file_name: str, version: int = 1, map_location=None) -> OrderedDict:
"""Reads the model weights from lance file and returns a model state dict
If the model weights are too large, this function will fail with a memory error.
Args:
file_name (str): Lance model name
version (int): Version of the model to load
map_location (str): Device to load the model on
Returns:
OrderedDict: Model state dict
"""
ds = lance.dataset(file_name, version=version)
weights = ds.take([x for x in range(ds.count_rows())]).to_pylist()
state_dict = OrderedDict()
for weight in weights:
state_dict[weight["name"]] = _load_weight(weight).to(map_location)
return state_dict
TITLE: Load Data Chunk from Lance Dataset by Indices
DESCRIPTION: This utility function, `from_indices`, efficiently loads specific elements from a Lance dataset based on a list of provided indices. It takes a Lance dataset object and a list of integer indices, then retrieves the corresponding rows. The function processes these rows to extract only the 'input_ids' from each, returning them as a list of token IDs, which is crucial for preparing data chunks.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/llm_training.md#_snippet_1
LANGUAGE: python CODE:
def from_indices(dataset, indices):
"""Load the elements on given indices from the dataset"""
chunk = dataset.take(indices).to_pylist()
chunk = list(map(lambda x: x['input_ids'], chunk))
return chunk
TITLE: Run LanceDB Vector Search Recall Test DESCRIPTION: Defines run_test, a comprehensive function for evaluating LanceDB's vector search recall. It generates ground truth, writes data to a temporary LanceDB dataset, creates an IVF_PQ index, and performs nearest neighbor queries with varying nprobes and refine_factor to calculate recall for both in-sample and out-of-sample queries.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/full_report/report.ipynb#_snippet_2
LANGUAGE: python CODE:
def run_test(
data,
query,
metric,
num_partitions=256,
num_sub_vectors=8,
nprobes_list=[1, 2, 5, 10, 16],
refine_factor_list=[1, 2, 5, 10, 20],
):
in_sample = data[random.sample(range(data.shape[0]), 1000), :]
# ground truth
print("generating gt")
gt = knn(query, data, metric, 10)
gt_in_sample = knn(in_sample, data, metric, 10)
print("generated gt")
with tempfile.TemporaryDirectory() as d:
write_lance(d, data)
ds = lance.dataset(d)
for q, target in zip(tqdm(in_sample, desc="checking brute force"), gt_in_sample):
res = ds.to_table(nearest={
"column": "vec",
"q": q,
"k": 10,
"metric": metric,
}, columns=["id"])
assert len(np.intersect1d(res["id"].to_numpy(), target)) == 10
ds = ds.create_index("vec", "IVF_PQ", metric=metric, num_partitions=num_partitions, num_sub_vectors=num_sub_vectors)
recall_data = []
for nprobes in nprobes_list:
for refine_factor in refine_factor_list:
hits = 0
# check that brute force impl is correct
for q, target in zip(tqdm(query, desc=f"out of sample, nprobes={nprobes}, refine={refine_factor}"), gt):
res = ds.to_table(nearest={
"column": "vec",
"q": q,
"k": 10,
"nprobes": nprobes,
"refine_factor": refine_factor,
}, columns=["id"])["id"].to_numpy()
hits += len(np.intersect1d(res, target))
recall_data.append([
"out_of_sample",
nprobes,
refine_factor,
hits / 10 / len(gt),
])
# check that brute force impl is correct
for q, target in zip(tqdm(in_sample, desc=f"in sample nprobes={nprobes}, refine={refine_factor}"), gt_in_sample):
res = ds.to_table(nearest={
"column": "vec",
"q": q,
"k": 10,
"nprobes": nprobes,
"refine_factor": refine_factor,
}, columns=["id"])["id"].to_numpy()
hits += len(np.intersect1d(res, target))
recall_data.append([
"in_sample",
nprobes,
refine_factor,
hits / 10 / len(gt_in_sample),
])
return recall_data
TITLE: Stream PyArrow RecordBatches to Lance Dataset
DESCRIPTION: Shows how to write a Lance dataset from an iterator of `pyarrow.RecordBatch` objects. This method is ideal for large datasets that cannot be fully loaded into memory, requiring a `pyarrow.Schema` to be provided.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/read_and_write.md#_snippet_1
LANGUAGE: python CODE:
def producer() -> Iterator[pa.RecordBatch]:
"""An iterator of RecordBatches."""
yield pa.RecordBatch.from_pylist([{"name": "Alice", "age": 20}])
yield pa.RecordBatch.from_pylist([{"name": "Bob", "age": 30}])
schema = pa.schema([
("name", pa.string()),
("age", pa.int32()),
])
ds = lance.write_dataset(producer(),
"./alice_and_bob.lance",
schema=schema, mode="overwrite")
print(ds.count_rows()) # Output: 2
TITLE: LanceDB External Manifest Store Commit Operations DESCRIPTION: Details the four-step commit process when using an external manifest store for concurrent writes in LanceDB, involving staging manifests, committing paths to the external store, and finalizing the manifest in object storage.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/format/format.md#_snippet_12
LANGUAGE: APIDOC CODE:
External Store Commit Process:
1. PUT_OBJECT_STORE mydataset.lance/_versions/{version}.manifest-{uuid}
- Action: Stage new manifest in object store under unique path.
2. PUT_EXTERNAL_STORE base_uri, version, mydataset.lance/_versions/{version}.manifest-{uuid}
- Action: Commit staged manifest path to external KV store.
- Note: Commit is effectively complete after this step.
3. COPY_OBJECT_STORE mydataset.lance/_versions/{version}.manifest-{uuid} mydataset.lance/_versions/{version}.manifest
- Action: Copy staged manifest to final path.
4. PUT_EXTERNAL_STORE base_uri, version, mydataset.lance/_versions/{version}.manifest
- Action: Update external store to point to final manifest.
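TITLE: Illustrative Sketch of the External Store Commit Path (pseudocode)
DESCRIPTION: A hedged sketch of the four commit steps above, using the same hypothetical `external_store` and `object_store` clients as the reader sketch earlier in this document; none of these calls are real Lance APIs.
LANGUAGE: python CODE:
import uuid

def commit_via_external_store(external_store, object_store, base_uri, version, manifest_bytes):
    # 1. Stage the new manifest in the object store under a unique path.
    staged = f"{base_uri}/_versions/{version}.manifest-{uuid.uuid4()}"
    object_store.put(staged, manifest_bytes)
    # 2. Commit the staged path to the external KV store; the commit is
    #    effectively complete once this succeeds.
    external_store.put(base_uri, version, staged)
    # 3. Copy the staged manifest to its final, self-describing path.
    final = f"{base_uri}/_versions/{version}.manifest"
    object_store.copy(staged, final)
    # 4. Point the external store at the final manifest.
    external_store.put(base_uri, version, final)
    return final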
TITLE: Write PyArrow Table to Lance Dataset
DESCRIPTION: Demonstrates how to write a `pyarrow.Table` to a Lance dataset using the `lance.write_dataset` function. This is suitable for datasets that can be fully loaded into memory.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/read_and_write.md#_snippet_0
LANGUAGE: python CODE:
import lance
import pyarrow as pa
table = pa.Table.from_pylist([{"name": "Alice", "age": 20},
{"name": "Bob", "age": 30}])
ds = lance.write_dataset(table, "./alice_and_bob.lance")
TITLE: Lance DataFragment Protobuf Message Reference DESCRIPTION: References the Protobuf message definition for DataFragment, which represents a logical chunk of data within a Lance dataset. It can include one or more DataFiles and an optional DeletionFile.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/format/format.md#_snippet_2
LANGUAGE: APIDOC CODE:
proto.message.DataFragment
TITLE: Import Libraries for LanceDB Vector Search Testing DESCRIPTION: Imports necessary Python libraries for numerical operations (numpy), temporary file handling (tempfile), data manipulation (pandas), plotting (seaborn, matplotlib), and LanceDB specific functionalities (lance, _lib). These imports provide the foundational tools for the vector search tests.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/full_report/report.ipynb#_snippet_1
LANGUAGE: python CODE:
from _lib import knn, write_lance, _get_nyt_vectors
import numpy as np
import tempfile
import random
import lance
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from tqdm.auto import tqdm
TITLE: Generate SIFT-1M Database Vectors Lance Dataset
DESCRIPTION: Uses the `datagen.py` script to convert SIFT-1M base vectors into a Lance dataset. This dataset will serve as the primary data source for indexing and querying in the SIFT-1M benchmark.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/sift/README.md#_snippet_3
LANGUAGE: sh CODE:
./datagen.py ./sift/sift_base.fvecs ./.lancedb/sift1m.lance -d 128
TITLE: Compact LanceDB Dataset Files with Python
DESCRIPTION: This Python code demonstrates how to compact data files within a LanceDB dataset using the `compact_files` method. It specifies a `target_rows_per_fragment` to optimize file count and can remove soft-deleted rows, improving query performance. Note that compaction creates a new table version and invalidates old row addresses for indexing.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/read_and_write.md#_snippet_21
LANGUAGE: python CODE:
import lance
dataset = lance.dataset("./alice_and_bob.lance")
dataset.optimize.compact_files(target_rows_per_fragment=1024 * 1024)
TITLE: Prepare PyTorch Model State Dict for LanceDB Saving
DESCRIPTION: This utility function processes a PyTorch model's `state_dict`, iterating through each parameter. It flattens the parameter's tensor, extracts its name and original shape, and then packages this information into a PyArrow `RecordBatch`. This prepares the model weights for efficient storage in a LanceDB dataset.
LANGUAGE: python CODE:
def _save_model_writer(state_dict):
"""Yields a RecordBatch for each parameter in the model state dict"""
for param_name, param in state_dict.items():
param_shape = list(param.size())
param_value = param.flatten().tolist()
yield pa.RecordBatch.from_arrays(
[
pa.array(
[param_name],
pa.string(),
),
pa.array(
[param_value],
pa.list_(pa.float64(), -1),
),
pa.array(
[param_shape],
pa.list_(pa.int64(), -1),
),
],
["name", "value", "shape"],
)
TITLE: Create PyTorch DataLoader from LanceDataset (Safe)
DESCRIPTION: This snippet demonstrates how to create a multiprocessing-safe PyTorch DataLoader using `SafeLanceDataset` and `get_safe_loader`. It explicitly uses the 'spawn' method to avoid fork-safety issues that can arise when LanceDB's internal multithreading interacts with Python's multiprocessing.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/integrations/pytorch.md#_snippet_2
LANGUAGE: python CODE:
from lance.torch.data import SafeLanceDataset, get_safe_loader
dataset = SafeLanceDataset(temp_lance_dataset)
# use spawn method to avoid fork-safe issue
loader = get_safe_loader(
dataset,
num_workers=2,
batch_size=16,
drop_last=False,
)
total_samples = 0
for batch in loader:
total_samples += batch["id"].shape[0]
TITLE: Generate SIFT-1M Ground Truth Lance Dataset
DESCRIPTION: Generates a ground truth Lance dataset for SIFT-1M using the `gt.py` script. This dataset is essential for evaluating the recall of the benchmark queries against known correct results.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/sift/README.md#_snippet_4
LANGUAGE: sh CODE:
./gt.py ./.lancedb/sift1m.lance -o ./.lancedb/ground_truth.lance
TITLE: Lindera User Dictionary Configuration File (config.yml)
DESCRIPTION: YAML configuration for Lindera, defining the segmenter mode and the path to the dictionary. This file, typically named `config.yml`, can be placed in the model's root directory or specified via the `LINDERA_CONFIG_PATH` environment variable. The `kind` field is not supported in LanceDB's context.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/tokenizer.md#_snippet_6
LANGUAGE: yaml CODE:
segmenter:
mode: "normal"
dictionary:
# Note: in lance, the `kind` field is not supported. You need to specify the model path using the `path` field instead.
path: /path/to/lindera/ipadic/main
TITLE: Test LanceDB Vector Search with Random Data (L2 Metric) DESCRIPTION: Demonstrates running the run_test function with randomly generated data (100,000 vectors, 64 dimensions) and queries, using the L2 (Euclidean) distance metric. It then visualizes the recall results using make_plot.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/full_report/report.ipynb#_snippet_4
LANGUAGE: python CODE:
# test randomly generated data
data = np.random.standard_normal((100000, 64))
query = np.random.standard_normal((1000, 64))
recall_data = run_test(
data,
query,
"L2",
)
make_plot(recall_data)
TITLE: Lance ColumnMetadata Protobuf Message Reference DESCRIPTION: References the Protobuf message definition for ColumnMetadata, which is used to describe the encoding and properties of individual columns within a .lance file.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/format/format.md#_snippet_3
LANGUAGE: APIDOC CODE:
proto.message.ColumnMetadata
TITLE: Generate GIST-1M Query Vectors Lance Dataset
DESCRIPTION: Converts GIST-1M query vectors into a Lance dataset using `datagen.py`. These vectors will be used to perform similarity searches against the indexed database during the benchmark.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/sift/README.md#_snippet_10
LANGUAGE: sh CODE:
./datagen.py ./gist/gist_query.fvecs ./.lancedb/gist_query.lance -g 1024 -m 50000 -d 960 -n 1000
TITLE: Test LanceDB Vector Search with Random Data (Cosine Metric) DESCRIPTION: Shows how to execute the run_test function using randomly generated data and queries, but this time employing the cosine similarity metric. The recall performance is subsequently plotted using make_plot.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/full_report/report.ipynb#_snippet_5
LANGUAGE: python CODE:
# test randomly generated data -- cosine
data = np.random.standard_normal((100000, 64))
query = np.random.standard_normal((1000, 64))
recall_data = run_test(
data,
query,
"cosine",
)
make_plot(recall_data)
TITLE: Load PyTorch Model with Weights from Lance Dataset (Python) DESCRIPTION: This high-level function facilitates loading weights directly into a given PyTorch model from a Lance dataset. It internally calls _load_state_dict to retrieve the complete state dictionary and then applies it to the provided model instance. This simplifies the process of restoring a model's state from a Lance-backed storage.
LANGUAGE: python CODE:
def load_model(
model: torch.nn.Module, file_name: str, version: int = 1, map_location=None
):
"""Loads the model weights from lance file and sets them to the model
Args:
model (torch.nn.Module): PyTorch model
file_name (str): Lance model name
version (int): Version of the model to load
map_location (str): Device to load the model on
"""
state_dict = _load_state_dict(file_name, version=version, map_location=map_location)
model.load_state_dict(state_dict)
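TITLE: Reconstruct Full State Dict from Lance Dataset (Illustrative Sketch)
DESCRIPTION: The load_model function above calls a _load_state_dict helper that is not included in these snippets. The following is a minimal sketch of what such a helper could look like, assuming the GLOBAL_SCHEMA layout ('name', 'value', 'shape') and the _load_weight utility shown later in this document; it is an illustration, not the exact implementation from the Lance examples.
LANGUAGE: python CODE:
from collections import OrderedDict
import lance

def _load_state_dict(file_name: str, version: int = 1, map_location=None) -> OrderedDict:
    """Reads every weight row from the Lance dataset and rebuilds a state dict."""
    ds = lance.dataset(file_name, version=version)
    # Fetch all rows; each row holds one parameter's name, flattened values, and shape
    weights = ds.take(list(range(ds.count_rows()))).to_pylist()
    state_dict = OrderedDict()
    for weight in weights:
        tensor = _load_weight(weight)  # helper shown in a later snippet
        if map_location is not None:
            tensor = tensor.to(map_location)
        state_dict[weight["name"]] = tensor
    return state_dict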
TITLE: Connect LanceDB to Azure Blob Storage
DESCRIPTION: This Python snippet illustrates how to connect a LanceDB dataset to Azure Blob Storage. It shows how to pass `account_name` and `account_key` directly via `storage_options`, offering an alternative to environment variables.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/object_store.md#_snippet_9
LANGUAGE: python CODE:
import lance
ds = lance.dataset(
"az://my-container/my-dataset",
storage_options={
"account_name": "some-account",
"account_key": "some-key",
}
)
TITLE: Default Lance Language Model Home Directory DESCRIPTION: This snippet illustrates the default directory path where LanceDB stores language models if the LANCE_LANGUAGE_MODEL_HOME environment variable is not explicitly set by the user.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/tokenizer.md#_snippet_0
LANGUAGE: bash CODE:
${system data directory}/lance/language_models
TITLE: Perform Random Row Access in Lance Dataset
DESCRIPTION: This Python snippet demonstrates Lance's capability for fast random access to individual rows using the `take()` method. This feature is crucial for workflows like random sampling, shuffling in ML training, and building secondary indices for enhanced query performance.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/read_and_write.md#_snippet_20
LANGUAGE: python CODE:
data = ds.take([1, 100, 500], columns=["image", "label"])
TITLE: Configure AWS Credentials for LanceDB S3 Dataset
DESCRIPTION: Demonstrates how to pass the AWS access key ID, secret access key, and session token directly to the `storage_options` parameter when initializing a LanceDB dataset from an S3 path. This method provides explicit credential management for S3 access, overriding environment variables if set.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/object_store.md#_snippet_3
LANGUAGE: python CODE:
import lance
ds = lance.dataset(
"s3://bucket/path",
storage_options={
"access_key_id": "my-access-key",
"secret_access_key": "my-secret-key",
"session_token": "my-session-token",
}
)
TITLE: Create Scalar Index with Jieba Tokenizer in Python DESCRIPTION: Python code demonstrating how to create a scalar index on a 'text' field using the 'INVERTED' index type, specifying 'jieba/default' as the base tokenizer for text processing within LanceDB.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/tokenizer.md#_snippet_2
LANGUAGE: python CODE:
ds.create_scalar_index("text", "INVERTED", base_tokenizer="jieba/default")
TITLE: Add and Populate Columns with SQL Expressions in Lance DESCRIPTION: Illustrates adding and populating new columns in a Lance dataset using SQL expressions. This method allows defining column values based on existing columns or literal values, enabling data backfill within a single operation.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/data_evolution.md#_snippet_1
LANGUAGE: python CODE:
table = pa.table({"name": pa.array(["Alice", "Bob", "Carla"])})
dataset = lance.write_dataset(table, "names")
dataset.add_columns({
"hash": "sha256(name)",
"status": "'active'",
})
print(dataset.to_table().to_pandas())
TITLE: Perform Nearest Neighbor Vector Search on Lance Dataset
DESCRIPTION: Demonstrates how to perform nearest neighbor searches on a Lance dataset with a vector index. It samples query vectors using DuckDB and then retrieves the top 10 similar vectors for each query using Lance's `nearest` functionality, showcasing its vector search capabilities.
SOURCE: https://github.com/lancedb/lance/blob/main/README.md#_snippet_9
LANGUAGE: python CODE:
# Get top 10 similar vectors
import duckdb
dataset = lance.dataset(uri)
# Sample 100 query vectors. If this segfaults, make sure you have duckdb v0.7+ installed
sample = duckdb.query("SELECT vector FROM dataset USING SAMPLE 100").to_df()
query_vectors = np.array([np.array(x) for x in sample.vector])
# Get nearest neighbors for all of them
rs = [dataset.to_table(nearest={"column": "vector", "k": 10, "q": q})
for q in query_vectors]
TITLE: Convert Parquet to Lance Dataset DESCRIPTION: Demonstrates how to convert a Pandas DataFrame to a PyArrow Table, save it as a Parquet file, and then convert the Parquet dataset into a Lance dataset. This showcases Lance's compatibility with existing data formats and its ease of use for data migration.
SOURCE: https://github.com/lancedb/lance/blob/main/README.md#_snippet_2
LANGUAGE: python CODE:
import lance
import pandas as pd
import pyarrow as pa
import pyarrow.dataset
df = pd.DataFrame({"a": [5], "b": [10]})
uri = "/tmp/test.parquet"
tbl = pa.Table.from_pandas(df)
pa.dataset.write_dataset(tbl, uri, format='parquet')
parquet = pa.dataset.dataset(uri, format='parquet')
lance.write_dataset(parquet, "/tmp/test.lance")
TITLE: Define PyArrow Schema with Lance Encoding Metadata DESCRIPTION: This Python snippet demonstrates how to define a PyArrow schema for a LanceDB table, applying column-level encoding configurations. It shows how to use PyArrow field metadata to specify compression algorithms, compression levels, structural encoding strategies, and packed memory layout for string columns.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/format/format.md#_snippet_7
LANGUAGE: python CODE:
import pyarrow as pa
schema = pa.schema([
pa.field(
"compressible_strings",
pa.string(),
metadata={
"lance-encoding:compression": "zstd",
"lance-encoding:compression-level": "3",
"lance-encoding:structural-encoding": "miniblock",
"lance-encoding:packed": "true"
}
)
])
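TITLE: Write a Dataset Using the Encoding-Configured Schema (Hypothetical Example)
DESCRIPTION: A small follow-up sketch showing how the schema defined above can be passed to lance.write_dataset so the per-column encoding metadata is applied at write time. The table contents and the /tmp/compressed_strings.lance path are illustrative assumptions, not part of the original snippet.
LANGUAGE: python CODE:
import lance
import pyarrow as pa

# Reuses the `schema` defined in the previous snippet
table = pa.table(
    {"compressible_strings": ["aaaa", "aaab", "aaac"] * 1000},
    schema=schema,
)
ds = lance.write_dataset(table, "/tmp/compressed_strings.lance", schema=schema)
print(ds.schema)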
TITLE: Configure Seaborn Plot Style DESCRIPTION: This snippet imports the seaborn library and sets the default plot style to 'darkgrid'. This improves the visual aesthetics of subsequent plots generated using seaborn.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/sift/Results.ipynb#_snippet_1
LANGUAGE: python CODE:
import seaborn as sns
sns.set_style("darkgrid")
TITLE: Generate SIFT-1M Query Vectors Lance Dataset
DESCRIPTION: Converts SIFT-1M query vectors into a Lance dataset using `datagen.py`. These vectors will be used to perform similarity searches against the indexed database during the benchmark.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/sift/README.md#_snippet_5
LANGUAGE: sh CODE:
./datagen.py ./sift/sift_query.fvecs ./.lancedb/sift_query.lance -d 128 -n 1000
TITLE: Convert SIFT1M Dataset to Lance for Vector Search
DESCRIPTION: Loads the SIFT1M dataset from a binary file, converts its raw vector data into a NumPy array, and then transforms it into a Lance table using `vec_to_table`. The dataset is then written to a Lance file, optimized for vector search with specific row group and file size settings.
SOURCE: https://github.com/lancedb/lance/blob/main/README.md#_snippet_7
LANGUAGE: python CODE:
import lance
from lance.vector import vec_to_table
import numpy as np
import struct
nvecs = 1000000
ndims = 128
with open("sift/sift_base.fvecs", mode="rb") as fobj:
buf = fobj.read()
data = np.array(struct.unpack("<128000000f", buf[4 : 4 + 4 * nvecs * ndims])).reshape((nvecs, ndims))
dd = dict(zip(range(nvecs), data))
table = vec_to_table(dd)
uri = "vec_data.lance"
sift1m = lance.write_dataset(table, uri, max_rows_per_group=8192, max_rows_per_file=1024*1024)
TITLE: Load Entire Lance Dataset into Memory
DESCRIPTION: This Python snippet shows how to load an entire Lance dataset into an in-memory table using the `to_table()` method. This approach is straightforward and suitable for datasets that can comfortably fit within available memory.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/read_and_write.md#_snippet_13
LANGUAGE: python CODE:
table = ds.to_table()
TITLE: Lance SQL Type to Apache Arrow Type Mapping DESCRIPTION: This table provides a comprehensive mapping between SQL data types supported by Lance and their corresponding Apache Arrow data types. It details the internal storage format for various data representations, crucial for understanding data compatibility and performance.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/read_and_write.md#_snippet_19
LANGUAGE: APIDOC CODE:
| SQL type | Arrow type |
|----------|------------|
| `boolean` | `Boolean` |
| `tinyint` / `tinyint unsigned` | `Int8` / `UInt8` |
| `smallint` / `smallint unsigned` | `Int16` / `UInt16` |
| `int` or `integer` / `int unsigned` or `integer unsigned` | `Int32` / `UInt32` |
| `bigint` / `bigint unsigned` | `Int64` / `UInt64` |
| `float` | `Float32` |
| `double` | `Float64` |
| `decimal(precision, scale)` | `Decimal128` |
| `date` | `Date32` |
| `timestamp` | `Timestamp` (1) |
| `string` | `Utf8` |
| `binary` | `Binary` |
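TITLE: Use SQL-Typed Columns in a Filter Expression (Hypothetical Example)
DESCRIPTION: A brief sketch of how the SQL types in the table above appear inside a filter passed to a Lance scan. The dataset path and the age/created_at/name columns are hypothetical; the point is only that filter strings reference columns with SQL syntax while Lance resolves them against the underlying Arrow types.
LANGUAGE: python CODE:
import lance

ds = lance.dataset("./example.lance")  # hypothetical dataset
tbl = ds.to_table(
    columns=["name"],
    filter="age > 25 AND created_at >= timestamp '2021-01-01 00:00:00'",
)
print(tbl.to_pandas())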
TITLE: Visualize LanceDB Vector Search Recall Heatmap DESCRIPTION: Defines make_plot, a utility function to visualize the recall data generated by run_test. It takes the recall data (a list of lists) and converts it into a pandas DataFrame, then uses seaborn to generate heatmaps showing recall across different nprobes and refine_factor values for various test cases.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/full_report/report.ipynb#_snippet_3
LANGUAGE: python CODE:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

def make_plot(recall_data):
df = pd.DataFrame(recall_data, columns=["case", "nprobes", "refine_factor", "recall"])
num_cases = len(df["case"].unique())
(fig, axs) = plt.subplots(1, 2, figsize=(16, 8))
for case, ax in zip(df["case"].unique(), axs):
current_case = df[df["case"] == case]
sns.heatmap(
current_case.drop(columns=["case"]).set_index(["nprobes", "refine_factor"])["recall"].unstack(),
annot=True,
ax=ax,
).set(title=f"Recall -- {case}")
TITLE: Count unique video titles in dataset DESCRIPTION: Converts the loaded dataset to a Pandas DataFrame and counts the number of unique video titles. This provides an overview of the diversity and scope of the video content within the dataset.
SOURCE: https://github.com/lancedb/lance/blob/main/notebooks/youtube_transcript_search.ipynb#_snippet_2
LANGUAGE: python CODE:
data.to_pandas().title.nunique()
TITLE: Describe Median Latency by Refine Factor DESCRIPTION: This snippet groups the DataFrame by the 'refine_factor' column and calculates descriptive statistics for the '50%' (median response time) column. This provides an understanding of latency variations across different refinement factors.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/sift/Results.ipynb#_snippet_6
LANGUAGE: python CODE:
df.groupby("refine_factor")["50%"].describe()
TITLE: Utility functions to load images and captions from Lance dataset
DESCRIPTION: These two Python functions, `load_image` and `load_caption`, facilitate loading data from a Lance dataset. `load_image` converts byte-formatted images to a usable image format using NumPy and OpenCV, while `load_caption` extracts the longest caption associated with an image, assuming it contains the most descriptive information.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/clip_training.md#_snippet_2
LANGUAGE: python CODE:
import cv2
import numpy as np

def load_image(ds, idx):
# Utility function to load an image at an index and convert it from bytes format to img format
raw_img = ds.take([idx], columns=['image']).to_pydict()
raw_img = np.frombuffer(b''.join(raw_img['image']), dtype=np.uint8)
img = cv2.imdecode(raw_img, cv2.IMREAD_COLOR)
img = cv2.cvtColor(img, cv2.COLOR_RGB2BGR)
return img
def load_caption(ds, idx):
# Utility function to load an image's caption. Currently we return the longest caption of all
captions = ds.take([idx], columns=['captions']).to_pydict()['captions'][0]
return max(captions, key=len)
TITLE: Save PyTorch Model Weights to LanceDB with Versioning
DESCRIPTION: This function saves a PyTorch model's `state_dict` to a LanceDB file. It utilizes the `_save_model_writer` utility to format the data. The function supports either overwriting the existing model weights or saving them as a new version within the Lance dataset, providing flexibility for model checkpoint management.
LANGUAGE: python CODE:
import os
import shutil
from collections import OrderedDict

import lance
import pyarrow as pa

def save_model(state_dict: OrderedDict, file_name: str, version=False):
"""Saves a PyTorch model in lance file format
Args:
state_dict (OrderedDict): Model state dict
file_name (str): Lance model name
version (bool): Whether to save as a new version or overwrite the existing versions,
if the lance file already exists
"""
# Create a reader
reader = pa.RecordBatchReader.from_batches(
GLOBAL_SCHEMA, _save_model_writer(state_dict)
)
if os.path.exists(file_name):
if version:
# If we want versioning, we use the overwrite mode to create a new version
lance.write_dataset(
reader, file_name, schema=GLOBAL_SCHEMA, mode="overwrite"
)
else:
# If we don't want versioning, we delete the existing file and write a new one
shutil.rmtree(file_name)
lance.write_dataset(reader, file_name, schema=GLOBAL_SCHEMA)
else:
# If the file doesn't exist, we write a new one
lance.write_dataset(reader, file_name, schema=GLOBAL_SCHEMA)
TITLE: Protobuf Definition for Row ID Sequence Storage
DESCRIPTION: This protobuf `oneof` field defines how row ID sequences are stored. Small sequences are stored directly as `inline_sequence` bytes to avoid I/O overhead, while large sequences are referenced via an `external_file` path to optimize storage and retrieval.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/format/format.md#_snippet_16
LANGUAGE: Protobuf CODE:
oneof row_id_sequence {
// Inline sequence
bytes inline_sequence = 1;
// External file reference
string external_file = 2;
} // row_id_sequence
TITLE: Drop Columns in LanceDB Dataset
DESCRIPTION: This snippet demonstrates how to drop columns from a LanceDB dataset using the `lance.LanceDataset.drop_columns` method. This is a metadata-only operation, making it very fast. It also explains that physical data removal requires `lance.dataset.DatasetOptimizer.compact_files()` followed by `lance.LanceDataset.cleanup_old_versions()`.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/data_evolution.md#_snippet_4
LANGUAGE: python CODE:
table = pa.table({"id": pa.array([1, 2, 3]),
"name": pa.array(["Alice", "Bob", "Carla"])})
dataset = lance.write_dataset(table, "names", mode="overwrite")
dataset.drop_columns(["name"])
print(dataset.schema)
# id: int64
TITLE: Define CLIP Model Components (ImageEncoder, TextEncoder, Head) in PyTorch DESCRIPTION: This snippet defines the core neural network modules for a CLIP-like model. ImageEncoder uses a pre-trained vision model (e.g., ResNet) to convert images to feature vectors. TextEncoder uses a pre-trained language model (e.g., BERT) for text embeddings. The Head module projects these features into a common embedding space using linear layers, GELU activation, dropout, and layer normalization.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/clip_training.md#_snippet_5
LANGUAGE: python CODE:
class ImageEncoder(nn.Module):
"""Encodes the Image"""
def __init__(self, model_name, pretrained = True):
super().__init__()
self.backbone = timm.create_model(
model_name,
pretrained=pretrained,
num_classes=0,
global_pool="avg"
)
for param in self.backbone.parameters():
param.requires_grad = True
def forward(self, img):
return self.backbone(img)
class TextEncoder(nn.Module):
"""Encodes the Caption"""
def __init__(self, model_name):
super().__init__()
self.backbone = AutoModel.from_pretrained(model_name)
for param in self.backbone.parameters():
param.requires_grad = True
def forward(self, captions):
output = self.backbone(**captions)
return output.last_hidden_state[:, 0, :]
class Head(nn.Module):
"""Projects both into Embedding space"""
def __init__(self, embedding_dim, projection_dim):
super().__init__()
self.projection = nn.Linear(embedding_dim, projection_dim)
self.gelu = nn.GELU()
self.fc = nn.Linear(projection_dim, projection_dim)
self.dropout = nn.Dropout(0.3)
self.layer_norm = nn.LayerNorm(projection_dim)
def forward(self, x):
projected = self.projection(x)
x = self.gelu(projected)
x = self.fc(x)
x = self.dropout(x)
x += projected
return self.layer_norm(x)
TITLE: Retrieve Specific Records from a Lance Dataset in Rust DESCRIPTION: Retrieves specific records from a Lance Dataset based on their indices and a projection. The result is a RecordBatch containing the requested data.
SOURCE: https://github.com/lancedb/lance/blob/main/rust/lance/README.md#_snippet_3
LANGUAGE: rust CODE:
let values: Result<RecordBatch> = dataset.take(&[200, 199, 39, 40, 100], &projection).await;
TITLE: Define PyArrow schema for storing PyTorch model weights in Lance
DESCRIPTION: This snippet defines a `pyarrow.Schema` named `GLOBAL_SCHEMA` specifically designed for storing PyTorch model weights within the Lance file format. The schema includes three fields: 'name' (string) for the weight's identifier, 'value' (list of float64) for the flattened weight tensor, and 'shape' (list of int64) to preserve the original dimensions for reconstruction.
LANGUAGE: python CODE:
import pyarrow as pa

GLOBAL_SCHEMA = pa.schema(
[
pa.field("name", pa.string()),
pa.field("value", pa.list_(pa.float64(), -1)),
pa.field("shape", pa.list_(pa.int64(), -1)) # Is a list with variable shape because weights can have any number of dims
]
)
TITLE: Create Lance ImageURIArray from URI List
DESCRIPTION: This snippet demonstrates how to initialize a `lance.arrow.ImageURIArray` from a list of image URIs. This array type is designed to store references to images in various storage systems (local, file, S3) for lazy loading, without validating or loading the images into memory immediately.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/arrays.md#_snippet_3
LANGUAGE: python CODE:
from lance.arrow import ImageURIArray
ImageURIArray.from_uris([
"/tmp/image1.jpg",
"file:///tmp/image2.jpg",
"s3://example/image3.jpg"
])
# <lance.arrow.ImageURIArray object at 0x...>
# ['/tmp/image1.jpg', 'file:///tmp/image2.jpg', 's3://example/image3.jpg']
TITLE: Lance Execution Node Contract Definition DESCRIPTION: Defines the contract for various execution nodes within Lance's I/O execution plan, detailing their parameters, input schemas, and output schemas.
SOURCE: https://github.com/lancedb/lance/blob/main/wiki/I-O-Execution.md#_snippet_0
LANGUAGE: APIDOC CODE:
Execution Nodes:
Scan:
Parameters: dataset, projected columns
Input Schema: N/A
Output Schema: projected columns
Filter:
Parameters: input node, filter
Input Schema: any
Output Schema: input schema + columns in filters
Take:
Parameters: input node
Input Schema: any, must have a "_rowid" column
Output Schema: input schema minus _rowid
KNNFlatExec:
Parameters: input node, query
Input Schema: any
Output Schema: input schema + {"scores"}
KNNIndexExec:
Parameters: dataset
Input Schema: N/A
Output Schema: {"score", "_rowid"}
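TITLE: Inspect an Execution Plan with explain_plan (Illustrative Sketch)
DESCRIPTION: To see how these nodes combine in practice, recent pylance versions expose an explain_plan method on the scanner. This sketch assumes the ./alice_and_bob.lance dataset used elsewhere in these snippets; a scan with a projection and a filter typically surfaces the Scan and Filter nodes described above.
LANGUAGE: python CODE:
import lance

ds = lance.dataset("./alice_and_bob.lance")
scanner = ds.scanner(columns=["name"], filter="age > 21")
# Prints the physical plan, including the execution nodes listed above
print(scanner.explain_plan(verbose=True))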
TITLE: Drop Columns from LanceDB Dataset in Java DESCRIPTION: Shows how to remove specified columns from a LanceDB dataset. This operation simplifies the dataset's schema by eliminating unnecessary fields.
SOURCE: https://github.com/lancedb/lance/blob/main/java/README.md#_snippet_8
LANGUAGE: java CODE:
void dropColumns() {
String datasetPath = ""; // specify a path pointing to a dataset
try (BufferAllocator allocator = new RootAllocator()) {
try (Dataset dataset = Dataset.open(datasetPath, allocator)) {
dataset.dropColumns(Collections.singletonList("name"));
}
}
}
TITLE: Describe Median Latency by IVF Index DESCRIPTION: This snippet groups the DataFrame by the 'ivf' column and calculates descriptive statistics (count, mean, std, min, max, quartiles) for the '50%' (median response time) column. This helps understand latency distribution across different IVF index configurations.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/sift/Results.ipynb#_snippet_3
LANGUAGE: python CODE:
df.groupby("ivf")["50%"].describe()
TITLE: Update Rows with Complex SQL Expressions DESCRIPTION: Shows how to update column values using complex SQL expressions that can reference existing columns, such as incrementing an age column by a fixed value.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/read_and_write.md#_snippet_5
LANGUAGE: python CODE:
import lance
dataset = lance.dataset("./alice_and_bob.lance")
dataset.update({"age": "age + 2"})
TITLE: Add Rows to Lance Dataset
DESCRIPTION: Illustrates two methods for adding new rows to an existing Lance dataset: using the `LanceDataset.insert` method for direct insertion, and using `lance.write_dataset` with `mode="append"` to append new data.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/read_and_write.md#_snippet_2
LANGUAGE: python CODE:
import lance
import pyarrow as pa
table = pa.Table.from_pylist([{"name": "Alice", "age": 20},
{"name": "Bob", "age": 30}])
ds = lance.write_dataset(table, "./insert_example.lance")
new_table = pa.Table.from_pylist([{"name": "Carla", "age": 37}])
ds.insert(new_table)
print(ds.to_table().to_pandas())
# name age
# 0 Alice 20
# 1 Bob 30
# 2 Carla 37
new_table2 = pa.Table.from_pylist([{"name": "David", "age": 42}])
ds = lance.write_dataset(new_table2, ds, mode="append")
print(ds.to_table().to_pandas())
# name age
# 0 Alice 20
# 1 Bob 30
# 2 Carla 37
# 3 David 42
TITLE: Bulk Update Rows in LanceDB Dataset using Merge Insert
DESCRIPTION: Demonstrates how to efficiently replace existing rows in a LanceDB dataset with new data using `merge_insert` and `when_matched_update_all()`. This operation uses a key for matching rows, typically a unique identifier. Note that modified rows are re-inserted, changing their position to the end of the table.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/read_and_write.md#_snippet_7
LANGUAGE: python CODE:
import lance
import pyarrow as pa
dataset = lance.dataset("./alice_and_bob.lance")
print(dataset.to_table().to_pandas())
# name age
# 0 Alice 20
# 1 Bob 30
# Change the ages of both Alice and Bob
new_table = pa.Table.from_pylist([{"name": "Alice", "age": 2},
{"name": "Bob", "age": 3}])
# This will use `name` as the key for matching rows. Merge insert
# uses a JOIN internally and so you typically want this column to
# be a unique key or id of some kind.
rst = dataset.merge_insert("name") \
.when_matched_update_all() \
.execute(new_table)
print(dataset.to_table().to_pandas())
# name age
# 0 Alice 2
# 1 Bob 3
TITLE: Load Single Weight Tensor from Lance Dataset (Python) DESCRIPTION: This function converts a single weight entry, retrieved as a dictionary from a Lance dataset, into a PyTorch tensor. It reshapes the flattened 'value' array using the 'shape' information stored within the weight dictionary. The output is a torch.Tensor ready for further processing.
LANGUAGE: python CODE:
import torch

def _load_weight(weight: dict) -> torch.Tensor:
"""Converts a weight dict to a torch tensor"""
return torch.tensor(weight["value"], dtype=torch.float64).reshape(weight["shape"])
TITLE: Perform Parallel Writes with lance.fragment.write_fragments
DESCRIPTION: This code demonstrates how to write new data fragments in parallel across multiple workers using `lance.fragment.write_fragments`. Each worker generates its own set of fragments, which are then printed for verification. This is the first phase of a distributed write operation, preparing data for a later commit.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/distributed_write.md#_snippet_0
LANGUAGE: python CODE:
import json
import pyarrow as pa
from lance.fragment import write_fragments
# Run on each worker
data_uri = "./dist_write"
schema = pa.schema([
("a", pa.int32()),
("b", pa.string()),
])
# Run on worker 1
data1 = {
"a": [1, 2, 3],
"b": ["x", "y", "z"],
}
fragments_1 = write_fragments(data1, data_uri, schema=schema)
print("Worker 1: ", fragments_1)
# Run on worker 2
data2 = {
"a": [4, 5, 6],
"b": ["u", "v", "w"],
}
fragments_2 = write_fragments(data2, data_uri, schema=schema)
print("Worker 2: ", fragments_2)
TITLE: Drop Lance Dataset in Java
DESCRIPTION: This Java code illustrates how to permanently delete a Lance dataset from the file system. It takes the dataset's path and uses the static `Dataset.drop` method to remove all associated files and metadata. This operation is irreversible and should be used with caution.
SOURCE: https://github.com/lancedb/lance/blob/main/java/README.md#_snippet_4
LANGUAGE: Java CODE:
void dropDataset() {
String datasetPath = tempDir.resolve("drop_stream").toString();
Dataset.drop(datasetPath, new HashMap<>());
}
TITLE: LanceDB Statistics Storage DESCRIPTION: Describes how statistics (null count, min, max) are stored within Lance files in a columnar format, enabling selective reading for query optimization.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/format/format.md#_snippet_14
LANGUAGE: APIDOC CODE:
Statistics Storage:
- Location: Stored within Lance files.
- Purpose: Determine which pages to skip during queries.
- Data Points: null count, lower bound (min), upper bound (max).
- Format: Lance's columnar format.
- Benefit: Allows selective reading of relevant stats columns.
TITLE: Alter Columns in LanceDB Dataset in Java DESCRIPTION: Illustrates how to modify existing columns within a LanceDB dataset. This includes renaming a column, changing its nullability, or casting its data type to a new ArrowType, facilitating schema adjustments.
SOURCE: https://github.com/lancedb/lance/blob/main/java/README.md#_snippet_7
LANGUAGE: java CODE:
void alterColumns() {
String datasetPath = ""; // specify a path pointing to a dataset
try (BufferAllocator allocator = new RootAllocator()) {
try (Dataset dataset = Dataset.open(datasetPath, allocator)) {
ColumnAlteration nameColumnAlteration =
new ColumnAlteration.Builder("name")
.rename("new_name")
.nullable(true)
.castTo(new ArrowType.Utf8())
.build();
dataset.alterColumns(Collections.singletonList(nameColumnAlteration));
}
}
}
TITLE: Group and sort captions by image ID DESCRIPTION: This section iterates through all unique image IDs found in the annotations. For each image, it collects all associated captions and sorts them based on their original annotation number, ensuring the correct order of captions for each image. The result is a list of tuples, each containing an image ID and a tuple of its ordered captions.
LANGUAGE: python CODE:
captions = []
image_ids = set(ann[0] for ann in annotations)
for img_id in tqdm(image_ids):
current_img_captions = []
for ann_img_id, num, caption in annotations:
if img_id == ann_img_id:
current_img_captions.append((num, caption))
# Sort by the annotation number
current_img_captions.sort(key=lambda x: x[0])
captions.append((img_id, tuple([x[1] for x in current_img_captions])))
TITLE: Create Scalar Index with Lindera Tokenizer in Python DESCRIPTION: Python code demonstrating how to create a scalar index on a 'text' field using the 'INVERTED' index type, specifying 'lindera/ipadic' as the base tokenizer for text processing within LanceDB.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/tokenizer.md#_snippet_5
LANGUAGE: python CODE:
ds.create_scalar_index("text", "INVERTED", base_tokenizer="lindera/ipadic")
TITLE: Create Pandas Series with Lance BFloat16 Dtype
DESCRIPTION: This snippet demonstrates how to create a Pandas Series using the `lance.bfloat16` custom dtype. It shows the initialization of a Series with floating-point numbers, which are then converted to the BFloat16 format, suitable for machine learning applications.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/arrays.md#_snippet_0
LANGUAGE: python CODE:
import lance.arrow
pd.Series([1.1, 2.1, 3.4], dtype="lance.bfloat16")
# 0 1.1015625
# 1 2.09375
# 2 3.40625
# dtype: lance.bfloat16
TITLE: Define Lance Schema with Blob Column in Python
DESCRIPTION: This Python code demonstrates how to define a PyArrow schema for a Lance dataset, marking a `large_binary` column as a blob column by setting the `lance-encoding:blob` metadata to `true`. This configuration enables Lance to efficiently store and retrieve large binary objects.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/blob.md#_snippet_0
LANGUAGE: python CODE:
import pyarrow as pa
schema = pa.schema(
[
pa.field("id", pa.int64()),
pa.field("video",
pa.large_binary(),
metadata={"lance-encoding:blob": "true"}
),
]
)
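TITLE: Write Blob Data with the Blob-Enabled Schema (Hypothetical Example)
DESCRIPTION: A minimal sketch showing how a table can be written against the blob-enabled schema defined above. The id values, placeholder byte payloads, and the /tmp/videos.lance path are illustrative assumptions.
LANGUAGE: python CODE:
import lance
import pyarrow as pa

# Reuses the blob-enabled `schema` from the previous snippet
table = pa.table(
    {
        "id": [1, 2],
        "video": [b"\x00" * 1024, b"\x01" * 2048],  # stand-in binary payloads
    },
    schema=schema,
)
ds = lance.write_dataset(table, "/tmp/videos.lance", schema=schema)
print(ds.count_rows())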
TITLE: Describe Median Latency by NProbes DESCRIPTION: This snippet groups the DataFrame by the 'nprobes' column and calculates descriptive statistics for the '50%' (median response time) column. This helps analyze how the number of probes affects median query latency.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/sift/Results.ipynb#_snippet_5
LANGUAGE: python CODE:
df.groupby("nprobes")["50%"].describe()
TITLE: Test LanceDB Vector Search with NYT TF-IDF Vectors (Cosine Metric) DESCRIPTION: Illustrates testing LanceDB's vector search with real-world data: sparse TF-IDF vectors from the New York Times dataset, projected to 256 dimensions. It uses the cosine similarity metric and custom index parameters (num_partitions=256, num_sub_vectors=32).
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/full_report/report.ipynb#_snippet_6
LANGUAGE: python CODE:
# test NYT -- TF-IDF sparse vectors projected on to 256D dense -- cosine
data = _get_nyt_vectors()
data = data[np.linalg.norm(data, axis=1) != 0]
data = np.unique(data, axis=0)
query = np.random.standard_normal((100, 256))
recall_data = run_test(
data,
query,
"cosine",
num_partitions=256,
num_sub_vectors=32,
)
make_plot(recall_data)
TITLE: Test LanceDB Vector Search with NYT TF-IDF Vectors (Normalized L2 Metric) DESCRIPTION: Presents a test case using the same NYT TF-IDF vectors, but normalized for L2 distance, effectively making it equivalent to cosine similarity on normalized vectors. It uses the L2 metric with specific index parameters (num_partitions=512, num_sub_vectors=32) and visualizes the recall.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/full_report/report.ipynb#_snippet_7
LANGUAGE: python CODE:
# test NYT -- TF-IDF sparse vectors projected on to 256D dense -- normalized L2
data = _get_nyt_vectors()
data = data[np.linalg.norm(data, axis=1) != 0]
data = np.unique(data, axis=0)
data /= np.linalg.norm(data, axis=1)[:, None]
# use the same out of sample query
recall_data = run_test(
data,
query,
"L2",
num_partitions=512,
num_sub_vectors=32,
)
make_plot(recall_data)
TITLE: Update Rows in Lance Dataset by SQL Expression
DESCRIPTION: Demonstrates how to update specific columns of rows in a Lance dataset using the `lance.LanceDataset.update` method. The update values are SQL expressions, allowing for direct value assignment.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/read_and_write.md#_snippet_4
LANGUAGE: python CODE:
import lance
dataset = lance.dataset("./alice_and_bob.lance")
dataset.update({"name": "'Bob'"}, where="name = 'Blob'")
TITLE: Iteratively Read Large Lance Dataset in Batches
DESCRIPTION: This Python snippet demonstrates how to read a Lance dataset in batches, which is ideal for datasets too large to fit into memory. It uses `to_batches()` with column projection and filter push-down, allowing processing of data chunks iteratively.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/read_and_write.md#_snippet_15
LANGUAGE: python CODE:
for batch in ds.to_batches(columns=["image"], filter="label = 10"):
# do something with batch
compute_on_batch(batch)
TITLE: Perform Upsert Operation (Update or Insert) in LanceDB
DESCRIPTION: Shows how to combine `when_matched_update_all()` and `when_not_matched_insert_all()` within `merge_insert` to achieve an 'upsert' behavior. This operation updates rows if they exist and inserts them if they do not, providing a flexible way to synchronize data.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/read_and_write.md#_snippet_9
LANGUAGE: python CODE:
import lance
import pyarrow as pa
# Change Carla's age and insert David
new_table = pa.Table.from_pylist([{"name": "Carla", "age": 27},
{"name": "David", "age": 42}])
dataset = lance.dataset("./alice_and_bob.lance")
# This will update Carla and insert David
_ = dataset.merge_insert("name") \
.when_matched_update_all() \
.when_not_matched_insert_all() \
.execute(new_table)
# Verify the results
print(dataset.to_table().to_pandas())
# name age
# 0 Alice 20
# 1 Bob 30
# 2 Carla 27
# 3 David 42
TITLE: Configure LanceDB for S3 Express One Zone Buckets
DESCRIPTION: Shows how to explicitly configure LanceDB to access S3 Express One Zone (directory) buckets, especially when the bucket name is hidden by an access point or private link. This involves setting the `region` and `s3_express` flag in `storage_options` for direct access.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/object_store.md#_snippet_6
LANGUAGE: python CODE:
import lance
ds = lance.dataset(
"s3://my-bucket--use1-az4--x-s3/path/imagenet.lance",
storage_options={
"region": "us-east-1",
"s3_express": "true",
}
)
TITLE: Add Schema-Only Columns to Lance Dataset
DESCRIPTION: Demonstrates how to add new columns to a Lance dataset without populating them, using `pyarrow.Field` or `pyarrow.Schema`. This operation is metadata-only and very efficient, useful for lazy population.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/data_evolution.md#_snippet_0
LANGUAGE: python CODE:
table = pa.table({"id": pa.array([1, 2, 3])})
dataset = lance.write_dataset(table, "null_columns")
# With pyarrow Field
dataset.add_columns(pa.field("embedding", pa.list_(pa.float32(), 128)))
assert dataset.schema == pa.schema([
("id", pa.int64()),
("embedding", pa.list_(pa.float32(), 128)),
])
# With pyarrow Schema
dataset.add_columns(pa.schema([
("label", pa.string()),
("score", pa.float32()),
]))
assert dataset.schema == pa.schema([
("id", pa.int64()),
("embedding", pa.list_(pa.float32(), 128)),
("label", pa.string()),
("score", pa.float32()),
])
TITLE: Commit Collected Fragments to a Lance Dataset
DESCRIPTION: After parallel writes, this snippet shows how to serialize fragment metadata from all workers, collect it on a single worker, and then commit it to a Lance dataset using `lance.LanceOperation.Overwrite`. It verifies the commit by reading the dataset and asserting its properties, demonstrating the final step of a distributed write.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/distributed_write.md#_snippet_1
LANGUAGE: python CODE:
import json
import lance
from lance import FragmentMetadata, LanceOperation
# Serialize Fragments into JSON data
fragments_json1 = [json.dumps(fragment.to_json()) for fragment in fragments_1]
fragments_json2 = [json.dumps(fragment.to_json()) for fragment in fragments_2]
# On one worker, collect all fragments
all_fragments = [FragmentMetadata.from_json(f) for f in \
fragments_json1 + fragments_json2]
# Commit the fragments into a single dataset
# Use LanceOperation.Overwrite to overwrite the dataset or create new dataset.
op = lance.LanceOperation.Overwrite(schema, all_fragments)
read_version = 0 # Because the dataset is empty at this point.
lance.LanceDataset.commit(
data_uri,
op,
read_version=read_version,
)
# We can read the dataset using the Lance API:
dataset = lance.dataset(data_uri)
assert len(dataset.get_fragments()) == 2
assert dataset.version == 1
print(dataset.to_table().to_pandas())
TITLE: Merge Pre-computed Columns into Lance Dataset
DESCRIPTION: Explains how to integrate pre-computed columns into an existing Lance dataset using the `merge` method. This approach avoids rewriting the entire dataset by joining new data based on a specified column, as demonstrated with an 'id' column.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/data_evolution.md#_snippet_3
LANGUAGE: python CODE:
table = pa.table({
"id": pa.array([1, 2, 3]),
"embedding": pa.array([np.array([1, 2, 3]), np.array([4, 5, 6]),
np.array([7, 8, 9])])
})
dataset = lance.write_dataset(table, "embeddings", mode="overwrite")
new_data = pa.table({
"id": pa.array([1, 2, 3]),
"label": pa.array(["horse", "rabbit", "cat"])
})
dataset.merge(new_data, "id")
print(dataset.to_table().to_pandas())
TITLE: SQL Filter Expression with Escaped Column Names DESCRIPTION: This SQL snippet shows how to handle column names that are SQL keywords or contain special characters (like spaces) by escaping them with backticks. It also demonstrates accessing nested fields with escaped names to ensure correct parsing.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/read_and_write.md#_snippet_17
LANGUAGE: sql CODE:
`CUBE` = 10 AND `column name with space` IS NOT NULL
AND `nested with space`.`inner with space` < 2
TITLE: LanceDB Page-level Statistics Schema Definition DESCRIPTION: This schema defines the structure for storing page-level statistics for each field (column) within a Lance file. It includes the null count, minimum value, and maximum value for each field, typed according to the field's original data type. The schema is flexible, allowing for missing fields and future extensions.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/format/format.md#_snippet_15
LANGUAGE: APIDOC CODE:
<field_id_1>: struct
null_count: i64
min_value: <field_1_data_type>
max_value: <field_1_data_type>
...
<field_id_N>: struct
null_count: i64
min_value: <field_N_data_type>
max_value: <field_N_data_type>
TITLE: Define Custom TensorSpec for Lance TensorFlow Dataset Output
DESCRIPTION: This code shows how to explicitly define the `tf.TensorSpec` for the output signature of a `tf.data.Dataset` created from Lance. This is crucial for precise type and shape control, especially when automatic inference is insufficient or for complex data structures like ragged tensors.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/integrations/tensorflow.md#_snippet_2
LANGUAGE: python CODE:
batch_size = 256
ds = lance.tf.data.from_lance(
"s3://my-bucket/my-dataset",
columns=["image", "labels"],
batch_size=batch_size,
output_signature={
"image": tf.TensorSpec(shape=(), dtype=tf.string),
"labels": tf.RaggedTensorSpec(
dtype=tf.int32, shape=(batch_size, None), ragged_rank=1),
},
)
TITLE: SQL Literals for Date, Timestamp, and Decimal Types DESCRIPTION: This SQL snippet illustrates how to specify literals for date, timestamp, and decimal columns in Lance filter expressions. It shows the syntax for casting string values to specific data types, ensuring correct interpretation during query execution.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/read_and_write.md#_snippet_18
LANGUAGE: sql CODE:
date_col = date '2021-01-01'
and timestamp_col = timestamp '2021-01-01 00:00:00'
and decimal_col = decimal(8,3) '1.000'
TITLE: Add New Columns to a Lance Dataset in a Distributed Manner
DESCRIPTION: This snippet demonstrates adding new columns to a Lance dataset efficiently without copying existing data. It shows how to merge columns on individual fragments across workers using `frag.merge_columns` and then commit the changes using `lance.LanceOperation.Merge` on a single worker. This leverages Lance's two-dimensional layout for metadata-only operations, making column additions highly efficient.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/distributed_write.md#_snippet_3
LANGUAGE: python CODE:
import lance
from pyarrow import RecordBatch
import pyarrow as pa
import pyarrow.compute as pc
dataset = lance.dataset("./add_columns_example")
assert len(dataset.get_fragments()) == 2
assert dataset.to_table().combine_chunks() == pa.Table.from_pydict({
"name": ["alice", "bob", "charlie", "craig", "dave", "eve"],
"age": [25, 33, 44, 55, 66, 77],
}, schema=schema)
def name_len(names: RecordBatch) -> RecordBatch:
return RecordBatch.from_arrays(
[pc.utf8_length(names["name"])],
["name_len"],
)
# On Worker 1
frag1 = dataset.get_fragments()[0]
new_fragment1, new_schema = frag1.merge_columns(name_len, ["name"])
# On Worker 2
frag2 = dataset.get_fragments()[1]
new_fragment2, _ = frag2.merge_columns(name_len, ["name"])
# On Worker 3 - Commit
all_fragments = [new_fragment1, new_fragment2]
op = lance.LanceOperation.Merge(all_fragments, schema=new_schema)
lance.LanceDataset.commit(
"./add_columns_example",
op,
read_version=dataset.version,
)
# Verify dataset
dataset = lance.dataset("./add_columns_example")
print(dataset.to_table().to_pandas())
TITLE: Plot Median Query Latency Histogram
DESCRIPTION: This snippet generates a histogram of the median query latency using seaborn's `displot` function. It visualizes the distribution of the '50%' column (median response time) from the DataFrame and sets appropriate x and y axis labels.
SOURCE: https://github.com/lancedb/lance/blob/main/benchmarks/sift/Results.ipynb#_snippet_2
LANGUAGE: python CODE:
ax = sns.displot(df, x="50%")
ax.set(xlabel="Median response time seconds", ylabel="Number of configurations")
TITLE: Implement Custom PyTorch Sampler for Non-Overlapping Data
DESCRIPTION: The `LanceSampler` class is a custom PyTorch `Sampler` designed to prevent overlapping samples during LLM training, which can lead to overfitting. It ensures that the indices returned are `block_size` apart, guaranteeing that each sample processed by the model is unique and non-redundant. The sampler pre-calculates and shuffles the available indices, yielding them during iteration to provide distinct data chunks for each batch.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/python/llm_training.md#_snippet_3
LANGUAGE: python CODE:
import numpy as np
from torch.utils.data import Sampler

class LanceSampler(Sampler):
r"""Samples tokens randomly but `block_size` indices apart.
Args:
data_source (Dataset): dataset to sample from
block_size (int): minimum index distance between each random sample
"""
def __init__(self, data_source, block_size=512):
self.data_source = data_source
self.num_samples = len(self.data_source)
self.available_indices = list(range(0, self.num_samples, block_size))
np.random.shuffle(self.available_indices)
def __iter__(self):
yield from self.available_indices
def __len__(self) -> int:
return len(self.available_indices)
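TITLE: Wire LanceSampler into a PyTorch DataLoader (Illustrative Sketch)
DESCRIPTION: A short sketch of how the sampler above can be plugged into a DataLoader. The dataset variable stands for any map-style PyTorch dataset of token windows, and the batch size is arbitrary; both are assumptions for illustration.
LANGUAGE: python CODE:
from torch.utils.data import DataLoader

# The sampler yields start indices at least `block_size` apart,
# so no two batches contain overlapping token windows.
sampler = LanceSampler(dataset, block_size=512)
dataloader = DataLoader(
    dataset,
    batch_size=8,
    sampler=sampler,  # shuffling is handled inside LanceSampler
)
for batch in dataloader:
    pass  # forward/backward pass goes here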
TITLE: Insert New Rows Only in LanceDB Dataset
DESCRIPTION: Illustrates how to use `merge_insert` with `when_not_matched_insert_all()` to insert data only if it doesn't already exist in the dataset. This is useful for preventing duplicate entries when processing batches of data where some records might have been added previously.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/read_and_write.md#_snippet_8
LANGUAGE: python CODE:
# Bob is already in the table, but Carla is new
new_table = pa.Table.from_pylist([{"name": "Bob", "age": 30},
{"name": "Carla", "age": 37}])
dataset = lance.dataset("./alice_and_bob.lance")
# This will insert Carla but leave Bob unchanged
_ = dataset.merge_insert("name") \
.when_not_matched_insert_all() \
.execute(new_table)
# Verify that Carla was added but Bob remains unchanged
print(dataset.to_table().to_pandas())
# name age
# 0 Alice 20
# 1 Bob 30
# 2 Carla 37
TITLE: Replace Filtered Data with New Rows in LanceDB
DESCRIPTION: Explains a less common but powerful use case of `merge_insert` to replace a specific region of existing rows (defined by a filter) with new data. This effectively acts as a combined delete and insert operation within a single transaction, using `when_not_matched_by_source_delete()`.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/read_and_write.md#_snippet_10
LANGUAGE: python CODE:
import lance
import pyarrow as pa
new_table = pa.Table.from_pylist([{"name": "Edgar", "age": 46},
{"name": "Francene", "age": 44}])
dataset = lance.dataset("./alice_and_bob.lance")
print(dataset.to_table().to_pandas())
# name age
# 0 Alice 20
# 1 Bob 30
# 2 Charlie 45
# 3 Donna 50
# This will remove anyone above 40 and insert our new data
_ = dataset.merge_insert("name") \
.when_not_matched_insert_all() \
.when_not_matched_by_source_delete("age >= 40") \
.execute(new_table)
# Verify the results - people over 40 replaced with new data
print(dataset.to_table().to_pandas())
# name age
# 0 Alice 20
# 1 Bob 30
# 2 Edgar 46
# 3 Francene 44
TITLE: Distributed Training with Shuffled Lance Fragments in TensorFlow
DESCRIPTION: This snippet outlines a strategy for distributed training by sharding and shuffling Lance fragments across multiple workers. It uses `lance_fragments` to manage the distribution of data, ensuring each worker processes a unique subset of the dataset for efficient parallel training.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/integrations/tensorflow.md#_snippet_3
LANGUAGE: python CODE:
import tensorflow as tf
from lance.tf.data import from_lance, lance_fragments
world_size = 32
rank = 10
seed = 123
epoch = 100
dataset_uri = "s3://my-bucket/my-dataset"
# Shuffle fragments across workers
fragments = (
    lance_fragments(dataset_uri)
    .shuffle(32, seed=seed)
    .repeat(epoch)
    .enumerate()
    .filter(lambda i, _: i % world_size == rank)
    .map(lambda _, fid: fid)
)
ds = from_lance(
    dataset_uri,
columns=["image", "label"],
fragments=fragments,
batch_size=32
)
for batch in ds:
print(batch)
TITLE: LanceDB Deletion File Naming Convention DESCRIPTION: This snippet specifies the naming convention for deletion files in LanceDB, which are used to mark rows for deletion. It details the components of the filename, including fragment ID, read version, and a random ID, along with the file type suffix (Arrow or Roaring Bitmap).
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/format/format.md#_snippet_9
LANGUAGE: text CODE:
_deletions/{fragment_id}-{read_version}-{random_id}.{arrow|bin}
TITLE: Convert NumPy BFloat16 Array to Lance Extension Arrays
DESCRIPTION: This snippet demonstrates how to convert an existing NumPy array of `bfloat16` dtype into Lance's `PandasBFloat16Array` or `BFloat16Array`. It showcases the interoperability between NumPy's `ml_dtypes` and Lance's extension arrays, facilitating data integration.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/arrays.md#_snippet_2
LANGUAGE: python CODE:
import numpy as np
from ml_dtypes import bfloat16
from lance.arrow import PandasBFloat16Array, BFloat16Array
np_array = np.array([1.1, 2.1, 3.4], dtype=bfloat16)
PandasBFloat16Array.from_numpy(np_array)
# <PandasBFloat16Array>
# [1.1015625, 2.09375, 3.40625]
# Length: 3, dtype: lance.bfloat16
BFloat16Array.from_numpy(np_array)
# <lance.arrow.BFloat16Array object at 0x...>
# [
# 1.1015625,
# 2.09375,
# 3.40625
# ]
TITLE: Rename Nested Columns in LanceDB Dataset
DESCRIPTION: This snippet demonstrates how to rename nested columns within a LanceDB dataset using `lance.LanceDataset.alter_columns`. It shows how to specify nested paths using dot notation (e.g., 'meta.id') and verifies the renaming by printing the dataset's content.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/data_evolution.md#_snippet_6
LANGUAGE: python CODE:
data = [
{"meta": {"id": 1, "name": "Alice"}},
{"meta": {"id": 2, "name": "Bob"}}
]
schema = pa.schema([
("meta", pa.struct([
("id", pa.int32()),
("name", pa.string()),
]))
])
dataset = lance.write_dataset(data, "nested_rename")
dataset.alter_columns({"path": "meta.id", "name": "new_id"})
print(dataset.to_table().to_pandas())
# meta
# 0 {'new_id': 1, 'name': 'Alice'}
# 1 {'new_id': 2, 'name': 'Bob'}
TITLE: Delete Rows from Lance Dataset by SQL Filter
DESCRIPTION: Explains how to delete rows from a Lance dataset using a SQL-like filter expression with the `LanceDataset.delete` method. Note that this operation creates a new version of the dataset, which must be reopened to see the changes.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/read_and_write.md#_snippet_3
LANGUAGE: python CODE:
import lance
dataset = lance.dataset("./alice_and_bob.lance")
dataset.delete("name = 'Bob'")
dataset2 = lance.dataset("./alice_and_bob.lance")
print(dataset2.to_table().to_pandas())
# name age
# 0 Alice 20