Based on your request, here is a copy of the provided code snippets, filtered to include only those relevant to the Rust ecosystem. This includes snippets written in Rust, shell commands for building or testing Rust code (e.g., using cargo or maturin), and configurations for native development tools like lldb.
======================== CODE SNIPPETS
TITLE: Perform Python development installation
DESCRIPTION: These commands navigate into the python directory and perform a development installation of the Lance Python bindings. This allows developers to import and test changes to the Python wrapper directly.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/community/contributing.md#_snippet_1
LANGUAGE: bash CODE:
cd python
maturin develop
TITLE: Install Python bindings build tool
DESCRIPTION: This command installs maturin, a tool essential for building Python packages that integrate with Rust code. It's a prerequisite for setting up the Python development environment for Lance.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/community/contributing.md#_snippet_0
LANGUAGE: bash CODE:
pip install maturin
TITLE: Install Linux Perf Tools and Configure Kernel Parameters
DESCRIPTION: Installs the necessary Linux performance tools (perf) on Ubuntu systems and configures the perf_event_paranoid kernel parameter. This setup is crucial for allowing non-root users to collect performance data with tools like perf and flamegraph.
SOURCE: https://github.com/lancedb/lance/blob/wiki/Debug.md#_snippet_4
LANGUAGE: sh CODE:
sudo apt install linux-tools-common linux-tools-generic linux-tools-`uname -r`
sudo sh -c "echo -1 > /proc/sys/kernel/perf_event_paranoid"
TITLE: Run Rust unit tests DESCRIPTION: This command executes the unit tests for the Rust core format. Running these tests verifies the correctness of the Rust implementation.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/community/contributing.md#_snippet_6
LANGUAGE: bash CODE:
cargo test
TITLE: Profile a LanceDB benchmark using flamegraph
DESCRIPTION: Generates a flamegraph for a specific benchmark using cargo-flamegraph, aiding performance analysis. It's recommended to run the benchmarks once beforehand so that setup time is not captured in the profile.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_14
LANGUAGE: shell CODE:
flamegraph -F 100 --no-inline -- $(which python) \
-m pytest python/benchmarks \
--benchmark-min-time=2 \
-k test_ivf_pq_index_search
TITLE: Install Flamegraph Tool
DESCRIPTION: Installs the flamegraph profiling tool using Cargo, Rust's package manager. This tool is essential for visualizing CPU usage and call stacks as flame graphs for performance analysis.
SOURCE: https://github.com/lancedb/lance/blob/wiki/Debug.md#_snippet_3
LANGUAGE: sh CODE:
cargo install flamegraph
TITLE: Install Lance Build Dependencies on Ubuntu DESCRIPTION: This command installs necessary system-level dependencies for building Lance on Ubuntu 22.04, including protobuf, SSL development libraries, and general build tools.
SOURCE: https://github.com/lancedb/lance/blob/main/wiki/How-to-Build.md#_snippet_0
LANGUAGE: bash CODE:
sudo apt install protobuf-compiler libssl-dev build-essential pkg-config gfortran
TITLE: Build Rust core format (release) DESCRIPTION: This command compiles the Rust core format in release mode. The release build is optimized for performance and is suitable for production deployments or benchmarking.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/community/contributing.md#_snippet_5
LANGUAGE: bash CODE:
cargo build -r
TITLE: Debug Python Script with LLDB DESCRIPTION: Demonstrates how to start an LLDB debugging session for a Python script. It involves launching LLDB with the Python interpreter from a virtual environment and then running the target script within the LLDB prompt.
SOURCE: https://github.com/lancedb/lance/blob/wiki/Debug.md#_snippet_2
LANGUAGE: sh CODE:
$ lldb ./venv/bin/python
(lldb) r script.py
TITLE: Install Lance Build Dependencies on Mac DESCRIPTION: This command installs the protobuf compiler using Homebrew, a required dependency for building Lance on macOS.
SOURCE: https://github.com/lancedb/lance/blob/main/wiki/How-to-Build.md#_snippet_1
LANGUAGE: bash CODE:
brew install protobuf
TITLE: Configure LLDB Initialization Settings
DESCRIPTION: Sets up basic LLDB initialization settings in the ~/.lldbinit file. This includes configuring the number of source-code lines to display before and after a stop, and enabling the loading of .lldbinit files from the current working directory.
SOURCE: https://github.com/lancedb/lance/blob/wiki/Debug.md#_snippet_0
LANGUAGE: lldb CODE:
# ~/.lldbinit
settings set stop-line-count-before 15
settings set stop-line-count-after 15
settings set target.load-cwd-lldbinit true
TITLE: Complete Lance Dataset Write and Read Example in Rust
DESCRIPTION: This Rust main function provides a complete example demonstrating the usage of the write_dataset and read_dataset functions. It sets up the necessary arrow and lance imports, defines a temporary data path, and orchestrates the writing and subsequent reading of a Lance dataset.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/rust/write_read_dataset.md#_snippet_2
LANGUAGE: Rust CODE:
use arrow::array::UInt32Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::{RecordBatch, RecordBatchIterator};
use futures::StreamExt;
use lance::dataset::{WriteMode, WriteParams};
use lance::Dataset;
use std::sync::Arc;

#[tokio::main]
async fn main() {
    let data_path: &str = "./temp_data.lance";

    write_dataset(data_path).await;
    read_dataset(data_path).await;
}
TITLE: Rust: Main Workflow for WikiText to LanceDB Ingestion
DESCRIPTION: This comprehensive example demonstrates the full data-ingestion pipeline in Rust. It initializes a Tokio runtime, loads a tokenizer, sets up the Hugging Face API to download the WikiText Parquet files, processes them through a WikiTextBatchReader, and finally writes the data to a Lance dataset. It also verifies the created dataset.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/rust/llm_dataset_creation.md#_snippet_2
LANGUAGE: Rust CODE:
fn main() -> Result<(), Box<dyn Error + Send + Sync>> {
    let rt = tokio::runtime::Runtime::new()?;
    rt.block_on(async {
        // Load tokenizer
        let tokenizer = load_tokenizer("gpt2")?;

        // Set up Hugging Face API
        // Download from https://huggingface.co/datasets/Salesforce/wikitext/tree/main/wikitext-103-raw-v1
        let api = Api::new()?;
        let repo = api.repo(Repo::with_revision(
            "Salesforce/wikitext".into(),
            RepoType::Dataset,
            "main".into(),
        ));

        // Define the parquet files we want to download
        let train_files = vec![
            "wikitext-103-raw-v1/train-00000-of-00002.parquet",
            "wikitext-103-raw-v1/train-00001-of-00002.parquet",
        ];

        let mut parquet_readers = Vec::new();
        for file in &train_files {
            println!("Downloading file: {}", file);
            let file_path = repo.get(file)?;
            let data = std::fs::read(file_path)?;

            // Create a temporary file in the system temp directory and write the downloaded data to it
            let mut temp_file = NamedTempFile::new()?;
            temp_file.write_all(&data)?;

            // Create the parquet reader builder with a larger batch size
            let builder = ParquetRecordBatchReaderBuilder::try_new(temp_file.into_file())?
                .with_batch_size(8192); // Increase batch size for better performance
            parquet_readers.push(builder);
        }

        if parquet_readers.is_empty() {
            println!("No parquet files found to process.");
            return Ok(());
        }

        // Create batch reader
        let num_samples: u64 = 500_000;
        let batch_reader = WikiTextBatchReader::new(parquet_readers, tokenizer, Some(num_samples))?;

        // Save as Lance dataset
        println!("Writing to Lance dataset...");
        let lance_dataset_path = "rust_wikitext_lance_dataset.lance";
        let write_params = WriteParams::default();
        lance::Dataset::write(batch_reader, lance_dataset_path, Some(write_params)).await?;

        // Verify the dataset
        let ds = lance::Dataset::open(lance_dataset_path).await?;
        let scanner = ds.scan();
        let mut stream = scanner.try_into_stream().await?;
        let mut total_rows = 0;
        while let Some(batch_result) = stream.next().await {
            let batch = batch_result?;
            total_rows += batch.num_rows();
        }

        println!(
            "Lance dataset created successfully with {} rows",
            total_rows
        );
        println!("Dataset location: {}", lance_dataset_path);

        Ok(())
    })
}
TITLE: Build and Test Pylance Python Package DESCRIPTION: These commands set up a Python virtual environment, install maturin for Rust-Python binding, build the Pylance package in debug mode, and then run its associated tests.
SOURCE: https://github.com/lancedb/lance/blob/main/wiki/How-to-Build.md#_snippet_3
LANGUAGE: bash CODE:
cd python
python3 -m venv venv
source venv/bin/activate
pip install maturin
# Build debug build
maturin develop --extras tests
# Run pytest
pytest python/tests/
TITLE: Install Lance using Cargo DESCRIPTION: Installs the Lance Rust library as a command-line tool using the Cargo package manager.
SOURCE: https://github.com/lancedb/lance/blob/main/rust/lance/README.md#_snippet_0
LANGUAGE: shell CODE:
cargo install lance
TITLE: Build pylance in release mode for benchmarks
DESCRIPTION: Builds the pylance module in release mode with debug symbols, enabling benchmark execution and profiling. It includes benchmark-specific extras and features for data generation.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_10
LANGUAGE: shell CODE:
maturin develop --profile release-with-debug --extras benchmarks --features datagen
TITLE: Query Lance Dataset with Simple SQL in Rust DataFusion
DESCRIPTION: This Rust example demonstrates how to register a Lance dataset as a table in DataFusion using LanceTableProvider and execute a simple SQL SELECT query to retrieve the first 10 rows. It shows the basic setup for integrating Lance with DataFusion.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/integrations/datafusion.md#_snippet_0
LANGUAGE: rust CODE:
use datafusion::prelude::SessionContext;
use crate::datafusion::LanceTableProvider;

let ctx = SessionContext::new();
ctx.register_table("dataset",
    Arc::new(LanceTableProvider::new(
        Arc::new(dataset.clone()),
        /* with_row_id */ false,
        /* with_row_addr */ false,
    )))?;
let df = ctx.sql("SELECT * FROM dataset LIMIT 10").await?;
let result = df.collect().await?;
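For quick inspection of the collected batches, arrow's pretty-printing helper (re-exported by DataFusion) can render them as a table. A minimal sketch, assuming the result vector from the snippet above and that the arrow prettyprint utility is enabled:

use datafusion::arrow::util::pretty::pretty_format_batches;

// Render the collected RecordBatches as an ASCII table.
println!("{}", pretty_format_batches(&result)?);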
TITLE: Run LanceDB code formatters
DESCRIPTION: Applies code-formatting rules to the entire project. Specific commands like make format-python or cargo fmt can be used for language-specific formatting.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_4
LANGUAGE: shell CODE:
make format
TITLE: Build and Search HNSW Index for Vector Similarity in Rust
DESCRIPTION: This Rust code provides a complete example of vector similarity search. It defines a ground_truth function for L2-distance calculation, a create_test_vector_dataset function that generates synthetic fixed-size-list vectors, and a main function that orchestrates the process: it generates or loads a dataset, builds an HNSW index using lance_index::vector::hnsw, and then performs vector searches, measuring construction and search times and computing recall against the ground truth.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/rust/hnsw.md#_snippet_0
LANGUAGE: Rust CODE:
use std::collections::HashSet;
use std::sync::Arc;

use arrow::array::{types::Float32Type, Array, FixedSizeListArray};
use arrow::array::{AsArray, FixedSizeListBuilder, Float32Builder};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use arrow::record_batch::RecordBatchIterator;
use arrow_select::concat::concat;
use futures::stream::StreamExt;
use lance::Dataset;
use lance_index::vector::v3::subindex::IvfSubIndex;
use lance_index::vector::{
    flat::storage::FlatFloatStorage,
    hnsw::{builder::HnswBuildParams, HNSW},
};
use lance_linalg::distance::DistanceType;

fn ground_truth(fsl: &FixedSizeListArray, query: &[f32], k: usize) -> HashSet<u32> {
    let mut dists = vec![];
    for i in 0..fsl.len() {
        let dist = lance_linalg::distance::l2_distance(
            query,
            fsl.value(i).as_primitive::<Float32Type>().values(),
        );
        dists.push((dist, i as u32));
    }
    dists.sort_by(|a, b| a.0.partial_cmp(&b.0).unwrap());
    dists.truncate(k);
    dists.into_iter().map(|(_, i)| i).collect()
}

pub async fn create_test_vector_dataset(output: &str, num_rows: usize, dim: i32) {
    let schema = Arc::new(Schema::new(vec![Field::new(
        "vector",
        DataType::FixedSizeList(Arc::new(Field::new("item", DataType::Float32, true)), dim),
        false,
    )]));
    let mut batches = Vec::new();
    // Create a few batches
    for _ in 0..2 {
        let v_builder = Float32Builder::new();
        let mut list_builder = FixedSizeListBuilder::new(v_builder, dim);
        for _ in 0..num_rows {
            for _ in 0..dim {
                list_builder.values().append_value(rand::random::<f32>());
            }
            list_builder.append(true);
        }
        let array = Arc::new(list_builder.finish());
        let batch = RecordBatch::try_new(schema.clone(), vec![array]).unwrap();
        batches.push(batch);
    }
    let batch_reader = RecordBatchIterator::new(batches.into_iter().map(Ok), schema.clone());
    println!("Writing dataset to {}", output);
    Dataset::write(batch_reader, output, None).await.unwrap();
}

#[tokio::main]
async fn main() {
    let uri: Option<String> = None; // None means generate test data
    let column = "vector";
    let ef = 100;
    let max_edges = 30;
    let max_level = 7;

    // 1. Generate a synthetic test dataset of the specified dimensions
    let dataset = if uri.is_none() {
        println!("No uri is provided, generating test dataset...");
        let output = "test_vectors.lance";
        create_test_vector_dataset(output, 1000, 64).await;
        Dataset::open(output).await.expect("Failed to open dataset")
    } else {
        Dataset::open(uri.as_ref().unwrap())
            .await
            .expect("Failed to open dataset")
    };
    println!("Dataset schema: {:#?}", dataset.schema());

    let batches = dataset
        .scan()
        .project(&[column])
        .unwrap()
        .try_into_stream()
        .await
        .unwrap()
        .then(|batch| async move { batch.unwrap().column_by_name(column).unwrap().clone() })
        .collect::<Vec<_>>()
        .await;
    let arrs = batches.iter().map(|b| b.as_ref()).collect::<Vec<_>>();
    let fsl = concat(&arrs).unwrap().as_fixed_size_list().clone();
    println!("Loaded {:?} batches", fsl.len());

    let vector_store = Arc::new(FlatFloatStorage::new(fsl.clone(), DistanceType::L2));

    let q = fsl.value(0);
    let k = 10;
    let gt = ground_truth(&fsl, q.as_primitive::<Float32Type>().values(), k);

    for ef_construction in [15, 30, 50] {
        let now = std::time::Instant::now();
        // 2. Build a hierarchical graph structure for efficient vector search using the Lance API
        let hnsw = HNSW::index_vectors(
            vector_store.as_ref(),
            HnswBuildParams::default()
                .max_level(max_level)
                .num_edges(max_edges)
                .ef_construction(ef_construction),
        )
        .unwrap();
        let construct_time = now.elapsed().as_secs_f32();
        let now = std::time::Instant::now();
        // 3. Perform vector search and compare against the L2-distance ground truth
        let results: HashSet<u32> = hnsw
            .search_basic(q.clone(), k, ef, None, vector_store.as_ref())
            .unwrap()
            .iter()
            .map(|node| node.id)
            .collect();
        let search_time = now.elapsed().as_micros();
        println!(
            "level={}, ef_construct={}, ef={} recall={}: construct={:.3}s search={:.3} us",
            max_level,
            ef_construction,
            ef,
            results.intersection(&gt).count() as f32 / k as f32,
            construct_time,
            search_time
        );
    }
}
TITLE: Compare LanceDB benchmarks against previous version
DESCRIPTION: Provides a sequence of commands to compare the performance of the current version against the main branch. This involves saving a baseline from main and then comparing the current branch's performance against it.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_15
LANGUAGE: shell CODE:
CURRENT_BRANCH=$(git branch --show-current)
LANGUAGE: shell CODE:
git checkout main
LANGUAGE: shell CODE:
maturin develop --profile release-with-debug --features datagen
LANGUAGE: shell CODE:
pytest --benchmark-save=baseline python/benchmarks -m "not slow"
LANGUAGE: shell CODE:
COMPARE_ID=$(ls .benchmarks/*/ | tail -1 | cut -c1-4)
LANGUAGE: shell CODE:
git checkout $CURRENT_BRANCH
LANGUAGE: shell CODE:
maturin develop --profile release-with-debug --features datagen
LANGUAGE: shell CODE:
pytest --benchmark-compare=$COMPARE_ID python/benchmarks -m "not slow"
TITLE: Build Rust core format (debug) DESCRIPTION: This command compiles the Rust core format in debug mode. The debug build includes debugging information and is suitable for development and testing, though it is not optimized for performance.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/community/contributing.md#_snippet_4
LANGUAGE: bash CODE:
cargo build
TITLE: Format and lint Rust code
DESCRIPTION: These commands are used to automatically format Rust code according to community standards (cargo fmt) and to perform static analysis for potential issues (cargo clippy). This ensures code quality and consistency.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/community/contributing.md#_snippet_3
LANGUAGE: bash CODE:
cargo fmt --all
cargo clippy --all-features --tests --benches
TITLE: Run LanceDB code linters
DESCRIPTION: Executes code linters to check for style violations and potential issues. Language-specific linting can be performed with make lint-python or make lint-rust.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_5
LANGUAGE: shell CODE:
make lint
TITLE: Clean LanceDB build artifacts DESCRIPTION: Removes all generated build artifacts and temporary files from the project directory, useful for a clean rebuild.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_9
LANGUAGE: shell CODE:
make clean
TITLE: Rust: Load Tokenizer from Hugging Face Hub
DESCRIPTION: This function provides a utility to load a tokenizer from the Hugging Face Hub. It takes a model name, creates an API client, retrieves the tokenizer file from the specified repository, and constructs a Tokenizer object from it. This is a common pattern for integrating Hugging Face models.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/rust/llm_dataset_creation.md#_snippet_3
LANGUAGE: Rust CODE:
fn load_tokenizer(model_name: &str) -> Result<Tokenizer, Box<dyn Error + Send + Sync>> {
    let api = Api::new()?;
    let repo = api.repo(Repo::with_revision(
        model_name.into(),
        RepoType::Model,
        "main".into(),
    ));
    let tokenizer_path = repo.get("tokenizer.json")?;
    let tokenizer = Tokenizer::from_file(tokenizer_path)?;
    Ok(tokenizer)
}
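A quick usage sketch for this helper (the input text is hypothetical; this assumes the tokenizers crate's encode/get_ids API):

let tokenizer = load_tokenizer("gpt2")?;
// Encode a sample sentence and print the resulting token ids.
let encoding = tokenizer.encode("Lance is a columnar data format.", true)?;
println!("Token ids: {:?}", encoding.get_ids());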
TITLE: Build MacOS x86_64 Wheels
DESCRIPTION: This command builds release-mode wheels specifically for x86_64 macOS. It uses maturin to compile the project for the x86_64-apple-darwin target, storing the resulting wheels in the 'wheels' directory.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_26
LANGUAGE: Shell CODE:
maturin build --release \
--target x86_64-apple-darwin \
--out wheels
TITLE: Build and Test Lance Rust Package DESCRIPTION: These commands clone the Lance repository, navigate to the Rust directory, and then build, test, and benchmark the core Rust components of Lance.
SOURCE: https://github.com/lancedb/lance/blob/main/wiki/How-to-Build.md#_snippet_2
LANGUAGE: bash CODE:
git clone https://github.com/lancedb/lance.git
# Build rust package
cd rust
cargo build
# Run test
cargo test
# Run benchmarks
cargo bench
TITLE: Build LanceDB in development mode
DESCRIPTION: Builds the Rust native module in place using maturin. This command needs to be re-run whenever Rust code changes, but is not required for Python code modifications.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_0
LANGUAGE: shell CODE:
maturin develop
TITLE: Download Lindera Language Model
DESCRIPTION: Command-line instruction to download a specific Lindera language model (e.g., ipadic, ko-dic, unidic) for LanceDB. Note that lindera-cli must be installed beforehand, as Lindera models require compilation.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/tokenizer.md#_snippet_4
LANGUAGE: bash CODE:
python -m lance.download lindera -l [ipadic|ko-dic|unidic]
TITLE: Decorate Rust Unit Test for Tracing
DESCRIPTION: To enable tracing for a Rust unit test, decorate it with the #[lance_test_macros::test] attribute. This macro wraps any existing test attributes, allowing tracing information to be collected during test execution.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_16
LANGUAGE: Rust CODE:
#[lance_test_macros::test(tokio::test)]
async fn test() {
    ...
}
TITLE: Add Rust Toolchain Targets for Cross-Compilation DESCRIPTION: To build manylinux wheels for different Linux architectures, you must first add the corresponding Rust toolchain targets. These commands add the x86_64 and aarch64 GNU targets, enabling cross-compilation.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_22
LANGUAGE: Shell CODE:
rustup target add x86_64-unknown-linux-gnu
rustup target add aarch64-unknown-linux-gnu
TITLE: Build MacOS ARM64 Wheels
DESCRIPTION: This command builds release-mode wheels specifically for ARM64 (aarch64) macOS. It uses maturin to compile the project for the aarch64-apple-darwin target, storing the resulting wheels in the 'wheels' directory.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_25
LANGUAGE: Shell CODE:
maturin build --release \
--target aarch64-apple-darwin \
--out wheels
TITLE: Rust: WikiTextBatchReader Next Batch Logic
DESCRIPTION: This snippet shows the core logic of the next method of the WikiTextBatchReader. It attempts to build and retrieve the next Parquet reader from the list of available readers. If a reader is successfully built, it is used; otherwise, it handles errors or signals that no more readers are available.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/rust/llm_dataset_creation.md#_snippet_1
LANGUAGE: Rust CODE:
if let Some(builder) = self.parquet_readers[self.current_reader_idx].take() {
    match builder.build() {
        Ok(reader) => {
            self.current_reader = Some(Box::new(reader));
            self.current_reader_idx += 1;
            continue;
        }
        Err(e) => {
            return Some(Err(arrow::error::ArrowError::ExternalError(Box::new(e))))
        }
    }
} // closes the enclosing `if self.current_reader_idx < self.parquet_readers.len()` check

// No more readers available
return None;
TITLE: Run Rust Unit Test with Tracing Verbosity
DESCRIPTION: Execute a Rust unit test with tracing enabled by setting the LANCE_TESTING environment variable to a desired verbosity level (e.g., 'debug', 'info'). This command will generate a JSON trace file in your working directory, which can be viewed in Chrome or Perfetto.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_17
LANGUAGE: Bash CODE:
LANCE_TESTING=debug cargo test dataset::tests::test_create_dataset
TITLE: Build Linux x86_64 Manylinux Wheels
DESCRIPTION: This command builds release-mode manylinux wheels for x86_64 Linux. It utilizes maturin with zig for cross-compilation, targeting manylinux2014 compatibility, and outputs the generated wheels to the 'wheels' directory.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_23
LANGUAGE: Shell CODE:
maturin build --release --zig \
--target x86_64-unknown-linux-gnu \
--compatibility manylinux2014 \
--out wheels
TITLE: Build Linux ARM64 Manylinux Wheels
DESCRIPTION: This command builds release-mode manylinux wheels for ARM64 (aarch64) Linux. It uses maturin with zig for cross-compilation, targeting manylinux2014 compatibility, and places the output wheels in the 'wheels' directory.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_24
LANGUAGE: Shell CODE:
maturin build --release --zig \
--target aarch64-unknown-linux-gnu \
--compatibility manylinux2014 \
--out wheels
TITLE: Join Multiple Lance Datasets with SQL in Rust DataFusion
DESCRIPTION: This Rust example illustrates how to register multiple Lance datasets (e.g., 'orders' and 'customers') as separate tables in DataFusion. It then performs a SQL JOIN between these tables to combine data on a common key, demonstrating more complex query capabilities.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/integrations/datafusion.md#_snippet_1
LANGUAGE: rust CODE:
use datafusion::prelude::SessionContext;
use crate::datafusion::LanceTableProvider;

let ctx = SessionContext::new();
ctx.register_table("orders",
    Arc::new(LanceTableProvider::new(
        Arc::new(orders_dataset.clone()),
        /* with_row_id */ false,
        /* with_row_addr */ false,
    )))?;
ctx.register_table("customers",
    Arc::new(LanceTableProvider::new(
        Arc::new(customers_dataset.clone()),
        /* with_row_id */ false,
        /* with_row_addr */ false,
    )))?;
let df = ctx.sql("
    SELECT o.order_id, o.amount, c.customer_name
    FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
    LIMIT 10
").await?;
let result = df.collect().await?;
TITLE: Generate Flame Graph from Process ID DESCRIPTION: Generates a flame graph for a running process using its Process ID (PID). This command is used to capture and visualize CPU profiles, helping to identify performance bottlenecks in an application.
SOURCE: https://github.com/lancedb/lance/blob/wiki/Debug.md#_snippet_5
LANGUAGE: sh CODE:
flamegraph -p <PID>
TITLE: Clone LanceDB GitHub Repository DESCRIPTION: Instructions to clone the LanceDB project repository from GitHub to a local machine. This is the first step for setting up the development environment.
SOURCE: https://github.com/lancedb/lance/blob/main/java/README.md#_snippet_11
LANGUAGE: shell CODE:
git clone https://github.com/lancedb/lance.git
TITLE: Rust Implementation of WikiTextBatchReader
DESCRIPTION: This Rust code defines WikiTextBatchReader, a custom implementation of arrow::record_batch::RecordBatchReader. It is designed to read text data from Parquet files, tokenize it using a Tokenizer from the tokenizers crate, and transform it into Arrow RecordBatches. The process_batch method handles tokenization, limits the number of samples, and shuffles the tokenized IDs before creating the final RecordBatch.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/rust/llm_dataset_creation.md#_snippet_0
LANGUAGE: rust CODE:
use arrow::array::{Array, Int64Builder, ListBuilder, UInt32Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use arrow::record_batch::RecordBatchReader;
use futures::StreamExt;
use hf_hub::{api::sync::Api, Repo, RepoType};
use lance::dataset::WriteParams;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use rand::seq::SliceRandom;
use rand::SeedableRng;
use std::error::Error;
use std::fs::File;
use std::io::Write;
use std::sync::Arc;
use tempfile::NamedTempFile;
use tokenizers::Tokenizer;

// Implement a custom stream batch reader
struct WikiTextBatchReader {
    schema: Arc<Schema>,
    parquet_readers: Vec<Option<ParquetRecordBatchReaderBuilder<File>>>,
    current_reader_idx: usize,
    current_reader: Option<Box<dyn RecordBatchReader + Send>>,
    tokenizer: Tokenizer,
    num_samples: u64,
    cur_samples_cnt: u64,
}

impl WikiTextBatchReader {
    fn new(
        parquet_readers: Vec<ParquetRecordBatchReaderBuilder<File>>,
        tokenizer: Tokenizer,
        num_samples: Option<u64>,
    ) -> Result<Self, Box<dyn Error + Send + Sync>> {
        let schema = Arc::new(Schema::new(vec![Field::new(
            "input_ids",
            DataType::List(Arc::new(Field::new("item", DataType::Int64, true))),
            false,
        )]));

        Ok(Self {
            schema,
            parquet_readers: parquet_readers.into_iter().map(Some).collect(),
            current_reader_idx: 0,
            current_reader: None,
            tokenizer,
            num_samples: num_samples.unwrap_or(100_000),
            cur_samples_cnt: 0,
        })
    }

    fn process_batch(
        &mut self,
        input_batch: &RecordBatch,
    ) -> Result<RecordBatch, arrow::error::ArrowError> {
        let num_rows = input_batch.num_rows();
        let mut token_builder = ListBuilder::new(Int64Builder::with_capacity(num_rows * 1024)); // Pre-allocate space
        let mut should_break = false;

        let column = input_batch.column_by_name("text").unwrap();
        let string_array = column
            .as_any()
            .downcast_ref::<arrow::array::StringArray>()
            .unwrap();

        for i in 0..num_rows {
            if self.cur_samples_cnt >= self.num_samples {
                should_break = true;
                break;
            }
            if !Array::is_null(string_array, i) {
                let text = string_array.value(i);
                // Split paragraph into lines
                for line in text.split('\n') {
                    if let Ok(encoding) = self.tokenizer.encode(line, true) {
                        let tb_values = token_builder.values();
                        for &id in encoding.get_ids() {
                            tb_values.append_value(id as i64);
                        }
                        token_builder.append(true);
                        self.cur_samples_cnt += 1;
                        if self.cur_samples_cnt % 5000 == 0 {
                            println!("Processed {} rows", self.cur_samples_cnt);
                        }
                        if self.cur_samples_cnt >= self.num_samples {
                            should_break = true;
                            break;
                        }
                    }
                }
            }
        }

        // Create array and shuffle it
        let input_ids_array = token_builder.finish();
        // Create shuffled array by randomly sampling indices
        let mut rng = rand::rngs::StdRng::seed_from_u64(1337);
        let len = input_ids_array.len();
        let mut indices: Vec<u32> = (0..len as u32).collect();
        indices.shuffle(&mut rng);

        // Take values in shuffled order
        let indices_array = UInt32Array::from(indices);
        let shuffled = arrow::compute::take(&input_ids_array, &indices_array, None)?;

        let batch = RecordBatch::try_new(self.schema.clone(), vec![Arc::new(shuffled)]);

        if should_break {
            println!("Stop at {} rows", self.cur_samples_cnt);
            self.parquet_readers.clear();
            self.current_reader = None;
        }

        batch
    }
}

impl RecordBatchReader for WikiTextBatchReader {
    fn schema(&self) -> Arc<Schema> {
        self.schema.clone()
    }
}

impl Iterator for WikiTextBatchReader {
    type Item = Result<RecordBatch, arrow::error::ArrowError>;

    fn next(&mut self) -> Option<Self::Item> {
        loop {
            // If we have a current reader, try to get next batch
            if let Some(reader) = &mut self.current_reader {
                if let Some(batch_result) = reader.next() {
                    return Some(batch_result.and_then(|batch| self.process_batch(&batch)));
                }
            }

            // If no current reader or current reader is exhausted, try to get next reader
            if self.current_reader_idx < self.parquet_readers.len() {
                if let Some(builder) = self.parquet_readers[self.current_reader_idx].take() {
                    match builder.build() {
                        Ok(reader) => {
                            self.current_reader = Some(Box::new(reader));
                            self.current_reader_idx += 1;
                            continue;
                        }
                        Err(e) => {
                            return Some(Err(arrow::error::ArrowError::ExternalError(Box::new(e))))
                        }
                    }
                }
            }

            // No more readers available
            return None;
        }
    }
}
TITLE: Set DYLD_LIBRARY_PATH for Lance Python Debugging in LLDB
DESCRIPTION: Configures the DYLD_LIBRARY_PATH environment variable specifically for debugging Lance Python projects within LLDB. This ensures that the dynamic linker can find the necessary shared libraries located in the third-party distribution directory.
SOURCE: https://github.com/lancedb/lance/blob/wiki/Debug.md#_snippet_1
LANGUAGE: lldb CODE:
# /path/to/lance/python/.lldbinit
env DYLD_LIBRARY_PATH=/path/to/thirdparty/dist/lib:${DYLD_LIBRARY_PATH}
TITLE: Download and extract MeCab Ipadic model DESCRIPTION: This snippet downloads the gzipped tarball of the MeCab Ipadic model from GitHub and then extracts its contents using tar. This is the first step in preparing the dictionary for building.
SOURCE: https://github.com/lancedb/lance/blob/main/python/python/tests/models/lindera/README.md#_snippet_0
LANGUAGE: bash CODE:
curl -L -o mecab-ipadic-2.7.0-20070801.tar.gz "https://github.com/lindera-morphology/mecab-ipadic/archive/refs/tags/2.7.0-20070801.tar.gz"
tar xvf mecab-ipadic-2.7.0-20070801.tar.gz
TITLE: Build user dictionary with Lindera DESCRIPTION: This command demonstrates how to build a custom user dictionary using 'lindera build'. It takes a CSV file as input and creates a new user dictionary, which can be used to extend the base language model.
SOURCE: https://github.com/lancedb/lance/blob/main/python/python/tests/models/lindera/README.md#_snippet_2
LANGUAGE: bash CODE:
lindera build --build-user-dictionary --dictionary-kind=ipadic user_dict/userdict.csv user_dict2
TITLE: Download Jieba Language Model DESCRIPTION: Command-line instruction to download the Jieba language model for use with LanceDB. The model will be automatically stored in the default Jieba model directory within the configured language model home.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/tokenizer.md#_snippet_1
LANGUAGE: bash CODE:
python -m lance.download jieba
TITLE: Read and Inspect Lance Dataset in Rust
DESCRIPTION: This Rust function read_dataset shows how to open an existing Lance dataset from a given path. It uses a scanner to create a batch stream and then iterates through each RecordBatch, printing its number of rows and columns, its schema, and the entire batch content.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/rust/write_read_dataset.md#_snippet_1
LANGUAGE: Rust CODE:
// Reads dataset from the given path and prints batch size, schema for all record batches. Also extracts and prints a slice from the first batch
async fn read_dataset(data_path: &str) {
    let dataset = Dataset::open(data_path).await.unwrap();
    let scanner = dataset.scan();
    let mut batch_stream = scanner.try_into_stream().await.unwrap().map(|b| b.unwrap());
    while let Some(batch) = batch_stream.next().await {
        println!("Batch size: {}, {}", batch.num_rows(), batch.num_columns()); // print size of batch
        println!("Schema: {:?}", batch.schema()); // print schema of record batch
        println!("Batch: {:?}", batch); // print the entire record batch (schema and data)
    }
} // End read dataset
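The leading comment mentions printing a slice from the first batch, a step this excerpt does not show. A minimal sketch of that step (hypothetical, using arrow's RecordBatch::slice) could be added inside the loop for the first batch:

// Zero-copy view of the first two rows of a batch (illustrative only).
let slice = batch.slice(0, 2);
println!("Slice: {:?}", slice);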
TITLE: Create a Lance Dataset from Arrow RecordBatches in Rust DESCRIPTION: Demonstrates how to write a collection of Arrow RecordBatches and an Arrow Schema into a new Lance Dataset. It uses default write parameters and an iterator for the batches.
SOURCE: https://github.com/lancedb/lance/blob/main/rust/lance/README.md#_snippet_1
LANGUAGE: rust CODE:
use lance::{dataset::WriteParams, Dataset};

let write_params = WriteParams::default();
let mut reader = RecordBatchIterator::new(
    batches.into_iter().map(Ok),
    schema
);
Dataset::write(reader, &uri, Some(write_params)).await.unwrap();
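The snippet assumes batches, schema, and uri are already in scope. One plausible setup, with illustrative names and values (the single Int32 column and the path are assumptions, not part of the original snippet):

use std::sync::Arc;
use arrow::array::Int32Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::{RecordBatch, RecordBatchIterator};

// Hypothetical single-column input for the snippet above.
let schema = Arc::new(Schema::new(vec![Field::new("id", DataType::Int32, false)]));
let batch = RecordBatch::try_new(
    schema.clone(),
    vec![Arc::new(Int32Array::from(vec![1, 2, 3]))],
).unwrap();
let batches = vec![batch];
let uri = String::from("/tmp/example.lance");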
TITLE: Build Ipadic language model with Lindera DESCRIPTION: This command uses the 'lindera build' tool to compile the Ipadic dictionary. It specifies the dictionary kind as 'ipadic' and points to the extracted model directory to create the main dictionary.
SOURCE: https://github.com/lancedb/lance/blob/main/python/python/tests/models/lindera/README.md#_snippet_1
LANGUAGE: bash CODE:
lindera build --dictionary-kind=ipadic mecab-ipadic-2.7.0-20070801 main
TITLE: Write Lance Dataset in Rust
DESCRIPTION: This Rust function write_dataset demonstrates how to create and write a Lance dataset to a specified path. It defines a schema with UInt32 fields, creates a RecordBatch with sample data, and uses WriteParams to set the write mode to Overwrite before writing the dataset to disk.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/rust/write_read_dataset.md#_snippet_0
LANGUAGE: Rust CODE:
// Writes sample dataset to the given path
async fn write_dataset(data_path: &str) {
    // Define new schema
    let schema = Arc::new(Schema::new(vec![
        Field::new("key", DataType::UInt32, false),
        Field::new("value", DataType::UInt32, false),
    ]));

    // Create new record batches
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![
            Arc::new(UInt32Array::from(vec![1, 2, 3, 4, 5, 6])),
            Arc::new(UInt32Array::from(vec![6, 7, 8, 9, 10, 11])),
        ],
    )
    .unwrap();

    let batches = RecordBatchIterator::new([Ok(batch)], schema.clone());

    // Define write parameters (e.g. overwrite dataset)
    let write_params = WriteParams {
        mode: WriteMode::Overwrite,
        ..Default::default()
    };
    Dataset::write(batches, data_path, Some(write_params))
        .await
        .unwrap();
} // End write dataset
TITLE: Build LanceDB Rust JNI Module DESCRIPTION: Specifies the command to build only the Rust-based JNI (Java Native Interface) module of LanceDB. This is useful for developers focusing on the native components without rebuilding the entire Java project.
SOURCE: https://github.com/lancedb/lance/blob/main/java/README.md#_snippet_10
LANGUAGE: shell CODE:
cargo build
TITLE: Read a Lance Dataset and Collect RecordBatches in Rust DESCRIPTION: Opens an existing Lance Dataset from a specified path, scans its content, and collects all resulting RecordBatches into a vector. Error handling is included.
SOURCE: https://github.com/lancedb/lance/blob/main/rust/lance/README.md#_snippet_2
LANGUAGE: rust CODE:
let dataset = Dataset::open(path).await.unwrap();
let mut scanner = dataset.scan();
let batches: Vec<RecordBatch> = scanner
    .try_into_stream()
    .await
    .unwrap()
    .map(|b| b.unwrap())
    .collect::<Vec<RecordBatch>>()
    .await;
TITLE: Create a Vector Index on a Lance Dataset in Rust DESCRIPTION: Demonstrates how to create a vector index on a specified column (e.g., 'embeddings') within a Lance Dataset. It configures vector index parameters like the number of partitions and sub-vectors, noting potential alignment requirements.
SOURCE: https://github.com/lancedb/lance/blob/main/rust/lance/README.md#_snippet_4
LANGUAGE: rust CODE:
use ::lance::index::vector::VectorIndexParams;

let mut params = VectorIndexParams::default();
params.num_partitions = 256;
params.num_sub_vectors = 16;

// this will Err if list_size(embeddings) / num_sub_vectors does not meet simd alignment
dataset.create_index(&["embeddings"], IndexType::Vector, None, &params, true).await;
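Once the index is built, nearest-neighbor queries typically go through the dataset scanner. A rough sketch, assuming lance's Scanner::nearest API (treat the exact signature as an assumption) and a query vector whose dimension matches the embeddings column:

use arrow::array::Float32Array;
use futures::StreamExt;

// Hypothetical query vector; the dimension (128 here) must match the column.
let q = Float32Array::from(vec![0.0_f32; 128]);
let mut scanner = dataset.scan();
scanner.nearest("embeddings", &q, 10).unwrap(); // top-10 neighbors
let results: Vec<RecordBatch> = scanner
    .try_into_stream()
    .await
    .unwrap()
    .map(|b| b.unwrap())
    .collect::<Vec<RecordBatch>>()
    .await;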
TITLE: Retrieve Specific Records from a Lance Dataset in Rust DESCRIPTION: Retrieves specific records from a Lance Dataset based on their indices and a projection. The result is a RecordBatch containing the requested data.
SOURCE: https://github.com/lancedb/lance/blob/main/rust/lance/README.md#_snippet_3
LANGUAGE: rust CODE:
let values: Result<RecordBatch> = dataset.take(&[200, 199, 39, 40, 100], &projection).await;