Based on your request, here is a copy of the provided code snippets, filtered to include only those relevant to the Rust ecosystem. This includes snippets written in Rust, shell commands for building or testing Rust code (e.g., using `cargo` or `maturin`), and configurations for native development tools like `lldb`.

========================
CODE SNIPPETS
========================

TITLE: Perform Python development installation
DESCRIPTION: These commands navigate into the `python` directory and perform a development installation of the Lance Python bindings. This allows developers to import and test changes to the Python wrapper directly.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/community/contributing.md#_snippet_1
LANGUAGE: bash
CODE:
```
cd python
maturin develop
```

----------------------------------------

TITLE: Install Python bindings build tool
DESCRIPTION: This command installs `maturin`, a tool essential for building Python packages that integrate with Rust code. It's a prerequisite for setting up the Python development environment for Lance.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/community/contributing.md#_snippet_0
LANGUAGE: bash
CODE:
```
pip install maturin
```

----------------------------------------

TITLE: Install Linux Perf Tools and Configure Kernel Parameters
DESCRIPTION: Installs necessary Linux performance tools (`perf`) on Ubuntu systems and configures the `perf_event_paranoid` kernel parameter. This setup is crucial for allowing non-root users to collect performance data using tools like `perf` and `flamegraph`.
SOURCE: https://github.com/lancedb/lance/blob/__wiki__/Debug.md#_snippet_4
LANGUAGE: sh
CODE:
```
sudo apt install linux-tools-common linux-tools-generic linux-tools-`uname -r`
sudo sh -c "echo -1 > /proc/sys/kernel/perf_event_paranoid"
```

----------------------------------------

TITLE: Run Rust unit tests
DESCRIPTION: This command executes the unit tests for the Rust core format. Running these tests verifies the correctness of the Rust implementation.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/community/contributing.md#_snippet_6
LANGUAGE: bash
CODE:
```
cargo test
```

----------------------------------------

TITLE: Profile a LanceDB benchmark using flamegraph
DESCRIPTION: Generates a flamegraph for a specific benchmark using `cargo-flamegraph`, aiding in performance analysis. It's recommended to run benchmarks once beforehand to avoid setup time being captured in the profile.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_14
LANGUAGE: shell
CODE:
```
flamegraph -F 100 --no-inline -- $(which python) \
    -m pytest python/benchmarks \
    --benchmark-min-time=2 \
    -k test_ivf_pq_index_search
```

----------------------------------------

TITLE: Install Flamegraph Tool
DESCRIPTION: Installs the `flamegraph` profiling tool using Cargo, Rust's package manager. This tool is essential for visualizing CPU usage and call stacks as flame graphs for performance analysis.
SOURCE: https://github.com/lancedb/lance/blob/__wiki__/Debug.md#_snippet_3
LANGUAGE: sh
CODE:
```
cargo install flamegraph
```

----------------------------------------

TITLE: Install Lance Build Dependencies on Ubuntu
DESCRIPTION: This command installs necessary system-level dependencies for building Lance on Ubuntu 22.04, including protobuf, SSL development libraries, and general build tools.
SOURCE: https://github.com/lancedb/lance/blob/main/__wiki__/How-to-Build.md#_snippet_0
LANGUAGE: bash
CODE:
```
sudo apt install protobuf-compiler libssl-dev build-essential pkg-config gfortran
```

----------------------------------------

TITLE: Build Rust core format (release)
DESCRIPTION: This command compiles the Rust core format in release mode. The release build is optimized for performance and is suitable for production deployments or benchmarking.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/community/contributing.md#_snippet_5
LANGUAGE: bash
CODE:
```
cargo build -r
```

----------------------------------------

TITLE: Debug Python Script with LLDB
DESCRIPTION: Demonstrates how to start an LLDB debugging session for a Python script. It involves launching LLDB with the Python interpreter from a virtual environment and then running the target script within the LLDB prompt.
SOURCE: https://github.com/lancedb/lance/blob/__wiki__/Debug.md#_snippet_2
LANGUAGE: sh
CODE:
```
$ lldb ./venv/bin/python
(lldb) r script.py
```

----------------------------------------

TITLE: Install Lance Build Dependencies on Mac
DESCRIPTION: This command installs the protobuf compiler using Homebrew, a required dependency for building Lance on macOS.
SOURCE: https://github.com/lancedb/lance/blob/main/__wiki__/How-to-Build.md#_snippet_1
LANGUAGE: bash
CODE:
```
brew install protobuf
```

----------------------------------------

TITLE: Configure LLDB Initialization Settings
DESCRIPTION: Sets up basic LLDB initialization settings in the `~/.lldbinit` file. This includes configuring the number of source code lines to display before and after a stop, and enabling the loading of `.lldbinit` files from the current working directory.
SOURCE: https://github.com/lancedb/lance/blob/__wiki__/Debug.md#_snippet_0
LANGUAGE: lldb
CODE:
```
# ~/.lldbinit
settings set stop-line-count-before 15
settings set stop-line-count-after 15
settings set target.load-cwd-lldbinit true
```

----------------------------------------

TITLE: Complete Lance Dataset Write and Read Example in Rust
DESCRIPTION: This Rust `main` function provides a complete example demonstrating the usage of `write_dataset` and `read_dataset` functions. It sets up the necessary `arrow` and `lance` imports, defines a temporary data path, and orchestrates the writing and subsequent reading of a Lance dataset.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/rust/write_read_dataset.md#_snippet_2
LANGUAGE: Rust
CODE:
```
use arrow::array::UInt32Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::{RecordBatch, RecordBatchIterator};
use futures::StreamExt;
use lance::dataset::{WriteMode, WriteParams};
use lance::Dataset;
use std::sync::Arc;

#[tokio::main]
async fn main() {
    let data_path: &str = "./temp_data.lance";

    write_dataset(data_path).await;
    read_dataset(data_path).await;
}
```

----------------------------------------

TITLE: Rust: Main Workflow for WikiText to LanceDB Ingestion
DESCRIPTION: This comprehensive example demonstrates the full data ingestion pipeline in Rust. It initializes a Tokio runtime, loads a tokenizer, sets up the Hugging Face API to download WikiText Parquet files, processes them into a `WikiTextBatchReader`, and finally writes the data to a Lance dataset. It also includes verification of the created dataset.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/rust/llm_dataset_creation.md#_snippet_2
LANGUAGE: Rust
CODE:
```
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let rt = tokio::runtime::Runtime::new()?;
    rt.block_on(async {
        // Load tokenizer
        let tokenizer = load_tokenizer("gpt2")?;

        // Set up Hugging Face API
        // Download from https://huggingface.co/datasets/Salesforce/wikitext/tree/main/wikitext-103-raw-v1
        let api = Api::new()?;
        let repo = api.repo(Repo::with_revision(
            "Salesforce/wikitext".into(),
            RepoType::Dataset,
            "main".into(),
        ));

        // Define the parquet files we want to download
        let train_files = vec![
            "wikitext-103-raw-v1/train-00000-of-00002.parquet",
            "wikitext-103-raw-v1/train-00001-of-00002.parquet",
        ];

        let mut parquet_readers = Vec::new();
        for file in &train_files {
            println!("Downloading file: {}", file);
            let file_path = repo.get(file)?;
            let data = std::fs::read(file_path)?;

            // Create a temporary file in the system temp directory and write the downloaded data to it
            let mut temp_file = NamedTempFile::new()?;
            temp_file.write_all(&data)?;

            // Create the parquet reader builder with a larger batch size
            let builder = ParquetRecordBatchReaderBuilder::try_new(temp_file.into_file())?
                .with_batch_size(8192); // Increase batch size for better performance
            parquet_readers.push(builder);
        }

        if parquet_readers.is_empty() {
            println!("No parquet files found to process.");
            return Ok(());
        }

        // Create batch reader
        let num_samples: u64 = 500_000;
        let batch_reader = WikiTextBatchReader::new(parquet_readers, tokenizer, Some(num_samples))?;

        // Save as Lance dataset
        println!("Writing to Lance dataset...");
        let lance_dataset_path = "rust_wikitext_lance_dataset.lance";
        let write_params = WriteParams::default();
        lance::Dataset::write(batch_reader, lance_dataset_path, Some(write_params)).await?;

        // Verify the dataset
        let ds = lance::Dataset::open(lance_dataset_path).await?;
        let scanner = ds.scan();
        let mut stream = scanner.try_into_stream().await?;
        let mut total_rows = 0;
        while let Some(batch_result) = stream.next().await {
            let batch = batch_result?;
            total_rows += batch.num_rows();
        }
        println!(
            "Lance dataset created successfully with {} rows",
            total_rows
        );
        println!("Dataset location: {}", lance_dataset_path);

        Ok(())
    })
}
```

----------------------------------------

TITLE: Build and Test Pylance Python Package
DESCRIPTION: These commands set up a Python virtual environment, install maturin for Rust-Python binding, build the Pylance package in debug mode, and then run its associated tests.
SOURCE: https://github.com/lancedb/lance/blob/main/__wiki__/How-to-Build.md#_snippet_3
LANGUAGE: bash
CODE:
```
cd python
python3 -m venv venv
source venv/bin/activate
pip install maturin

# Build debug build
maturin develop --extras tests

# Run pytest
pytest python/tests/
```

----------------------------------------

TITLE: Install Lance using Cargo
DESCRIPTION: Installs the Lance Rust library as a command-line tool using the Cargo package manager.
SOURCE: https://github.com/lancedb/lance/blob/main/rust/lance/README.md#_snippet_0
LANGUAGE: shell
CODE:
```
cargo install lance
```

----------------------------------------

TITLE: Build pylance in release mode for benchmarks
DESCRIPTION: Builds the `pylance` module in release mode with debug symbols, enabling benchmark execution and profiling. It includes benchmark-specific extras and features for data generation.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_10
LANGUAGE: shell
CODE:
```
maturin develop --profile release-with-debug --extras benchmarks --features datagen
```

----------------------------------------

TITLE: Query Lance Dataset with Simple SQL in Rust DataFusion
DESCRIPTION: This Rust example demonstrates how to register a Lance dataset as a table in DataFusion using `LanceTableProvider` and execute a simple SQL `SELECT` query to retrieve the first 10 rows. It shows the basic setup for integrating Lance with DataFusion.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/integrations/datafusion.md#_snippet_0
LANGUAGE: rust
CODE:
```
use datafusion::prelude::SessionContext;
use crate::datafusion::LanceTableProvider;

let ctx = SessionContext::new();
ctx.register_table("dataset", Arc::new(LanceTableProvider::new(
    Arc::new(dataset.clone()),
    /* with_row_id */ false,
    /* with_row_addr */ false,
)))?;
let df = ctx.sql("SELECT * FROM dataset LIMIT 10").await?;
let result = df.collect().await?;
```

----------------------------------------

TITLE: Run LanceDB code formatters
DESCRIPTION: Applies code formatting rules to the entire project. Specific commands like `make format-python` or `cargo fmt` can be used for language-specific formatting.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_4
LANGUAGE: shell
CODE:
```
make format
```

----------------------------------------

TITLE: Build and Search HNSW Index for Vector Similarity in Rust
DESCRIPTION: This Rust code provides a complete example for vector similarity search. It defines a `ground_truth` function for L2 distance calculation, `create_test_vector_dataset` to generate synthetic fixed-size list vectors, and a `main` function that orchestrates the process. The `main` function generates or loads a dataset, builds an HNSW index using `lance_index::vector::hnsw`, and then performs vector searches, measuring construction and search times, and calculating recall against ground truth.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/rust/hnsw.md#_snippet_0
LANGUAGE: Rust
CODE:
```
use std::collections::HashSet;
use std::sync::Arc;

use arrow::array::{types::Float32Type, Array, FixedSizeListArray};
use arrow::array::{AsArray, FixedSizeListBuilder, Float32Builder};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use arrow::record_batch::RecordBatchIterator;
use arrow_select::concat::concat;
use futures::stream::StreamExt;
use lance::Dataset;
use lance_index::vector::v3::subindex::IvfSubIndex;
use lance_index::vector::{
    flat::storage::FlatFloatStorage,
    hnsw::{builder::HnswBuildParams, HNSW},
};
use lance_linalg::distance::DistanceType;

fn ground_truth(fsl: &FixedSizeListArray, query: &[f32], k: usize) -> HashSet<u32> {
    let mut dists = vec![];
    for i in 0..fsl.len() {
        let dist = lance_linalg::distance::l2_distance(
            query,
            fsl.value(i).as_primitive::<Float32Type>().values(),
        );
        dists.push((dist, i as u32));
    }
    dists.sort_by(|a, b| a.0.partial_cmp(&b.0).unwrap());
    dists.truncate(k);
    dists.into_iter().map(|(_, i)| i).collect()
}

pub async fn create_test_vector_dataset(output: &str, num_rows: usize, dim: i32) {
    let schema = Arc::new(Schema::new(vec![Field::new(
        "vector",
        DataType::FixedSizeList(Arc::new(Field::new("item", DataType::Float32, true)), dim),
        false,
    )]));
    let mut batches = Vec::new();
    // Create a few batches
    for _ in 0..2 {
        let v_builder = Float32Builder::new();
        let mut list_builder = FixedSizeListBuilder::new(v_builder, dim);
        for _ in 0..num_rows {
            for _ in 0..dim {
                list_builder.values().append_value(rand::random::<f32>());
            }
            list_builder.append(true);
        }
        let array = Arc::new(list_builder.finish());
        let batch = RecordBatch::try_new(schema.clone(), vec![array]).unwrap();
        batches.push(batch);
    }
    let batch_reader = RecordBatchIterator::new(batches.into_iter().map(Ok), schema.clone());
    println!("Writing dataset to {}", output);
    Dataset::write(batch_reader, output, None).await.unwrap();
}

#[tokio::main]
async fn main() {
    let uri: Option<String> = None; // None means generate test data
    let column = "vector";
    let ef = 100;
    let max_edges = 30;
    let max_level = 7;

    // 1. Generate a synthetic test data of specified dimensions
    let dataset = if uri.is_none() {
        println!("No uri is provided, generating test dataset...");
        let output = "test_vectors.lance";
        create_test_vector_dataset(output, 1000, 64).await;
        Dataset::open(output).await.expect("Failed to open dataset")
    } else {
        Dataset::open(uri.as_ref().unwrap())
            .await
            .expect("Failed to open dataset")
    };
    println!("Dataset schema: {:#?}", dataset.schema());

    let batches = dataset
        .scan()
        .project(&[column])
        .unwrap()
        .try_into_stream()
        .await
        .unwrap()
        .then(|batch| async move { batch.unwrap().column_by_name(column).unwrap().clone() })
        .collect::<Vec<_>>()
        .await;
    let arrs = batches.iter().map(|b| b.as_ref()).collect::<Vec<_>>();
    let fsl = concat(&arrs).unwrap().as_fixed_size_list().clone();
    println!("Loaded {:?} batches", fsl.len());

    let vector_store = Arc::new(FlatFloatStorage::new(fsl.clone(), DistanceType::L2));

    let q = fsl.value(0);
    let k = 10;
    let gt = ground_truth(&fsl, q.as_primitive::<Float32Type>().values(), k);

    for ef_construction in [15, 30, 50] {
        let now = std::time::Instant::now();
        // 2. Build a hierarchical graph structure for efficient vector search using Lance API
        let hnsw = HNSW::index_vectors(
            vector_store.as_ref(),
            HnswBuildParams::default()
                .max_level(max_level)
                .num_edges(max_edges)
                .ef_construction(ef_construction),
        )
        .unwrap();
        let construct_time = now.elapsed().as_secs_f32();
        let now = std::time::Instant::now();
        // 3. Perform vector search with different parameters and compute the ground truth using L2 distance search
        let results: HashSet<u32> = hnsw
            .search_basic(q.clone(), k, ef, None, vector_store.as_ref())
            .unwrap()
            .iter()
            .map(|node| node.id)
            .collect();
        let search_time = now.elapsed().as_micros();
        println!(
            "level={}, ef_construct={}, ef={} recall={}: construct={:.3}s search={:.3} us",
            max_level,
            ef_construction,
            ef,
            results.intersection(&gt).count() as f32 / k as f32,
            construct_time,
            search_time
        );
    }
}
```

----------------------------------------

TITLE: Compare LanceDB benchmarks against previous version
DESCRIPTION: Provides a sequence of commands to compare the performance of the current version against the `main` branch. This involves saving a baseline from `main` and then comparing the current branch's performance against it.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_15
LANGUAGE: shell
CODE:
```
CURRENT_BRANCH=$(git branch --show-current)
```

LANGUAGE: shell
CODE:
```
git checkout main
```

LANGUAGE: shell
CODE:
```
maturin develop --profile release-with-debug --features datagen
```

LANGUAGE: shell
CODE:
```
pytest --benchmark-save=baseline python/benchmarks -m "not slow"
```

LANGUAGE: shell
CODE:
```
COMPARE_ID=$(ls .benchmarks/*/ | tail -1 | cut -c1-4)
```

LANGUAGE: shell
CODE:
```
git checkout $CURRENT_BRANCH
```

LANGUAGE: shell
CODE:
```
maturin develop --profile release-with-debug --features datagen
```

LANGUAGE: shell
CODE:
```
pytest --benchmark-compare=$COMPARE_ID python/benchmarks -m "not slow"
```

----------------------------------------

TITLE: Build Rust core format (debug)
DESCRIPTION: This command compiles the Rust core format in debug mode. The debug build includes debugging information and is suitable for development and testing, though it is not optimized for performance.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/community/contributing.md#_snippet_4
LANGUAGE: bash
CODE:
```
cargo build
```

----------------------------------------

TITLE: Format and lint Rust code
DESCRIPTION: These commands are used to automatically format Rust code according to community standards (`cargo fmt`) and to perform static analysis for potential issues (`cargo clippy`). This ensures code quality and consistency.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/community/contributing.md#_snippet_3
LANGUAGE: bash
CODE:
```
cargo fmt --all
cargo clippy --all-features --tests --benches
```

----------------------------------------

TITLE: Run LanceDB code linters
DESCRIPTION: Executes code linters to check for style violations and potential issues. Language-specific linting can be performed with `make lint-python` or `make lint-rust`.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_5
LANGUAGE: shell
CODE:
```
make lint
```

----------------------------------------

TITLE: Clean LanceDB build artifacts
DESCRIPTION: Removes all generated build artifacts and temporary files from the project directory, useful for a clean rebuild.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_9
LANGUAGE: shell
CODE:
```
make clean
```

----------------------------------------

TITLE: Rust: Load Tokenizer from Hugging Face Hub
DESCRIPTION: This function provides a utility to load a tokenizer from the Hugging Face Hub. It takes a model name, creates an API client, retrieves the tokenizer file from the specified repository, and constructs a `Tokenizer` object from it. This is a common pattern for integrating Hugging Face models.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/rust/llm_dataset_creation.md#_snippet_3
LANGUAGE: Rust
CODE:
```
fn load_tokenizer(model_name: &str) -> Result<Tokenizer, Box<dyn std::error::Error>> {
    let api = Api::new()?;
    let repo = api.repo(Repo::with_revision(
        model_name.into(),
        RepoType::Model,
        "main".into(),
    ));
    let tokenizer_path = repo.get("tokenizer.json")?;
    let tokenizer = Tokenizer::from_file(tokenizer_path)?;
    Ok(tokenizer)
}
```

----------------------------------------

TITLE: Build MacOS x86_64 Wheels
DESCRIPTION: This command builds release-mode wheels specifically for x86_64 MacOS. It uses `maturin` to compile the project for the `x86_64-apple-darwin` target, storing the resulting wheels in the 'wheels' directory.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_26
LANGUAGE: Shell
CODE:
```
maturin build --release \
    --target x86_64-apple-darwin \
    --out wheels
```

----------------------------------------

TITLE: Build and Test Lance Rust Package
DESCRIPTION: These commands clone the Lance repository, navigate to the Rust directory, and then build, test, and benchmark the core Rust components of Lance.
SOURCE: https://github.com/lancedb/lance/blob/main/__wiki__/How-to-Build.md#_snippet_2
LANGUAGE: bash
CODE:
```
git clone https://github.com/lancedb/lance.git

# Build rust package
cd rust
cargo build

# Run test
cargo test

# Run benchmarks
cargo bench
```

----------------------------------------

TITLE: Build LanceDB in development mode
DESCRIPTION: Builds the Rust native module in place using `maturin`. This command needs to be re-run whenever Rust code changes, but is not required for Python code modifications.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_0
LANGUAGE: shell
CODE:
```
maturin develop
```

----------------------------------------

TITLE: Download Lindera Language Model
DESCRIPTION: Command-line instruction to download a specific Lindera language model (e.g., ipadic, ko-dic, unidic) for LanceDB. Note that `lindera-cli` must be installed beforehand as Lindera models require compilation.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/tokenizer.md#_snippet_4
LANGUAGE: bash
CODE:
```
python -m lance.download lindera -l [ipadic|ko-dic|unidic]
```

----------------------------------------

TITLE: Decorate Rust Unit Test for Tracing
DESCRIPTION: To enable tracing for a Rust unit test, decorate it with the `#[lance_test_macros::test]` attribute. This macro wraps any existing test attributes, allowing tracing information to be collected during test execution.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_16
LANGUAGE: Rust
CODE:
```
#[lance_test_macros::test(tokio::test)]
async fn test() {
    ...
}
```

----------------------------------------

TITLE: Add Rust Toolchain Targets for Cross-Compilation
DESCRIPTION: To build manylinux wheels for different Linux architectures, you must first add the corresponding Rust toolchain targets. These commands add the x86_64 and aarch64 GNU targets, enabling cross-compilation.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_22
LANGUAGE: Shell
CODE:
```
rustup target add x86_64-unknown-linux-gnu
rustup target add aarch64-unknown-linux-gnu
```

----------------------------------------

TITLE: Build MacOS ARM64 Wheels
DESCRIPTION: This command builds release-mode wheels specifically for ARM64 (aarch64) MacOS. It uses `maturin` to compile the project for the `aarch64-apple-darwin` target, storing the resulting wheels in the 'wheels' directory.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_25
LANGUAGE: Shell
CODE:
```
maturin build --release \
    --target aarch64-apple-darwin \
    --out wheels
```

----------------------------------------

TITLE: Rust: WikiTextBatchReader Next Batch Logic
DESCRIPTION: This snippet shows the core logic for the `next` method of the `WikiTextBatchReader`. It attempts to build and retrieve the next Parquet reader from a list of available readers. If a reader is successfully built, it's used; otherwise, it handles errors or indicates that no more readers are available.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/rust/llm_dataset_creation.md#_snippet_1
LANGUAGE: Rust
CODE:
```
            if let Some(builder) = self.parquet_readers[self.current_reader_idx].take() {
                match builder.build() {
                    Ok(reader) => {
                        self.current_reader = Some(Box::new(reader));
                        self.current_reader_idx += 1;
                        continue;
                    }
                    Err(e) => {
                        return Some(Err(arrow::error::ArrowError::ExternalError(Box::new(e))))
                    }
                }
            }

            // No more readers available
            return None;
        }
    }
```

----------------------------------------

TITLE: Run Rust Unit Test with Tracing Verbosity
DESCRIPTION: Execute a Rust unit test with tracing enabled by setting the `LANCE_TESTING` environment variable to a desired verbosity level (e.g., 'debug', 'info'). This command will generate a JSON trace file in your working directory, which can be viewed in Chrome or Perfetto.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_17
LANGUAGE: Bash
CODE:
```
LANCE_TESTING=debug cargo test dataset::tests::test_create_dataset
```

----------------------------------------

TITLE: Build Linux x86_64 Manylinux Wheels
DESCRIPTION: This command builds release-mode manylinux wheels for x86_64 Linux. It utilizes `maturin` with `zig` for cross-compilation, targeting `manylinux2014` compatibility, and outputs the generated wheels to the 'wheels' directory.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_23
LANGUAGE: Shell
CODE:
```
maturin build --release --zig \
    --target x86_64-unknown-linux-gnu \
    --compatibility manylinux2014 \
    --out wheels
```

----------------------------------------

TITLE: Build Linux ARM64 Manylinux Wheels
DESCRIPTION: This command builds release-mode manylinux wheels for ARM64 (aarch64) Linux. It uses `maturin` with `zig` for cross-compilation, targeting `manylinux2014` compatibility, and places the output wheels in the 'wheels' directory.
SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_24
LANGUAGE: Shell
CODE:
```
maturin build --release --zig \
    --target aarch64-unknown-linux-gnu \
    --compatibility manylinux2014 \
    --out wheels
```

----------------------------------------

TITLE: Join Multiple Lance Datasets with SQL in Rust DataFusion
DESCRIPTION: This Rust example illustrates how to register multiple Lance datasets (e.g., 'orders' and 'customers') as separate tables in DataFusion. It then performs a SQL `JOIN` operation between these tables to combine data based on a common key, demonstrating more complex query capabilities.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/integrations/datafusion.md#_snippet_1
LANGUAGE: rust
CODE:
```
use datafusion::prelude::SessionContext;
use crate::datafusion::LanceTableProvider;

let ctx = SessionContext::new();
ctx.register_table("orders", Arc::new(LanceTableProvider::new(
    Arc::new(orders_dataset.clone()),
    /* with_row_id */ false,
    /* with_row_addr */ false,
)))?;
ctx.register_table("customers", Arc::new(LanceTableProvider::new(
    Arc::new(customers_dataset.clone()),
    /* with_row_id */ false,
    /* with_row_addr */ false,
)))?;
let df = ctx.sql("
    SELECT o.order_id, o.amount, c.customer_name
    FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
    LIMIT 10
").await?;
let result = df.collect().await?;
```

----------------------------------------

TITLE: Generate Flame Graph from Process ID
DESCRIPTION: Generates a flame graph for a running process using its Process ID (PID). This command is used to capture and visualize CPU profiles, helping to identify performance bottlenecks in an application.
SOURCE: https://github.com/lancedb/lance/blob/__wiki__/Debug.md#_snippet_5
LANGUAGE: sh
CODE:
```
flamegraph -p <PID>
```

----------------------------------------

TITLE: Clone LanceDB GitHub Repository
DESCRIPTION: Instructions to clone the LanceDB project repository from GitHub to a local machine. This is the first step for setting up the development environment.
SOURCE: https://github.com/lancedb/lance/blob/main/java/README.md#_snippet_11
LANGUAGE: shell
CODE:
```
git clone https://github.com/lancedb/lance.git
```

----------------------------------------

TITLE: Rust Implementation of WikiTextBatchReader
DESCRIPTION: This Rust code defines `WikiTextBatchReader`, a custom implementation of `arrow::record_batch::RecordBatchReader`. It's designed to read text data from Parquet files, tokenize it using a `Tokenizer` from the `tokenizers` crate, and transform it into Arrow `RecordBatch`es.
The `process_batch` method handles tokenization, limits the number of samples, and shuffles the tokenized IDs before creating the final `RecordBatch`.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/rust/llm_dataset_creation.md#_snippet_0
LANGUAGE: rust
CODE:
```
use arrow::array::{Array, Int64Builder, ListBuilder, UInt32Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use arrow::record_batch::RecordBatchReader;
use futures::StreamExt;
use hf_hub::{api::sync::Api, Repo, RepoType};
use lance::dataset::WriteParams;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use rand::seq::SliceRandom;
use rand::SeedableRng;
use std::error::Error;
use std::fs::File;
use std::io::Write;
use std::sync::Arc;
use tempfile::NamedTempFile;
use tokenizers::Tokenizer;

// Implement a custom stream batch reader
struct WikiTextBatchReader {
    schema: Arc<Schema>,
    parquet_readers: Vec<Option<ParquetRecordBatchReaderBuilder<File>>>,
    current_reader_idx: usize,
    current_reader: Option<Box<dyn RecordBatchReader>>,
    tokenizer: Tokenizer,
    num_samples: u64,
    cur_samples_cnt: u64,
}

impl WikiTextBatchReader {
    fn new(
        parquet_readers: Vec<ParquetRecordBatchReaderBuilder<File>>,
        tokenizer: Tokenizer,
        num_samples: Option<u64>,
    ) -> Result<Self, Box<dyn Error>> {
        let schema = Arc::new(Schema::new(vec![Field::new(
            "input_ids",
            DataType::List(Arc::new(Field::new("item", DataType::Int64, true))),
            false,
        )]));
        Ok(Self {
            schema,
            parquet_readers: parquet_readers.into_iter().map(Some).collect(),
            current_reader_idx: 0,
            current_reader: None,
            tokenizer,
            num_samples: num_samples.unwrap_or(100_000),
            cur_samples_cnt: 0,
        })
    }

    fn process_batch(
        &mut self,
        input_batch: &RecordBatch,
    ) -> Result<RecordBatch, arrow::error::ArrowError> {
        let num_rows = input_batch.num_rows();
        let mut token_builder = ListBuilder::new(Int64Builder::with_capacity(num_rows * 1024)); // Pre-allocate space
        let mut should_break = false;

        let column = input_batch.column_by_name("text").unwrap();
        let string_array = column
            .as_any()
            .downcast_ref::<arrow::array::StringArray>()
            .unwrap();

        for i in 0..num_rows {
            if self.cur_samples_cnt >= self.num_samples {
                should_break = true;
                break;
            }
            if !Array::is_null(string_array, i) {
                let text = string_array.value(i);
                // Split paragraph into lines
                for line in text.split('\n') {
                    if let Ok(encoding) = self.tokenizer.encode(line, true) {
                        let tb_values = token_builder.values();
                        for &id in encoding.get_ids() {
                            tb_values.append_value(id as i64);
                        }
                        token_builder.append(true);
                        self.cur_samples_cnt += 1;
                        if self.cur_samples_cnt % 5000 == 0 {
                            println!("Processed {} rows", self.cur_samples_cnt);
                        }
                        if self.cur_samples_cnt >= self.num_samples {
                            should_break = true;
                            break;
                        }
                    }
                }
            }
        }

        // Create array and shuffle it
        let input_ids_array = token_builder.finish();
        // Create shuffled array by randomly sampling indices
        let mut rng = rand::rngs::StdRng::seed_from_u64(1337);
        let len = input_ids_array.len();
        let mut indices: Vec<u32> = (0..len as u32).collect();
        indices.shuffle(&mut rng);
        // Take values in shuffled order
        let indices_array = UInt32Array::from(indices);
        let shuffled = arrow::compute::take(&input_ids_array, &indices_array, None)?;

        let batch = RecordBatch::try_new(self.schema.clone(), vec![Arc::new(shuffled)]);

        if should_break {
            println!("Stop at {} rows", self.cur_samples_cnt);
            self.parquet_readers.clear();
            self.current_reader = None;
        }
        batch
    }
}

impl RecordBatchReader for WikiTextBatchReader {
    fn schema(&self) -> Arc<Schema> {
        self.schema.clone()
    }
}

impl Iterator for WikiTextBatchReader {
    type Item = Result<RecordBatch, arrow::error::ArrowError>;

    fn next(&mut self) -> Option<Self::Item> {
        loop {
            // If we have a current reader, try to get next batch
            if let Some(reader) = &mut self.current_reader {
                if let Some(batch_result) = reader.next() {
                    return Some(batch_result.and_then(|batch| self.process_batch(&batch)));
                }
            }

            // If no current reader or current reader is exhausted, try to get next reader
            if self.current_reader_idx < self.parquet_readers.len() {
```

----------------------------------------

TITLE: Set DYLD_LIBRARY_PATH for Lance Python Debugging in LLDB
DESCRIPTION: Configures the `DYLD_LIBRARY_PATH` environment variable specifically for debugging Lance Python projects
within LLDB. This ensures that the dynamic linker can find the necessary shared libraries located in the third-party distribution directory.

SOURCE: https://github.com/lancedb/lance/blob/__wiki__/Debug.md#_snippet_1

LANGUAGE: lldb
CODE:
```
# /path/to/lance/python/.lldbinit
env DYLD_LIBRARY_PATH=/path/to/thirdparty/dist/lib:${DYLD_LIBRARY_PATH}
```

----------------------------------------

TITLE: Download and extract MeCab Ipadic model
DESCRIPTION: This snippet downloads the gzipped tarball of the MeCab Ipadic model from GitHub and then extracts its contents using tar. This is the first step in preparing the dictionary for building.

SOURCE: https://github.com/lancedb/lance/blob/main/python/python/tests/models/lindera/README.md#_snippet_0

LANGUAGE: bash
CODE:
```
curl -L -o mecab-ipadic-2.7.0-20070801.tar.gz "https://github.com/lindera-morphology/mecab-ipadic/archive/refs/tags/2.7.0-20070801.tar.gz"
tar xvf mecab-ipadic-2.7.0-20070801.tar.gz
```

----------------------------------------

TITLE: Build user dictionary with Lindera
DESCRIPTION: This command demonstrates how to build a custom user dictionary using `lindera build`. It takes a CSV file as input and creates a new user dictionary, which can be used to extend the base language model.

SOURCE: https://github.com/lancedb/lance/blob/main/python/python/tests/models/lindera/README.md#_snippet_2

LANGUAGE: bash
CODE:
```
lindera build --build-user-dictionary --dictionary-kind=ipadic user_dict/userdict.csv user_dict2
```

----------------------------------------

TITLE: Download Jieba Language Model
DESCRIPTION: Command-line instruction to download the Jieba language model for use with LanceDB. The model is stored automatically in the default Jieba model directory within the configured language model home.
SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/tokenizer.md#_snippet_1

LANGUAGE: bash
CODE:
```
python -m lance.download jieba
```

----------------------------------------

TITLE: Read and Inspect Lance Dataset in Rust
DESCRIPTION: This Rust function `read_dataset` shows how to open an existing Lance dataset from a given path. It uses a `scanner` to create a `batch_stream` and then iterates through each `RecordBatch`, printing its number of rows and columns, its schema, and the entire batch content.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/rust/write_read_dataset.md#_snippet_1

LANGUAGE: Rust
CODE:
```
// Reads dataset from the given path and prints batch size, schema for all record batches.
// Also extracts and prints a slice from the first batch
async fn read_dataset(data_path: &str) {
    let dataset = Dataset::open(data_path).await.unwrap();
    let scanner = dataset.scan();
    let mut batch_stream = scanner.try_into_stream().await.unwrap().map(|b| b.unwrap());
    while let Some(batch) = batch_stream.next().await {
        println!("Batch size: {}, {}", batch.num_rows(), batch.num_columns()); // print size of batch
        println!("Schema: {:?}", batch.schema()); // print schema of recordbatch
        println!("Batch: {:?}", batch); // print the entire recordbatch (schema and data)
    }
} // End read dataset
```

----------------------------------------

TITLE: Create a Lance Dataset from Arrow RecordBatches in Rust
DESCRIPTION: Demonstrates how to write a collection of Arrow RecordBatches and an Arrow Schema into a new Lance Dataset. It uses default write parameters and an iterator over the batches.
SOURCE: https://github.com/lancedb/lance/blob/main/rust/lance/README.md#_snippet_1

LANGUAGE: rust
CODE:
```
use lance::{dataset::WriteParams, Dataset};

let write_params = WriteParams::default();
let mut reader = RecordBatchIterator::new(batches.into_iter().map(Ok), schema);
Dataset::write(reader, &uri, Some(write_params)).await.unwrap();
```

----------------------------------------

TITLE: Build Ipadic language model with Lindera
DESCRIPTION: This command uses the `lindera build` tool to compile the Ipadic dictionary. It specifies the dictionary kind as `ipadic` and points to the extracted model directory to create the main dictionary.

SOURCE: https://github.com/lancedb/lance/blob/main/python/python/tests/models/lindera/README.md#_snippet_1

LANGUAGE: bash
CODE:
```
lindera build --dictionary-kind=ipadic mecab-ipadic-2.7.0-20070801 main
```

----------------------------------------

TITLE: Write Lance Dataset in Rust
DESCRIPTION: This Rust function `write_dataset` demonstrates how to create and write a Lance dataset to a specified path. It defines a schema with `UInt32` fields, creates a `RecordBatch` with sample data, and uses `WriteParams` to set the write mode to `Overwrite` before writing the dataset to disk.

SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/rust/write_read_dataset.md#_snippet_0

LANGUAGE: Rust
CODE:
```
// Writes sample dataset to the given path
async fn write_dataset(data_path: &str) {
    // Define new schema
    let schema = Arc::new(Schema::new(vec![
        Field::new("key", DataType::UInt32, false),
        Field::new("value", DataType::UInt32, false),
    ]));
    // Create new record batches
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![
            Arc::new(UInt32Array::from(vec![1, 2, 3, 4, 5, 6])),
            Arc::new(UInt32Array::from(vec![6, 7, 8, 9, 10, 11])),
        ],
    )
    .unwrap();
    let batches = RecordBatchIterator::new([Ok(batch)], schema.clone());
    // Define write parameters (e.g.
    // overwrite dataset)
    let write_params = WriteParams {
        mode: WriteMode::Overwrite,
        ..Default::default()
    };
    Dataset::write(batches, data_path, Some(write_params))
        .await
        .unwrap();
} // End write dataset
```

----------------------------------------

TITLE: Build LanceDB Rust JNI Module
DESCRIPTION: Specifies the command to build only the Rust-based JNI (Java Native Interface) module of LanceDB. This is useful for developers focusing on the native components without rebuilding the entire Java project.

SOURCE: https://github.com/lancedb/lance/blob/main/java/README.md#_snippet_10

LANGUAGE: shell
CODE:
```
cargo build
```

----------------------------------------

TITLE: Read a Lance Dataset and Collect RecordBatches in Rust
DESCRIPTION: Opens an existing Lance Dataset from a specified path, scans its content, and collects all resulting RecordBatches into a vector.

SOURCE: https://github.com/lancedb/lance/blob/main/rust/lance/README.md#_snippet_2

LANGUAGE: rust
CODE:
```
let dataset = Dataset::open(path).await.unwrap();
let mut scanner = dataset.scan();
let batches: Vec<RecordBatch> = scanner
    .try_into_stream()
    .await
    .unwrap()
    .map(|b| b.unwrap())
    .collect::<Vec<_>>()
    .await;
```

----------------------------------------

TITLE: Create a Vector Index on a Lance Dataset in Rust
DESCRIPTION: Demonstrates how to create a vector index on a specified column (e.g., 'embeddings') within a Lance Dataset. It configures vector index parameters like the number of partitions and sub-vectors, noting potential alignment requirements.
SOURCE: https://github.com/lancedb/lance/blob/main/rust/lance/README.md#_snippet_4

LANGUAGE: rust
CODE:
```
use ::lance::index::vector::VectorIndexParams;

let mut params = VectorIndexParams::default();
params.num_partitions = 256;
params.num_sub_vectors = 16;

// this will Err if list_size(embeddings) / num_sub_vectors does not meet simd alignment
dataset.create_index(&["embeddings"], IndexType::Vector, None, &params, true).await;
```

----------------------------------------

TITLE: Retrieve Specific Records from a Lance Dataset in Rust
DESCRIPTION: Retrieves specific records from a Lance Dataset based on their indices and a projection. The result is a RecordBatch containing the requested data.

SOURCE: https://github.com/lancedb/lance/blob/main/rust/lance/README.md#_snippet_3

LANGUAGE: rust
CODE:
```
let values: Result<RecordBatch> = dataset.take(&[200, 199, 39, 40, 100], &projection).await;
```
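----------------------------------------

The alignment remark in the vector-index snippet above reduces to simple arithmetic: product quantization splits each embedding of dimension `list_size(embeddings)` into `num_sub_vectors` equal slices, so the dimension must divide evenly by `num_sub_vectors` (and the resulting slice length must suit the SIMD kernels). The sketch below illustrates only that divisibility check; the helper `sub_vector_len` is a hypothetical name for this note, not part of the Lance API.

```rust
/// Returns the per-sub-vector length if `dim` splits evenly across
/// `num_sub_vectors`, else None. Illustrative only; Lance performs its own
/// validation inside index creation.
fn sub_vector_len(dim: usize, num_sub_vectors: usize) -> Option<usize> {
    if num_sub_vectors == 0 || dim % num_sub_vectors != 0 {
        None
    } else {
        Some(dim / num_sub_vectors)
    }
}

fn main() {
    // 128-dimensional embeddings with num_sub_vectors = 16: each sub-vector
    // covers 8 components, so index creation can proceed.
    assert_eq!(sub_vector_len(128, 16), Some(8));
    // 100 does not divide evenly by 16, the case the snippet's comment warns
    // would Err.
    assert_eq!(sub_vector_len(100, 16), None);
    println!("ok");
}
```

With 768-dimensional embeddings (a common transformer output size), `num_sub_vectors = 16` gives 48 components per sub-vector, which also splits evenly.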