| Based on your request, here is a copy of the provided code snippets filtered to include only those relevant to the Rust ecosystem. This includes snippets written in Rust, shell commands for building or testing Rust code (e.g., using `cargo` or `maturin`), and configurations for native development tools like `lldb`.
 | |
| 
 | |
| ========================
 | |
| CODE SNIPPETS
 | |
| ========================
 | |
| TITLE: Perform Python development installation
 | |
| DESCRIPTION: These commands navigate into the `python` directory and perform a development installation of the Lance Python bindings. This allows developers to import and test changes to the Python wrapper directly.
 | |
| 
 | |
| SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/community/contributing.md#_snippet_1
 | |
| 
 | |
| LANGUAGE: bash
 | |
| CODE:
 | |
| ```
 | |
| cd python
 | |
| maturin develop
 | |
| ```
 | |
| 
 | |
| ----------------------------------------
 | |
| 
 | |
| TITLE: Install Python bindings build tool
 | |
| DESCRIPTION: This command installs `maturin`, a tool essential for building Python packages that integrate with Rust code. It's a prerequisite for setting up the Python development environment for Lance.
 | |
| 
 | |
| SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/community/contributing.md#_snippet_0
 | |
| 
 | |
| LANGUAGE: bash
 | |
| CODE:
 | |
| ```
 | |
| pip install maturin
 | |
| ```
 | |
| 
 | |
| ----------------------------------------
 | |
| 
 | |
| TITLE: Install Linux Perf Tools and Configure Kernel Parameters
 | |
| DESCRIPTION: Installs necessary Linux performance tools (`perf`) on Ubuntu systems and configures the `perf_event_paranoid` kernel parameter. This setup is crucial for allowing non-root users to collect performance data using tools like `perf` and `flamegraph`.
 | |
| 
 | |
| SOURCE: https://github.com/lancedb/lance/blob/__wiki__/Debug.md#_snippet_4
 | |
| 
 | |
| LANGUAGE: sh
 | |
| CODE:
 | |
| ```
 | |
| sudo apt install linux-tools-common linux-tools-generic linux-tools-`uname -r`
 | |
| sudo sh -c "echo -1 >  /proc/sys/kernel/perf_event_paranoid"
 | |
| ```
 | |
| 
 | |
| ----------------------------------------
 | |
| 
 | |
| TITLE: Run Rust unit tests
 | |
| DESCRIPTION: This command executes the unit tests for the Rust core format. Running these tests verifies the correctness of the Rust implementation.
 | |
| 
 | |
| SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/community/contributing.md#_snippet_6
 | |
| 
 | |
| LANGUAGE: bash
 | |
| CODE:
 | |
| ```
 | |
| cargo test
 | |
| ```
 | |
| 
 | |
| ----------------------------------------
 | |
| 
 | |
| TITLE: Profile a Lance benchmark using flamegraph
 | |
| DESCRIPTION: Generates a flamegraph for a specific benchmark using `cargo-flamegraph`, aiding in performance analysis. It's recommended to run benchmarks once beforehand to avoid setup time being captured in the profile.
 | |
| 
 | |
| SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_14
 | |
| 
 | |
| LANGUAGE: shell
 | |
| CODE:
 | |
| ```
 | |
| flamegraph -F 100 --no-inline -- $(which python) \
 | |
|     -m pytest python/benchmarks \
 | |
|     --benchmark-min-time=2 \
 | |
|     -k test_ivf_pq_index_search
 | |
| ```
 | |
| 
 | |
| ----------------------------------------
 | |
| 
 | |
| TITLE: Install Flamegraph Tool
 | |
| DESCRIPTION: Installs the `flamegraph` profiling tool using Cargo, Rust's package manager. This tool is essential for visualizing CPU usage and call stacks as flame graphs for performance analysis.
 | |
| 
 | |
| SOURCE: https://github.com/lancedb/lance/blob/__wiki__/Debug.md#_snippet_3
 | |
| 
 | |
| LANGUAGE: sh
 | |
| CODE:
 | |
| ```
 | |
| cargo install flamegraph
 | |
| ```
 | |
| 
 | |
| ----------------------------------------
 | |
| 
 | |
| TITLE: Install Lance Build Dependencies on Ubuntu
 | |
| DESCRIPTION: This command installs necessary system-level dependencies for building Lance on Ubuntu 22.04, including protobuf, SSL development libraries, and general build tools.
 | |
| 
 | |
| SOURCE: https://github.com/lancedb/lance/blob/main/__wiki__/How-to-Build.md#_snippet_0
 | |
| 
 | |
| LANGUAGE: bash
 | |
| CODE:
 | |
| ```
 | |
| sudo apt install protobuf-compiler libssl-dev build-essential pkg-config gfortran
 | |
| ```
 | |
| 
 | |
| ----------------------------------------
 | |
| 
 | |
| TITLE: Build Rust core format (release)
 | |
| DESCRIPTION: This command compiles the Rust core format in release mode. The release build is optimized for performance and is suitable for production deployments or benchmarking.
 | |
| 
 | |
| SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/community/contributing.md#_snippet_5
 | |
| 
 | |
| LANGUAGE: bash
 | |
| CODE:
 | |
| ```
 | |
| cargo build -r
 | |
| ```
 | |
| 
 | |
| ----------------------------------------
 | |
| 
 | |
| TITLE: Debug Python Script with LLDB
 | |
| DESCRIPTION: Demonstrates how to start an LLDB debugging session for a Python script. It involves launching LLDB with the Python interpreter from a virtual environment and then running the target script within the LLDB prompt.
 | |
| 
 | |
| SOURCE: https://github.com/lancedb/lance/blob/__wiki__/Debug.md#_snippet_2
 | |
| 
 | |
| LANGUAGE: sh
 | |
| CODE:
 | |
| ```
 | |
| $ lldb ./venv/bin/python
 | |
| (lldb) r script.py
 | |
| ```
 | |
| 
 | |
| ----------------------------------------
 | |
| 
 | |
| TITLE: Install Lance Build Dependencies on Mac
 | |
| DESCRIPTION: This command installs the protobuf compiler using Homebrew, a required dependency for building Lance on macOS.
 | |
| 
 | |
| SOURCE: https://github.com/lancedb/lance/blob/main/__wiki__/How-to-Build.md#_snippet_1
 | |
| 
 | |
| LANGUAGE: bash
 | |
| CODE:
 | |
| ```
 | |
| brew install protobuf
 | |
| ```
 | |
| 
 | |
| ----------------------------------------
 | |
| 
 | |
| TITLE: Configure LLDB Initialization Settings
 | |
| DESCRIPTION: Sets up basic LLDB initialization settings in the `~/.lldbinit` file. This includes configuring the number of source code lines to display before and after a stop, and enabling the loading of `.lldbinit` files from the current working directory.
 | |
| 
 | |
| SOURCE: https://github.com/lancedb/lance/blob/__wiki__/Debug.md#_snippet_0
 | |
| 
 | |
| LANGUAGE: lldb
 | |
| CODE:
 | |
| ```
 | |
| # ~/.lldbinit
 | |
| settings set stop-line-count-before 15
 | |
| settings set stop-line-count-after 15
 | |
| settings set target.load-cwd-lldbinit true
 | |
| ```
 | |
| 
 | |
| ----------------------------------------
 | |
| 
 | |
| TITLE: Complete Lance Dataset Write and Read Example in Rust
 | |
| DESCRIPTION: This Rust `main` function provides a complete example demonstrating the usage of `write_dataset` and `read_dataset` functions. It sets up the necessary `arrow` and `lance` imports, defines a temporary data path, and orchestrates the writing and subsequent reading of a Lance dataset.
 | |
| 
 | |
| SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/rust/write_read_dataset.md#_snippet_2
 | |
| 
 | |
| LANGUAGE: Rust
 | |
| CODE:
 | |
| ```
 | |
| use arrow::array::UInt32Array;
 | |
| use arrow::datatypes::{DataType, Field, Schema};
 | |
| use arrow::record_batch::{RecordBatch, RecordBatchIterator};
 | |
| use futures::StreamExt;
 | |
| use lance::dataset::{WriteMode, WriteParams};
 | |
| use lance::Dataset;
 | |
| use std::sync::Arc;
 | |
| 
 | |
| #[tokio::main]
 | |
| async fn main() {
 | |
|     let data_path: &str = "./temp_data.lance";
 | |
| 
 | |
|     write_dataset(data_path).await;
 | |
|     read_dataset(data_path).await;
 | |
| }
 | |
| ```
 | |
| 
 | |
| ----------------------------------------
 | |
| 
 | |
| TITLE: Rust: Main Workflow for WikiText to Lance Dataset Ingestion
 | |
| DESCRIPTION: This comprehensive example demonstrates the full data ingestion pipeline in Rust. It initializes a Tokio runtime, loads a tokenizer, sets up the Hugging Face API to download WikiText Parquet files, processes them into a `WikiTextBatchReader`, and finally writes the data to a Lance dataset. It also includes verification of the created dataset.
 | |
| 
 | |
| SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/rust/llm_dataset_creation.md#_snippet_2
 | |
| 
 | |
| LANGUAGE: Rust
 | |
| CODE:
 | |
| ```
 | |
| fn main() -> Result<(), Box<dyn Error + Send + Sync>> {
 | |
|     let rt = tokio::runtime::Runtime::new()?;
 | |
|     rt.block_on(async {
 | |
|         // Load tokenizer
 | |
|         let tokenizer = load_tokenizer("gpt2")?;
 | |
| 
 | |
|         // Set up Hugging Face API
 | |
|         // Download from https://huggingface.co/datasets/Salesforce/wikitext/tree/main/wikitext-103-raw-v1
 | |
|         let api = Api::new()?;
 | |
|         let repo = api.repo(Repo::with_revision(
 | |
|             "Salesforce/wikitext".into(),
 | |
|             RepoType::Dataset,
 | |
|             "main".into(),
 | |
|         ));
 | |
| 
 | |
|         // Define the parquet files we want to download
 | |
|         let train_files = vec![
 | |
|             "wikitext-103-raw-v1/train-00000-of-00002.parquet",
 | |
|             "wikitext-103-raw-v1/train-00001-of-00002.parquet",
 | |
|         ];
 | |
| 
 | |
|         let mut parquet_readers = Vec::new();
 | |
|         for file in &train_files {
 | |
|             println!("Downloading file: {}", file);
 | |
|             let file_path = repo.get(file)?;
 | |
|             let data = std::fs::read(file_path)?;
 | |
| 
 | |
|             // Create a temporary file in the system temp directory and write the downloaded data to it
 | |
|             let mut temp_file = NamedTempFile::new()?;
 | |
|             temp_file.write_all(&data)?;
 | |
| 
 | |
|             // Create the parquet reader builder with a larger batch size
 | |
|             let builder = ParquetRecordBatchReaderBuilder::try_new(temp_file.into_file())?
 | |
|                 .with_batch_size(8192); // Increase batch size for better performance
 | |
|             parquet_readers.push(builder);
 | |
|         }
 | |
| 
 | |
|         if parquet_readers.is_empty() {
 | |
|             println!("No parquet files found to process.");
 | |
|             return Ok(());
 | |
|         }
 | |
| 
 | |
|         // Create batch reader
 | |
|         let num_samples: u64 = 500_000;
 | |
|         let batch_reader = WikiTextBatchReader::new(parquet_readers, tokenizer, Some(num_samples))?;
 | |
| 
 | |
|         // Save as Lance dataset
 | |
|         println!("Writing to Lance dataset...");
 | |
|         let lance_dataset_path = "rust_wikitext_lance_dataset.lance";
 | |
| 
 | |
|         let write_params = WriteParams::default();
 | |
|         lance::Dataset::write(batch_reader, lance_dataset_path, Some(write_params)).await?;
 | |
| 
 | |
|         // Verify the dataset
 | |
|         let ds = lance::Dataset::open(lance_dataset_path).await?;
 | |
|         let scanner = ds.scan();
 | |
|         let mut stream = scanner.try_into_stream().await?;
 | |
| 
 | |
|         let mut total_rows = 0;
 | |
|         while let Some(batch_result) = stream.next().await {
 | |
|             let batch = batch_result?;
 | |
|             total_rows += batch.num_rows();
 | |
|         }
 | |
| 
 | |
|         println!(
 | |
|             "Lance dataset created successfully with {} rows",
 | |
|             total_rows
 | |
|         );
 | |
|         println!("Dataset location: {}", lance_dataset_path);
 | |
| 
 | |
|         Ok(())
 | |
|     })
 | |
| }
 | |
| ```
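The verification step at the end of the workflow above sums `num_rows()` over a stream of fallible batches and stops at the first error. A minimal sketch of that pattern using only the standard library follows; `Batch` and `count_rows` are hypothetical stand-ins introduced for illustration, not types or functions from `lance` or `arrow`, and a plain iterator stands in for the async stream.

```rust
use std::error::Error;

/// Hypothetical stand-in for a record batch: only the row count matters here.
struct Batch {
    num_rows: usize,
}

/// Sum the rows of every batch, returning the first error encountered.
fn count_rows<I>(batches: I) -> Result<usize, Box<dyn Error + Send + Sync>>
where
    I: IntoIterator<Item = Result<Batch, Box<dyn Error + Send + Sync>>>,
{
    let mut total = 0;
    for batch_result in batches {
        // `?` propagates a failed batch to the caller, as in the loop above.
        let batch = batch_result?;
        total += batch.num_rows;
    }
    Ok(total)
}

fn main() -> Result<(), Box<dyn Error + Send + Sync>> {
    let batches = vec![Ok(Batch { num_rows: 3 }), Ok(Batch { num_rows: 4 })];
    let total = count_rows(batches)?;
    println!("total rows: {}", total); // prints "total rows: 7"
    Ok(())
}
```

Returning `Box<dyn Error + Send + Sync>` matches the error type used by `main` in the example, so the same `?` operator works for errors from different libraries.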
 | |
| 
 | |
| ----------------------------------------
 | |
| 
 | |
| TITLE: Build and Test Pylance Python Package
 | |
| DESCRIPTION: These commands set up a Python virtual environment, install maturin for building the Rust-Python bindings, build the Pylance package in debug mode, and then run its associated tests.
 | |
| 
 | |
| SOURCE: https://github.com/lancedb/lance/blob/main/__wiki__/How-to-Build.md#_snippet_3
 | |
| 
 | |
| LANGUAGE: bash
 | |
| CODE:
 | |
| ```
 | |
| cd python
 | |
| python3 -m venv venv
 | |
| source venv/bin/activate
 | |
| 
 | |
| pip install maturin
 | |
| 
 | |
| # Build debug build
 | |
| maturin develop --extras tests
 | |
| 
 | |
| # Run pytest
 | |
| pytest python/tests/
 | |
| ```
 | |
| 
 | |
| ----------------------------------------
 | |
| 
 | |
| TITLE: Install Lance using Cargo
 | |
| DESCRIPTION: Installs the Lance Rust library as a command-line tool using the Cargo package manager.
 | |
| 
 | |
| SOURCE: https://github.com/lancedb/lance/blob/main/rust/lance/README.md#_snippet_0
 | |
| 
 | |
| LANGUAGE: shell
 | |
| CODE:
 | |
| ```
 | |
| cargo install lance
 | |
| ```
 | |
| 
 | |
| ----------------------------------------
 | |
| 
 | |
| TITLE: Build pylance in release mode for benchmarks
 | |
| DESCRIPTION: Builds the `pylance` module in release mode with debug symbols, enabling benchmark execution and profiling. It includes benchmark-specific extras and features for data generation.
 | |
| 
 | |
| SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_10
 | |
| 
 | |
| LANGUAGE: shell
 | |
| CODE:
 | |
| ```
 | |
| maturin develop --profile release-with-debug --extras benchmarks --features datagen
 | |
| ```
 | |
| 
 | |
| ----------------------------------------
 | |
| 
 | |
| TITLE: Query Lance Dataset with Simple SQL in Rust DataFusion
 | |
| DESCRIPTION: This Rust example demonstrates how to register a Lance dataset as a table in DataFusion using `LanceTableProvider` and execute a simple SQL `SELECT` query to retrieve the first 10 rows. It shows the basic setup for integrating Lance with DataFusion.
 | |
| 
 | |
| SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/integrations/datafusion.md#_snippet_0
 | |
| 
 | |
| LANGUAGE: rust
 | |
| CODE:
 | |
| ```
 | |
| use datafusion::prelude::SessionContext;
 | |
| use crate::datafusion::LanceTableProvider;
 | |
| 
 | |
| let ctx = SessionContext::new();
 | |
| 
 | |
| ctx.register_table("dataset",
 | |
|     Arc::new(LanceTableProvider::new(
 | |
|     Arc::new(dataset.clone()),
 | |
|     /* with_row_id */ false,
 | |
|     /* with_row_addr */ false,
 | |
|     )))?;
 | |
| 
 | |
| let df = ctx.sql("SELECT * FROM dataset LIMIT 10").await?;
 | |
| let result = df.collect().await?;
 | |
| ```
 | |
| 
 | |
| ----------------------------------------
 | |
| 
 | |
| TITLE: Run Lance code formatters
 | |
| DESCRIPTION: Applies code formatting rules to the entire project. Specific commands like `make format-python` or `cargo fmt` can be used for language-specific formatting.
 | |
| 
 | |
| SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_4
 | |
| 
 | |
| LANGUAGE: shell
 | |
| CODE:
 | |
| ```
 | |
| make format
 | |
| ```
 | |
| 
 | |
| ----------------------------------------
 | |
| 
 | |
| TITLE: Build and Search HNSW Index for Vector Similarity in Rust
 | |
| DESCRIPTION: This Rust code provides a complete example for vector similarity search. It defines a `ground_truth` function for L2 distance calculation, `create_test_vector_dataset` to generate synthetic fixed-size list vectors, and a `main` function that orchestrates the process. The `main` function generates or loads a dataset, builds an HNSW index using `lance_index::vector::hnsw`, and then performs vector searches, measuring construction and search times, and calculating recall against ground truth.
 | |
| 
 | |
| SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/rust/hnsw.md#_snippet_0
 | |
| 
 | |
| LANGUAGE: Rust
 | |
| CODE:
 | |
| ```
 | |
| use std::collections::HashSet;
 | |
| use std::sync::Arc;
 | |
| 
 | |
| use arrow::array::{types::Float32Type, Array, FixedSizeListArray};
 | |
| use arrow::array::{AsArray, FixedSizeListBuilder, Float32Builder};
 | |
| use arrow::datatypes::{DataType, Field, Schema};
 | |
| use arrow::record_batch::RecordBatch;
 | |
| use arrow::record_batch::RecordBatchIterator;
 | |
| use arrow_select::concat::concat;
 | |
| use futures::stream::StreamExt;
 | |
| use lance::Dataset;
 | |
| use lance_index::vector::v3::subindex::IvfSubIndex;
 | |
| use lance_index::vector::{
 | |
|     flat::storage::FlatFloatStorage,
 | |
|     hnsw::{builder::HnswBuildParams, HNSW},
 | |
| };
 | |
| use lance_linalg::distance::DistanceType;
 | |
| 
 | |
| fn ground_truth(fsl: &FixedSizeListArray, query: &[f32], k: usize) -> HashSet<u32> {
 | |
|     let mut dists = vec![];
 | |
|     for i in 0..fsl.len() {
 | |
|         let dist = lance_linalg::distance::l2_distance(
 | |
|             query,
 | |
|             fsl.value(i).as_primitive::<Float32Type>().values(),
 | |
|         );
 | |
|         dists.push((dist, i as u32));
 | |
|     }
 | |
|     dists.sort_by(|a, b| a.0.partial_cmp(&b.0).unwrap());
 | |
|     dists.truncate(k);
 | |
|     dists.into_iter().map(|(_, i)| i).collect()
 | |
| }
 | |
| 
 | |
| pub async fn create_test_vector_dataset(output: &str, num_rows: usize, dim: i32) {
 | |
|     let schema = Arc::new(Schema::new(vec![Field::new(
 | |
|         "vector",
 | |
|         DataType::FixedSizeList(Arc::new(Field::new("item", DataType::Float32, true)), dim),
 | |
|         false,
 | |
|     )]));
 | |
| 
 | |
|     let mut batches = Vec::new();
 | |
| 
 | |
|     // Create a few batches
 | |
|     for _ in 0..2 {
 | |
|         let v_builder = Float32Builder::new();
 | |
|         let mut list_builder = FixedSizeListBuilder::new(v_builder, dim);
 | |
| 
 | |
|         for _ in 0..num_rows {
 | |
|             for _ in 0..dim {
 | |
|                 list_builder.values().append_value(rand::random::<f32>());
 | |
|             }
 | |
|             list_builder.append(true);
 | |
|         }
 | |
|         let array = Arc::new(list_builder.finish());
 | |
|         let batch = RecordBatch::try_new(schema.clone(), vec![array]).unwrap();
 | |
|         batches.push(batch);
 | |
|     }
 | |
|     let batch_reader = RecordBatchIterator::new(batches.into_iter().map(Ok), schema.clone());
 | |
|     println!("Writing dataset to {}", output);
 | |
|     Dataset::write(batch_reader, output, None).await.unwrap();
 | |
| }
 | |
| 
 | |
| #[tokio::main]
 | |
| async fn main() {
 | |
|     let uri: Option<String> = None; // None means generate test data
 | |
|     let column = "vector";
 | |
|     let ef = 100;
 | |
|     let max_edges = 30;
 | |
|     let max_level = 7;
 | |
| 
 | |
|     // 1. Generate synthetic test data with the specified dimensions
 | |
|     let dataset = if uri.is_none() {
 | |
|         println!("No uri is provided, generating test dataset...");
 | |
|         let output = "test_vectors.lance";
 | |
|         create_test_vector_dataset(output, 1000, 64).await;
 | |
|         Dataset::open(output).await.expect("Failed to open dataset")
 | |
|     } else {
 | |
|         Dataset::open(uri.as_ref().unwrap())
 | |
|             .await
 | |
|             .expect("Failed to open dataset")
 | |
|     };
 | |
| 
 | |
|     println!("Dataset schema: {:#?}", dataset.schema());
 | |
|     let batches = dataset
 | |
|         .scan()
 | |
|         .project(&[column])
 | |
|         .unwrap()
 | |
|         .try_into_stream()
 | |
|         .await
 | |
|         .unwrap()
 | |
|         .then(|batch| async move { batch.unwrap().column_by_name(column).unwrap().clone() })
 | |
|         .collect::<Vec<_>>()
 | |
|         .await;
 | |
|     let arrs = batches.iter().map(|b| b.as_ref()).collect::<Vec<_>>();
 | |
|     let fsl = concat(&arrs).unwrap().as_fixed_size_list().clone();
 | |
|     println!("Loaded {} vectors", fsl.len());
 | |
| 
 | |
|     let vector_store = Arc::new(FlatFloatStorage::new(fsl.clone(), DistanceType::L2));
 | |
| 
 | |
|     let q = fsl.value(0);
 | |
|     let k = 10;
 | |
|     let gt = ground_truth(&fsl, q.as_primitive::<Float32Type>().values(), k);
 | |
| 
 | |
|     for ef_construction in [15, 30, 50] {
 | |
|         let now = std::time::Instant::now();
 | |
|         // 2. Build a hierarchical graph structure for efficient vector search using Lance API
 | |
|         let hnsw = HNSW::index_vectors(
 | |
|             vector_store.as_ref(),
 | |
|             HnswBuildParams::default()
 | |
|                 .max_level(max_level)
 | |
|                 .num_edges(max_edges)
 | |
|                 .ef_construction(ef_construction),
 | |
|         )
 | |
|         .unwrap();
 | |
|         let construct_time = now.elapsed().as_secs_f32();
 | |
|         let now = std::time::Instant::now();
 | |
|         // 3. Perform vector search with different parameters and compute the ground truth using L2 distance search
 | |
|         let results: HashSet<u32> = hnsw
 | |
|             .search_basic(q.clone(), k, ef, None, vector_store.as_ref())
 | |
|             .unwrap()
 | |
|             .iter()
 | |
|             .map(|node| node.id)
 | |
|             .collect();
 | |
|         let search_time = now.elapsed().as_micros();
 | |
|         println!(
 | |
|             "level={}, ef_construct={}, ef={} recall={}: construct={:.3}s search={:.3} us",
 | |
|             max_level,
 | |
|             ef_construction,
 | |
|             ef,
 | |
|             results.intersection(&gt).count() as f32 / k as f32,
 | |
|             construct_time,
 | |
|             search_time
 | |
|         );
 | |
|     }
 | |
| }
 | |
| ```
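The recall value printed by the example above is the fraction of the exact top-k neighbors that the approximate search also returns. The sketch below reproduces that calculation with only the standard library, under some assumptions: `l2_squared`, `brute_force_top_k`, and `recall` are hypothetical helpers (not part of `lance_linalg` or `lance_index`), vectors are plain `Vec<f32>` rather than Arrow arrays, and `f32::total_cmp` is swapped in for `partial_cmp(...).unwrap()` so a NaN distance cannot panic the sort.

```rust
use std::collections::HashSet;

/// Squared L2 distance between two equal-length vectors (hypothetical helper).
fn l2_squared(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Exact top-k neighbor ids, found by scanning every vector.
fn brute_force_top_k(vectors: &[Vec<f32>], query: &[f32], k: usize) -> HashSet<u32> {
    let mut dists: Vec<(f32, u32)> = vectors
        .iter()
        .enumerate()
        .map(|(i, v)| (l2_squared(query, v), i as u32))
        .collect();
    // Sort ascending by distance; total_cmp defines a total order on f32.
    dists.sort_by(|a, b| a.0.total_cmp(&b.0));
    dists.truncate(k);
    dists.into_iter().map(|(_, id)| id).collect()
}

/// Fraction of ground-truth ids that also appear in the approximate result.
fn recall(approx: &HashSet<u32>, truth: &HashSet<u32>) -> f32 {
    if truth.is_empty() {
        return 1.0;
    }
    approx.intersection(truth).count() as f32 / truth.len() as f32
}

fn main() {
    let vectors: Vec<Vec<f32>> = vec![
        vec![0.0, 0.0],
        vec![1.0, 0.0],
        vec![0.0, 2.0],
        vec![5.0, 5.0],
    ];
    // The exact 2 nearest neighbors of the origin are ids 0 and 1.
    let truth = brute_force_top_k(&vectors, &[0.0, 0.0], 2);
    // Pretend an approximate index returned ids 0 and 2 instead.
    let approx: HashSet<u32> = vec![0u32, 2].into_iter().collect();
    println!("recall@2 = {}", recall(&approx, &truth)); // prints "recall@2 = 0.5"
}
```

Squared distance is enough for ranking because the square root is monotonic, so it does not change the neighbor order.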
 | |
| 
 | |
| ----------------------------------------
 | |
| 
 | |
| TITLE: Compare Lance benchmarks against the main branch
 | |
| DESCRIPTION: Provides a sequence of commands to compare the performance of the current version against the `main` branch. This involves saving a baseline from `main` and then comparing the current branch's performance against it.
 | |
| 
 | |
| SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_15
 | |
| 
 | |
| LANGUAGE: shell
 | |
| CODE:
 | |
| ```
 | |
| CURRENT_BRANCH=$(git branch --show-current)
 | |
| ```
 | |
| 
 | |
| LANGUAGE: shell
 | |
| CODE:
 | |
| ```
 | |
| git checkout main
 | |
| ```
 | |
| 
 | |
| LANGUAGE: shell
 | |
| CODE:
 | |
| ```
 | |
| maturin develop --profile release-with-debug  --features datagen
 | |
| ```
 | |
| 
 | |
| LANGUAGE: shell
 | |
| CODE:
 | |
| ```
 | |
| pytest --benchmark-save=baseline python/benchmarks -m "not slow"
 | |
| ```
 | |
| 
 | |
| LANGUAGE: shell
 | |
| CODE:
 | |
| ```
 | |
| COMPARE_ID=$(ls .benchmarks/*/ | tail -1 | cut -c1-4)
 | |
| ```
 | |
| 
 | |
| LANGUAGE: shell
 | |
| CODE:
 | |
| ```
 | |
| git checkout $CURRENT_BRANCH
 | |
| ```
 | |
| 
 | |
| LANGUAGE: shell
 | |
| CODE:
 | |
| ```
 | |
| maturin develop --profile release-with-debug  --features datagen
 | |
| ```
 | |
| 
 | |
| LANGUAGE: shell
 | |
| CODE:
 | |
| ```
 | |
| pytest --benchmark-compare=$COMPARE_ID python/benchmarks -m "not slow"
 | |
| ```
 | |
| 
 | |
| ----------------------------------------
 | |
| 
 | |
| TITLE: Build Rust core format (debug)
 | |
| DESCRIPTION: This command compiles the Rust core format in debug mode. The debug build includes debugging information and is suitable for development and testing, though it is not optimized for performance.
 | |
| 
 | |
| SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/community/contributing.md#_snippet_4
 | |
| 
 | |
| LANGUAGE: bash
 | |
| CODE:
 | |
| ```
 | |
| cargo build
 | |
| ```
 | |
| 
 | |
| ----------------------------------------
 | |
| 
 | |
| TITLE: Format and lint Rust code
 | |
| DESCRIPTION: These commands are used to automatically format Rust code according to community standards (`cargo fmt`) and to perform static analysis for potential issues (`cargo clippy`). This ensures code quality and consistency.
 | |
| 
 | |
| SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/community/contributing.md#_snippet_3
 | |
| 
 | |
| LANGUAGE: bash
 | |
| CODE:
 | |
| ```
 | |
| cargo fmt --all
 | |
| cargo clippy --all-features --tests --benches
 | |
| ```
 | |
| 
 | |
| ----------------------------------------
 | |
| 
 | |
| TITLE: Run Lance code linters
 | |
| DESCRIPTION: Executes code linters to check for style violations and potential issues. Language-specific linting can be performed with `make lint-python` or `make lint-rust`.
 | |
| 
 | |
| SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_5
 | |
| 
 | |
| LANGUAGE: shell
 | |
| CODE:
 | |
| ```
 | |
| make lint
 | |
| ```
 | |
| 
 | |
| ----------------------------------------
 | |
| 
 | |
| TITLE: Clean Lance build artifacts
 | |
| DESCRIPTION: Removes all generated build artifacts and temporary files from the project directory, useful for a clean rebuild.
 | |
| 
 | |
| SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_9
 | |
| 
 | |
| LANGUAGE: shell
 | |
| CODE:
 | |
| ```
 | |
| make clean
 | |
| ```
 | |
| 
 | |
| ----------------------------------------
 | |
| 
 | |
| TITLE: Rust: Load Tokenizer from Hugging Face Hub
 | |
| DESCRIPTION: This function provides a utility to load a tokenizer from the Hugging Face Hub. It takes a model name, creates an API client, retrieves the tokenizer file from the specified repository, and constructs a `Tokenizer` object from it. This is a common pattern for integrating Hugging Face models.
 | |
| 
 | |
| SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/rust/llm_dataset_creation.md#_snippet_3
 | |
| 
 | |
| LANGUAGE: Rust
 | |
| CODE:
 | |
| ```
 | |
| fn load_tokenizer(model_name: &str) -> Result<Tokenizer, Box<dyn Error + Send + Sync>> {
 | |
|     let api = Api::new()?;
 | |
|     let repo = api.repo(Repo::with_revision(
 | |
|         model_name.into(),
 | |
|         RepoType::Model,
 | |
|         "main".into(),
 | |
|     ));
 | |
| 
 | |
|     let tokenizer_path = repo.get("tokenizer.json")?;
 | |
|     let tokenizer = Tokenizer::from_file(tokenizer_path)?;
 | |
| 
 | |
|     Ok(tokenizer)
 | |
| }
 | |
| ```
 | |
| 
 | |
| ----------------------------------------
 | |
| 
 | |
| TITLE: Build macOS x86_64 Wheels
 | |
| DESCRIPTION: This command builds release-mode wheels specifically for x86_64 macOS. It uses `maturin` to compile the project for the `x86_64-apple-darwin` target, storing the resulting wheels in the `wheels` directory.
 | |
| 
 | |
| SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_26
 | |
| 
 | |
| LANGUAGE: Shell
 | |
| CODE:
 | |
| ```
 | |
| maturin build --release \
 | |
|     --target x86_64-apple-darwin \
 | |
|     --out wheels
 | |
| ```
 | |
| 
 | |
| ----------------------------------------
 | |
| 
 | |
| TITLE: Build and Test Lance Rust Package
 | |
| DESCRIPTION: These commands clone the Lance repository, navigate to the Rust directory, and then build, test, and benchmark the core Rust components of Lance.
 | |
| 
 | |
| SOURCE: https://github.com/lancedb/lance/blob/main/__wiki__/How-to-Build.md#_snippet_2
 | |
| 
 | |
| LANGUAGE: bash
 | |
| CODE:
 | |
| ```
 | |
| git clone https://github.com/lancedb/lance.git
 | |
| 
 | |
| # Build rust package
 | |
| cd lance/rust
 | |
| cargo build
 | |
| 
 | |
| # Run test
 | |
| cargo test
 | |
| 
 | |
| # Run benchmarks
 | |
| cargo bench
 | |
| ```
 | |
| 
 | |
| ----------------------------------------
 | |
| 
 | |
| TITLE: Build Lance in development mode
 | |
| DESCRIPTION: Builds the Rust native module in place using `maturin`. This command needs to be re-run whenever Rust code changes, but is not required for Python code modifications.
 | |
| 
 | |
| SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_0
 | |
| 
 | |
| LANGUAGE: shell
 | |
| CODE:
 | |
| ```
 | |
| maturin develop
 | |
| ```
 | |
| 
 | |
| ----------------------------------------
 | |
| 
 | |
| TITLE: Download Lindera Language Model
 | |
| DESCRIPTION: Command-line instruction to download a specific Lindera language model (one of ipadic, ko-dic, or unidic) for Lance. Note that `lindera-cli` must be installed beforehand because Lindera models require compilation.
 | |
| 
 | |
| SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/tokenizer.md#_snippet_4
 | |
| 
 | |
| LANGUAGE: bash
 | |
| CODE:
 | |
| ```
 | |
| python -m lance.download lindera -l [ipadic|ko-dic|unidic]
 | |
| ```
 | |
| 
 | |
| ----------------------------------------
 | |
| 
 | |
| TITLE: Decorate Rust Unit Test for Tracing
 | |
| DESCRIPTION: To enable tracing for a Rust unit test, decorate it with the `#[lance_test_macros::test]` attribute. This macro wraps any existing test attributes, allowing tracing information to be collected during test execution.
 | |
| 
 | |
| SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_16
 | |
| 
 | |
| LANGUAGE: Rust
 | |
| CODE:
 | |
| ```
 | |
| #[lance_test_macros::test(tokio::test)]
 | |
| async fn test() {
 | |
|     ...
 | |
| }
 | |
| ```
 | |
| 
 | |
| ----------------------------------------
 | |
| 
 | |
| TITLE: Add Rust Toolchain Targets for Cross-Compilation
 | |
| DESCRIPTION: To build manylinux wheels for different Linux architectures, you must first add the corresponding Rust toolchain targets. These commands add the x86_64 and aarch64 GNU targets, enabling cross-compilation.
 | |
| 
 | |
| SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_22
 | |
| 
 | |
| LANGUAGE: Shell
 | |
| CODE:
 | |
| ```
 | |
| rustup target add x86_64-unknown-linux-gnu
 | |
| rustup target add aarch64-unknown-linux-gnu
 | |
| ```
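When building for several targets, it can help to confirm which platform a given binary was actually compiled for. The following is a small standard-library sketch; `describe_target` is a hypothetical helper introduced here and is not part of the Lance codebase. Both `std::env::consts` and `cfg!` reflect the compilation target, not the host running the build.

```rust
use std::env::consts::{ARCH, OS};

/// Describe the platform this binary was compiled for, e.g. "aarch64-linux".
fn describe_target() -> String {
    format!("{}-{}", ARCH, OS)
}

fn main() {
    // `cfg!` is evaluated at compile time against the target platform.
    if cfg!(all(target_arch = "aarch64", target_os = "linux")) {
        println!("built for ARM64 Linux");
    }
    println!("compiled for {}", describe_target());
}
```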
 | |
| 
 | |
| ----------------------------------------
 | |
| 
 | |
| TITLE: Build macOS ARM64 Wheels
 | |
| DESCRIPTION: This command builds release-mode wheels specifically for ARM64 (aarch64) macOS. It uses `maturin` to compile the project for the `aarch64-apple-darwin` target, storing the resulting wheels in the `wheels` directory.
 | |
| 
 | |
| SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_25
 | |
| 
 | |
| LANGUAGE: Shell
 | |
| CODE:
 | |
| ```
 | |
| maturin build --release \
 | |
|     --target aarch64-apple-darwin \
 | |
|     --out wheels
 | |
| ```
 | |
| 
 | |
| ----------------------------------------
 | |
| 
 | |
| TITLE: Rust: WikiTextBatchReader Next Batch Logic
 | |
| DESCRIPTION: This snippet shows the core logic for the `next` method of the `WikiTextBatchReader`. It attempts to build and retrieve the next Parquet reader from a list of available readers. If a reader is successfully built, it's used; otherwise, it handles errors or indicates that no more readers are available.
 | |
| 
 | |
| SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/rust/llm_dataset_creation.md#_snippet_1
 | |
| 
 | |
| LANGUAGE: Rust
 | |
| CODE:
 | |
| ```
 | |
|                 if let Some(builder) = self.parquet_readers[self.current_reader_idx].take() {
 | |
|                     match builder.build() {
 | |
|                         Ok(reader) => {
 | |
|                             self.current_reader = Some(Box::new(reader));
 | |
|                             self.current_reader_idx += 1;
 | |
|                             continue;
 | |
|                         }
 | |
|                         Err(e) => {
 | |
|                             return Some(Err(arrow::error::ArrowError::ExternalError(Box::new(e))))
 | |
|                         }
 | |
|                     }
 | |
|                 }
 | |
|             }
 | |
| 
 | |
|             // No more readers available
 | |
|             return None;
 | |
|         }
 | |
| ```
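The fragment above comes from the middle of a loop, so the overall control flow is hard to see. The sketch below shows the same idea in a self-contained form under stated assumptions: `ChainedReader` is a hypothetical type, each `Vec<u32>` stands in for a Parquet reader builder, and turning it into an iterator stands in for `builder.build()`. Each source is taken out of its `Option` slot and built only when the previous reader is exhausted.

```rust
/// Yields items from several sources in order, building each underlying
/// reader only when the previous one has been drained.
struct ChainedReader {
    // Each slot holds a not-yet-built source; `take()` leaves `None` behind.
    sources: Vec<Option<Vec<u32>>>,
    current: Option<Box<dyn Iterator<Item = u32>>>,
    next_idx: usize,
}

impl ChainedReader {
    fn new(sources: Vec<Vec<u32>>) -> Self {
        ChainedReader {
            sources: sources.into_iter().map(Some).collect(),
            current: None,
            next_idx: 0,
        }
    }
}

impl Iterator for ChainedReader {
    type Item = u32;

    fn next(&mut self) -> Option<u32> {
        loop {
            // Drain the active reader first.
            if let Some(reader) = self.current.as_mut() {
                if let Some(item) = reader.next() {
                    return Some(item);
                }
                // Exhausted: drop it and move on to the next source.
                self.current = None;
            }
            // No more sources available: the whole stream is finished.
            if self.next_idx >= self.sources.len() {
                return None;
            }
            // "Build" the next reader, then retry the loop.
            if let Some(source) = self.sources[self.next_idx].take() {
                self.current = Some(Box::new(source.into_iter()));
            }
            self.next_idx += 1;
        }
    }
}

fn main() {
    let reader = ChainedReader::new(vec![vec![1, 2], vec![], vec![3]]);
    let items: Vec<u32> = reader.collect();
    println!("{:?}", items); // prints "[1, 2, 3]"
}
```

Storing the sources as `Option` values lets `take()` move each one out of the vector without cloning it or shifting the remaining elements.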
 | |
| 
 | |
| ----------------------------------------
 | |
| 
 | |
| TITLE: Run Rust Unit Test with Tracing Verbosity
 | |
| DESCRIPTION: Execute a Rust unit test with tracing enabled by setting the `LANCE_TESTING` environment variable to a desired verbosity level (e.g., 'debug', 'info'). This command will generate a JSON trace file in your working directory, which can be viewed in Chrome or Perfetto.
 | |
| 
 | |
| SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_17
 | |
| 
 | |
| LANGUAGE: Bash
 | |
| CODE:
 | |
| ```
 | |
| LANCE_TESTING=debug cargo test dataset::tests::test_create_dataset
 | |
| ```
 | |
| 
 | |
| ----------------------------------------
 | |
| 
 | |
| TITLE: Build Linux x86_64 Manylinux Wheels
 | |
| DESCRIPTION: This command builds release-mode manylinux wheels for x86_64 Linux. It utilizes `maturin` with `zig` for cross-compilation, targeting `manylinux2014` compatibility, and outputs the generated wheels to the 'wheels' directory.
 | |
| 
 | |
| SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_23
 | |
| 
 | |
| LANGUAGE: Shell
 | |
| CODE:
 | |
| ```
 | |
| maturin build --release --zig \
 | |
|     --target x86_64-unknown-linux-gnu \
 | |
|     --compatibility manylinux2014 \
 | |
|     --out wheels
 | |
| ```
 | |
| 
 | |
| ----------------------------------------
 | |
| 
 | |
| TITLE: Build Linux ARM64 Manylinux Wheels
 | |
| DESCRIPTION: This command builds release-mode manylinux wheels for ARM64 (aarch64) Linux. It uses `maturin` with `zig` for cross-compilation, targeting `manylinux2014` compatibility, and places the output wheels in the 'wheels' directory.
 | |
| 
 | |
| SOURCE: https://github.com/lancedb/lance/blob/main/python/DEVELOPMENT.md#_snippet_24
 | |
| 
 | |
| LANGUAGE: Shell
 | |
| CODE:
 | |
| ```
 | |
| maturin build --release --zig \
 | |
|     --target aarch64-unknown-linux-gnu \
 | |
|     --compatibility manylinux2014 \
 | |
|     --out wheels
 | |
| ```
 | |
| 
 | |
| ----------------------------------------
 | |
| 
 | |
| TITLE: Join Multiple Lance Datasets with SQL in Rust DataFusion
 | |
| DESCRIPTION: This Rust example illustrates how to register multiple Lance datasets (e.g., 'orders' and 'customers') as separate tables in DataFusion. It then performs a SQL `JOIN` operation between these tables to combine data based on a common key, demonstrating more complex query capabilities.
 | |
| 
 | |
| SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/integrations/datafusion.md#_snippet_1
 | |
| 
 | |
| LANGUAGE: rust
 | |
| CODE:
 | |
| ```
 | |
| use std::sync::Arc;
 | |
| use datafusion::prelude::SessionContext;
 | |
| use crate::datafusion::LanceTableProvider;
 | |
| 
 | |
| let ctx = SessionContext::new();
 | |
| 
 | |
| ctx.register_table("orders",
 | |
|     Arc::new(LanceTableProvider::new(
 | |
|     Arc::new(orders_dataset.clone()),
 | |
|     /* with_row_id */ false,
 | |
|     /* with_row_addr */ false,
 | |
|     )))?;
 | |
| 
 | |
| ctx.register_table("customers",
 | |
|     Arc::new(LanceTableProvider::new(
 | |
|     Arc::new(customers_dataset.clone()),
 | |
|     /* with_row_id */ false,
 | |
|     /* with_row_addr */ false,
 | |
|     )))?;
 | |
| 
 | |
| let df = ctx.sql("
 | |
|     SELECT o.order_id, o.amount, c.customer_name 
 | |
|     FROM orders o 
 | |
|     JOIN customers c ON o.customer_id = c.customer_id
 | |
|     LIMIT 10
 | |
| ").await?;
 | |
| 
 | |
| let result = df.collect().await?;
 | |
| ```
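To make the semantics of the query concrete, here is a small std-only sketch of an inner join on `customer_id` followed by a row limit. It does not show how DataFusion plans or executes the query. `Order`, `Customer`, and `join_orders` are hypothetical names, and the sketch assumes `customer_id` is unique among customers.

```rust
use std::collections::HashMap;

// Hypothetical row types mirroring the columns named in the SQL query.
struct Order {
    order_id: u32,
    customer_id: u32,
    amount: f64,
}

struct Customer {
    customer_id: u32,
    customer_name: String,
}

/// Inner-join orders to customers on `customer_id`, keeping at most `limit` rows.
fn join_orders(
    orders: &[Order],
    customers: &[Customer],
    limit: usize,
) -> Vec<(u32, f64, String)> {
    // Build side: index customers by the join key.
    let by_id: HashMap<u32, &Customer> =
        customers.iter().map(|c| (c.customer_id, c)).collect();

    // Probe side: orders without a matching customer are dropped (inner join).
    orders
        .iter()
        .filter_map(|o| {
            by_id
                .get(&o.customer_id)
                .map(|c| (o.order_id, o.amount, c.customer_name.clone()))
        })
        .take(limit)
        .collect()
}
```

The sketch keeps rows in input order. The SQL query has no `ORDER BY`, so which ten rows `LIMIT 10` returns is not guaranteed.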
 | |
| 
 | |
| ----------------------------------------
 | |
| 
 | |
| TITLE: Generate Flame Graph from Process ID
 | |
| DESCRIPTION: Generates a flame graph for a running process using its Process ID (PID). This command is used to capture and visualize CPU profiles, helping to identify performance bottlenecks in an application.
 | |
| 
 | |
| SOURCE: https://github.com/lancedb/lance/blob/__wiki__/Debug.md#_snippet_5
 | |
| 
 | |
| LANGUAGE: sh
 | |
| CODE:
 | |
| ```
 | |
| flamegraph -p <PID>
 | |
| ```
 | |
| 
 | |
| ----------------------------------------
 | |
| 
 | |
| TITLE: Clone Lance GitHub Repository
 | |
| DESCRIPTION: Clones the Lance repository from GitHub to a local machine. This is the first step in setting up the development environment.
 | |
| 
 | |
| SOURCE: https://github.com/lancedb/lance/blob/main/java/README.md#_snippet_11
 | |
| 
 | |
| LANGUAGE: shell
 | |
| CODE:
 | |
| ```
 | |
| git clone https://github.com/lancedb/lance.git
 | |
| ```
 | |
| 
 | |
| ----------------------------------------
 | |
| 
 | |
| TITLE: Rust Implementation of WikiTextBatchReader
 | |
| DESCRIPTION: This Rust code defines `WikiTextBatchReader`, a custom implementation of `arrow::record_batch::RecordBatchReader`. It reads text from Parquet files, tokenizes each line with a `Tokenizer` from the `tokenizers` crate, and emits Arrow `RecordBatch`es with a single `input_ids` list column. The `process_batch` method stops once `num_samples` lines have been tokenized (default 100,000), then shuffles the rows of the batch with a fixed seed before building the final `RecordBatch`. The listing ends partway through `next`.
 | |
| 
 | |
| SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/rust/llm_dataset_creation.md#_snippet_0
 | |
| 
 | |
| LANGUAGE: rust
 | |
| CODE:
 | |
| ```
 | |
| use arrow::array::{Array, Int64Builder, ListBuilder, UInt32Array};
 | |
| use arrow::datatypes::{DataType, Field, Schema};
 | |
| use arrow::record_batch::RecordBatch;
 | |
| use arrow::record_batch::RecordBatchReader;
 | |
| use futures::StreamExt;
 | |
| use hf_hub::{api::sync::Api, Repo, RepoType};
 | |
| use lance::dataset::WriteParams;
 | |
| use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
 | |
| use rand::seq::SliceRandom;
 | |
| use rand::SeedableRng;
 | |
| use std::error::Error;
 | |
| use std::fs::File;
 | |
| use std::io::Write;
 | |
| use std::sync::Arc;
 | |
| use tempfile::NamedTempFile;
 | |
| use tokenizers::Tokenizer;
 | |
| 
 | |
| // Implement a custom stream batch reader
 | |
| struct WikiTextBatchReader {
 | |
|     schema: Arc<Schema>,
 | |
|     parquet_readers: Vec<Option<ParquetRecordBatchReaderBuilder<File>>>,
 | |
|     current_reader_idx: usize,
 | |
|     current_reader: Option<Box<dyn RecordBatchReader + Send>>,
 | |
|     tokenizer: Tokenizer,
 | |
|     num_samples: u64,
 | |
|     cur_samples_cnt: u64,
 | |
| }
 | |
| 
 | |
| impl WikiTextBatchReader {
 | |
|     fn new(
 | |
|         parquet_readers: Vec<ParquetRecordBatchReaderBuilder<File>>,
 | |
|         tokenizer: Tokenizer,
 | |
|         num_samples: Option<u64>,
 | |
|     ) -> Result<Self, Box<dyn Error + Send + Sync>> {
 | |
|         let schema = Arc::new(Schema::new(vec![Field::new(
 | |
|             "input_ids",
 | |
|             DataType::List(Arc::new(Field::new("item", DataType::Int64, true))),
 | |
|             false,
 | |
|         )]));
 | |
| 
 | |
|         Ok(Self {
 | |
|             schema,
 | |
|             parquet_readers: parquet_readers.into_iter().map(Some).collect(),
 | |
|             current_reader_idx: 0,
 | |
|             current_reader: None,
 | |
|             tokenizer,
 | |
|             num_samples: num_samples.unwrap_or(100_000),
 | |
|             cur_samples_cnt: 0,
 | |
|         })
 | |
|     }
 | |
| 
 | |
|     fn process_batch(
 | |
|         &mut self,
 | |
|         input_batch: &RecordBatch,
 | |
|     ) -> Result<RecordBatch, arrow::error::ArrowError> {
 | |
|         let num_rows = input_batch.num_rows();
 | |
|         let mut token_builder = ListBuilder::new(Int64Builder::with_capacity(num_rows * 1024)); // Pre-allocate space
 | |
|         let mut should_break = false;
 | |
| 
 | |
|         let column = input_batch.column_by_name("text").unwrap();
 | |
|         let string_array = column
 | |
|             .as_any()
 | |
|             .downcast_ref::<arrow::array::StringArray>()
 | |
|             .unwrap();
 | |
|         for i in 0..num_rows {
 | |
|             if self.cur_samples_cnt >= self.num_samples {
 | |
|                 should_break = true;
 | |
|                 break;
 | |
|             }
 | |
|             if !Array::is_null(string_array, i) {
 | |
|                 let text = string_array.value(i);
 | |
|                 // Split paragraph into lines
 | |
|                 for line in text.split('\n') {
 | |
|                     if let Ok(encoding) = self.tokenizer.encode(line, true) {
 | |
|                         let tb_values = token_builder.values();
 | |
|                         for &id in encoding.get_ids() {
 | |
|                             tb_values.append_value(id as i64);
 | |
|                         }
 | |
|                         token_builder.append(true);
 | |
|                         self.cur_samples_cnt += 1;
 | |
|                         if self.cur_samples_cnt % 5000 == 0 {
 | |
|                             println!("Processed {} rows", self.cur_samples_cnt);
 | |
|                         }
 | |
|                         if self.cur_samples_cnt >= self.num_samples {
 | |
|                             should_break = true;
 | |
|                             break;
 | |
|                         }
 | |
|                     }
 | |
|                 }
 | |
|             }
 | |
|         }
 | |
| 
 | |
|         // Create array and shuffle it
 | |
|         let input_ids_array = token_builder.finish();
 | |
| 
 | |
|         // Create shuffled array by randomly sampling indices
 | |
|         let mut rng = rand::rngs::StdRng::seed_from_u64(1337);
 | |
|         let len = input_ids_array.len();
 | |
|         let mut indices: Vec<u32> = (0..len as u32).collect();
 | |
|         indices.shuffle(&mut rng);
 | |
| 
 | |
|         // Take values in shuffled order
 | |
|         let indices_array = UInt32Array::from(indices);
 | |
|         let shuffled = arrow::compute::take(&input_ids_array, &indices_array, None)?;
 | |
| 
 | |
|         let batch = RecordBatch::try_new(self.schema.clone(), vec![Arc::new(shuffled)]);
 | |
|         if should_break {
 | |
|             println!("Stop at {} rows", self.cur_samples_cnt);
 | |
|             self.parquet_readers.clear();
 | |
|             self.current_reader = None;
 | |
|         }
 | |
| 
 | |
|         batch
 | |
|     }
 | |
| }
 | |
| 
 | |
| impl RecordBatchReader for WikiTextBatchReader {
 | |
|     fn schema(&self) -> Arc<Schema> {
 | |
|         self.schema.clone()
 | |
|     }
 | |
| }
 | |
| 
 | |
| impl Iterator for WikiTextBatchReader {
 | |
|     type Item = Result<RecordBatch, arrow::error::ArrowError>;
 | |
|     fn next(&mut self) -> Option<Self::Item> {
 | |
|         loop {
 | |
|             // If we have a current reader, try to get next batch
 | |
|             if let Some(reader) = &mut self.current_reader {
 | |
|                 if let Some(batch_result) = reader.next() {
 | |
|                     return Some(batch_result.and_then(|batch| self.process_batch(&batch)));
 | |
|                 }
 | |
|             }
 | |
| 
 | |
|             // If no current reader or current reader is exhausted, try to get next reader
 | |
|             if self.current_reader_idx < self.parquet_readers.len() {
 | |
| ```
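The shuffle step in `process_batch` permutes a list of row indices and then gathers rows in that order, so whole token sequences move while the tokens inside each sequence keep their order. The sketch below shows that idea with the standard library only. `XorShift64`, `shuffled_indices`, and `take_rows` are hypothetical names, and the xorshift generator is a stand-in for `rand::rngs::StdRng`: it does not produce the same permutation, and its slight modulo bias is ignored.

```rust
// Tiny deterministic generator, used only so the sketch has no dependencies.
struct XorShift64(u64);

impl XorShift64 {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Fisher-Yates shuffle of `0..len`, deterministic for a given seed.
fn shuffled_indices(len: usize, seed: u64) -> Vec<usize> {
    let mut rng = XorShift64(seed.max(1)); // xorshift state must not be zero
    let mut idx: Vec<usize> = (0..len).collect();
    for i in (1..len).rev() {
        let j = (rng.next_u64() % (i as u64 + 1)) as usize;
        idx.swap(i, j);
    }
    idx
}

/// Gather rows in the order given by `indices`, in the spirit of a `take` kernel.
fn take_rows<T: Clone>(rows: &[T], indices: &[usize]) -> Vec<T> {
    indices.iter().map(|&i| rows[i].clone()).collect()
}
```

Because the listing re-creates its generator from the same seed on every call, two batches with the same number of rows receive the same permutation. The sketch has the same property: `shuffled_indices(n, 1337)` always returns the same order for a given `n`.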
 | |
| 
 | |
| ----------------------------------------
 | |
| 
 | |
| TITLE: Set DYLD_LIBRARY_PATH for Lance Python Debugging in LLDB
 | |
| DESCRIPTION: Configures the `DYLD_LIBRARY_PATH` environment variable specifically for debugging Lance Python projects within LLDB. This ensures that the dynamic linker can find necessary shared libraries located in the third-party distribution directory.
 | |
| 
 | |
| SOURCE: https://github.com/lancedb/lance/blob/__wiki__/Debug.md#_snippet_1
 | |
| 
 | |
| LANGUAGE: lldb
 | |
| CODE:
 | |
| ```
 | |
| # /path/to/lance/python/.lldbinit
 | |
| env DYLD_LIBRARY_PATH=/path/to/thirdparty/dist/lib:${DYLD_LIBRARY_PATH}
 | |
| ```
 | |
| 
 | |
| ----------------------------------------
 | |
| 
 | |
| TITLE: Download and extract MeCab Ipadic model
 | |
| DESCRIPTION: This snippet downloads the gzipped tarball of the MeCab Ipadic model from GitHub and then extracts its contents using tar. This is the first step in preparing the dictionary for building.
 | |
| 
 | |
| SOURCE: https://github.com/lancedb/lance/blob/main/python/python/tests/models/lindera/README.md#_snippet_0
 | |
| 
 | |
| LANGUAGE: bash
 | |
| CODE:
 | |
| ```
 | |
| curl -L -o mecab-ipadic-2.7.0-20070801.tar.gz "https://github.com/lindera-morphology/mecab-ipadic/archive/refs/tags/2.7.0-20070801.tar.gz"
 | |
| tar xvf mecab-ipadic-2.7.0-20070801.tar.gz
 | |
| ```
 | |
| 
 | |
| ----------------------------------------
 | |
| 
 | |
| TITLE: Build user dictionary with Lindera
 | |
| DESCRIPTION: This command demonstrates how to build a custom user dictionary using 'lindera build'. It takes a CSV file as input and creates a new user dictionary, which can be used to extend the base language model.
 | |
| 
 | |
| SOURCE: https://github.com/lancedb/lance/blob/main/python/python/tests/models/lindera/README.md#_snippet_2
 | |
| 
 | |
| LANGUAGE: bash
 | |
| CODE:
 | |
| ```
 | |
| lindera build --build-user-dictionary --dictionary-kind=ipadic user_dict/userdict.csv user_dict2
 | |
| ```
 | |
| 
 | |
| ----------------------------------------
 | |
| 
 | |
| TITLE: Download Jieba Language Model
 | |
| DESCRIPTION: Command-line instruction to download the Jieba language model for use with Lance. The model is stored in the default Jieba model directory within the configured language model home.
 | |
| 
 | |
| SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/guide/tokenizer.md#_snippet_1
 | |
| 
 | |
| LANGUAGE: bash
 | |
| CODE:
 | |
| ```
 | |
| python -m lance.download jieba
 | |
| ```
 | |
| 
 | |
| ----------------------------------------
 | |
| 
 | |
| TITLE: Read and Inspect Lance Dataset in Rust
 | |
| DESCRIPTION: This Rust function `read_dataset` shows how to open an existing Lance dataset from a given path. It uses a `scanner` to create a `batch_stream` and then iterates through each `RecordBatch`, printing its number of rows, columns, schema, and the entire batch content.
 | |
| 
 | |
| SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/rust/write_read_dataset.md#_snippet_1
 | |
| 
 | |
| LANGUAGE: Rust
 | |
| CODE:
 | |
| ```
 | |
| // Reads the dataset at the given path and prints the size, schema, and contents of every record batch
 | |
| async fn read_dataset(data_path: &str) {
 | |
|     let dataset = Dataset::open(data_path).await.unwrap();
 | |
|     let scanner = dataset.scan();
 | |
| 
 | |
|     let mut batch_stream = scanner.try_into_stream().await.unwrap().map(|b| b.unwrap());
 | |
| 
 | |
|     while let Some(batch) = batch_stream.next().await {
 | |
|         println!("Batch size: {}, {}", batch.num_rows(), batch.num_columns()); // print size of batch
 | |
|         println!("Schema: {:?}", batch.schema()); // print schema of recordbatch
 | |
| 
 | |
|         println!("Batch: {:?}", batch); // print the entire recordbatch (schema and data)
 | |
|     }
 | |
| } // End read dataset
 | |
| ```
 | |
| 
 | |
| ----------------------------------------
 | |
| 
 | |
| TITLE: Create a Lance Dataset from Arrow RecordBatches in Rust
 | |
| DESCRIPTION: Demonstrates how to write a collection of Arrow RecordBatches and an Arrow Schema into a new Lance Dataset. It uses default write parameters and an iterator for the batches.
 | |
| 
 | |
| SOURCE: https://github.com/lancedb/lance/blob/main/rust/lance/README.md#_snippet_1
 | |
| 
 | |
| LANGUAGE: rust
 | |
| CODE:
 | |
| ```
 | |
| use arrow::record_batch::RecordBatchIterator;
 | |
| use lance::{dataset::WriteParams, Dataset};
 | |
| 
 | |
| let write_params = WriteParams::default();
 | |
| let mut reader = RecordBatchIterator::new(
 | |
|     batches.into_iter().map(Ok),
 | |
|     schema
 | |
| );
 | |
| Dataset::write(reader, &uri, Some(write_params)).await.unwrap();
 | |
| ```
 | |
| 
 | |
| ----------------------------------------
 | |
| 
 | |
| TITLE: Build Ipadic language model with Lindera
 | |
| DESCRIPTION: This command uses the 'lindera build' tool to compile the Ipadic dictionary. It specifies the dictionary kind as 'ipadic' and points to the extracted model directory to create the main dictionary.
 | |
| 
 | |
| SOURCE: https://github.com/lancedb/lance/blob/main/python/python/tests/models/lindera/README.md#_snippet_1
 | |
| 
 | |
| LANGUAGE: bash
 | |
| CODE:
 | |
| ```
 | |
| lindera build --dictionary-kind=ipadic mecab-ipadic-2.7.0-20070801 main
 | |
| ```
 | |
| 
 | |
| ----------------------------------------
 | |
| 
 | |
| TITLE: Write Lance Dataset in Rust
 | |
| DESCRIPTION: This Rust function `write_dataset` demonstrates how to create and write a Lance dataset to a specified path. It defines a schema with `UInt32` fields, creates a `RecordBatch` with sample data, and uses `WriteParams` to set the write mode to `Overwrite` before writing the dataset to disk.
 | |
| 
 | |
| SOURCE: https://github.com/lancedb/lance/blob/main/docs/src/examples/rust/write_read_dataset.md#_snippet_0
 | |
| 
 | |
| LANGUAGE: Rust
 | |
| CODE:
 | |
| ```
 | |
| // Writes sample dataset to the given path
 | |
| async fn write_dataset(data_path: &str) {
 | |
|     // Define new schema
 | |
|     let schema = Arc::new(Schema::new(vec![
 | |
|         Field::new("key", DataType::UInt32, false),
 | |
|         Field::new("value", DataType::UInt32, false),
 | |
|     ]));
 | |
| 
 | |
|     // Create new record batches
 | |
|     let batch = RecordBatch::try_new(
 | |
|         schema.clone(),
 | |
|         vec![
 | |
|             Arc::new(UInt32Array::from(vec![1, 2, 3, 4, 5, 6])),
 | |
|             Arc::new(UInt32Array::from(vec![6, 7, 8, 9, 10, 11])),
 | |
|         ],
 | |
|     )
 | |
|     .unwrap();
 | |
| 
 | |
|     let batches = RecordBatchIterator::new([Ok(batch)], schema.clone());
 | |
| 
 | |
|     // Define write parameters (e.g. overwrite dataset)
 | |
|     let write_params = WriteParams {
 | |
|         mode: WriteMode::Overwrite,
 | |
|         ..Default::default()
 | |
|     };
 | |
| 
 | |
|     Dataset::write(batches, data_path, Some(write_params))
 | |
|         .await
 | |
|         .unwrap();
 | |
| } // End write dataset
 | |
| ```
 | |
| 
 | |
| ----------------------------------------
 | |
| 
 | |
| TITLE: Build Lance Rust JNI Module
 | |
| DESCRIPTION: Builds only the Rust-based JNI (Java Native Interface) module of Lance. This is useful for developers working on the native components who do not need to rebuild the entire Java project.
 | |
| 
 | |
| SOURCE: https://github.com/lancedb/lance/blob/main/java/README.md#_snippet_10
 | |
| 
 | |
| LANGUAGE: shell
 | |
| CODE:
 | |
| ```
 | |
| cargo build
 | |
| ```
 | |
| 
 | |
| ----------------------------------------
 | |
| 
 | |
| TITLE: Read a Lance Dataset and Collect RecordBatches in Rust
 | |
| DESCRIPTION: Opens an existing Lance Dataset from a specified path, scans its content, and collects all resulting RecordBatches into a vector. Errors are not handled: every fallible step calls `unwrap()`, so any failure panics.
 | |
| 
 | |
| SOURCE: https://github.com/lancedb/lance/blob/main/rust/lance/README.md#_snippet_2
 | |
| 
 | |
| LANGUAGE: rust
 | |
| CODE:
 | |
| ```
 | |
| let dataset = Dataset::open(path).await.unwrap();
 | |
| let mut scanner = dataset.scan();
 | |
| let batches: Vec<RecordBatch> = scanner
 | |
|     .try_into_stream()
 | |
|     .await
 | |
|     .unwrap()
 | |
|     .map(|b| b.unwrap())
 | |
|     .collect::<Vec<RecordBatch>>()
 | |
|     .await;
 | |
| ```
 | |
| 
 | |
| ----------------------------------------
 | |
| 
 | |
| TITLE: Create a Vector Index on a Lance Dataset in Rust
 | |
| DESCRIPTION: Demonstrates how to create a vector index on a specified column (e.g., 'embeddings') within a Lance Dataset. It configures vector index parameters like the number of partitions and sub-vectors, noting potential alignment requirements.
 | |
| 
 | |
| SOURCE: https://github.com/lancedb/lance/blob/main/rust/lance/README.md#_snippet_4
 | |
| 
 | |
| LANGUAGE: rust
 | |
| CODE:
 | |
| ```
 | |
| use ::lance::index::vector::VectorIndexParams;
 | |
| 
 | |
| let mut params = VectorIndexParams::default();
 | |
| params.num_partitions = 256;
 | |
| params.num_sub_vectors = 16;
 | |
| 
 | |
| // this will Err if list_size(embeddings) / num_sub_vectors does not meet simd alignment
 | |
| dataset.create_index(&["embeddings"], IndexType::Vector, None, &params, true).await;
 | |
| ```
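The comment in the snippet says index creation fails when `list_size(embeddings) / num_sub_vectors` does not meet SIMD alignment. The arithmetic can be checked up front. The sketch below is std-only and is not Lance's validation code: `sub_vector_dim` is a hypothetical helper, and `simd_width` is an assumed parameter because the snippet does not state which width Lance requires.

```rust
/// Width of each sub-vector, or an error if the split is not usable.
/// `simd_width` is an assumption of this sketch, not a documented Lance value.
fn sub_vector_dim(
    list_size: usize,
    num_sub_vectors: usize,
    simd_width: usize,
) -> Result<usize, String> {
    if num_sub_vectors == 0 || simd_width == 0 {
        return Err("num_sub_vectors and simd_width must be non-zero".to_string());
    }
    // The vector must split evenly into sub-vectors.
    if list_size % num_sub_vectors != 0 {
        return Err(format!(
            "list size {} is not divisible by {} sub-vectors",
            list_size, num_sub_vectors
        ));
    }
    let dim = list_size / num_sub_vectors;
    // Each sub-vector must fill whole SIMD lanes under the assumed width.
    if dim % simd_width != 0 {
        return Err(format!(
            "sub-vector width {} is not a multiple of {}",
            dim, simd_width
        ));
    }
    Ok(dim)
}
```

With the values from the snippet, a 1024-dimensional `embeddings` column and `num_sub_vectors = 16` give sub-vectors of width 64, which passes for an assumed width of 8. A 96-dimensional column with the same settings gives width 6 and is rejected under that assumption.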
 | |
| 
 | |
| ----------------------------------------
 | |
| 
 | |
| TITLE: Retrieve Specific Records from a Lance Dataset in Rust
 | |
| DESCRIPTION: Retrieves specific records from a Lance Dataset based on their indices and a projection. The result is a RecordBatch containing the requested data.
 | |
| 
 | |
| SOURCE: https://github.com/lancedb/lance/blob/main/rust/lance/README.md#_snippet_3
 | |
| 
 | |
| LANGUAGE: rust
 | |
| CODE:
 | |
| ```
 | |
| let values: Result<RecordBatch> = dataset.take(&[200, 199, 39, 40, 100], &projection).await;
 | |
| ``` |