herolib/libarchive/dedupestor/dedupe_ourdb/README.md
2025-10-13 05:36:06 +04:00


# DedupeStore
DedupeStore is a content-addressable key-value store with built-in deduplication. It uses blake2b-160 content hashing to identify and deduplicate data, making it ideal for storing files or data blocks where the same content might appear multiple times.
## Features
- Content-based deduplication using blake2b-160 hashing
- Efficient storage using RadixTree for hash lookups
- Persistent storage using OurDB
- Maximum value size limit of 1MB
- Fast retrieval of data using content hash
- Automatic deduplication of identical content
## Usage
```v
import incubaid.herolib.data.dedupestor
// Create a new dedupestore
mut ds := dedupestor.new(
	path:  'path/to/store'
	reset: false // Set to true to reset existing data
)!
// Store some data
data := 'Hello, World!'.bytes()
hash := ds.store(data)!
println('Stored data with hash: ${hash}')
// Retrieve data using hash
retrieved := ds.get(hash)!
println('Retrieved data: ${retrieved.bytestr()}')
// Check if data exists
exists := ds.exists(hash)
println('Data exists: ${exists}')
// Attempting to store the same data again returns the same hash
same_hash := ds.store(data)!
assert hash == same_hash // True, data was deduplicated
```
## Implementation Details
DedupeStore uses two main components for storage:
1. **RadixTree**: Stores mappings from content hashes to data location IDs
2. **OurDB**: Stores the actual data blocks

When storing data:
1. The data is hashed using blake2b-160
2. If the hash exists in the RadixTree, the existing data location is returned
3. If the hash is new:
   - Data is stored in OurDB, getting a new location ID
   - Hash -> ID mapping is stored in RadixTree
   - The hash is returned

When retrieving data:
1. The RadixTree is queried with the hash to get the data location ID
2. The data is retrieved from OurDB using the ID
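
The store and retrieve steps above can be sketched with two in-memory maps standing in for the RadixTree and OurDB. This is a minimal illustration under stated assumptions, not the module's implementation: `SketchStore`, its fields, and `max_value_size` are hypothetical names, and the sketch returns the hash from `store` to match the usage example. It also applies the 1MB size check before hashing.

```v
import crypto.blake2b

// Assumed to mirror the 1MB limit stated in this README.
const max_value_size = 1024 * 1024

// SketchStore is a hypothetical in-memory model, not the real DedupeStore.
struct SketchStore {
mut:
	index   map[string]u32 // stand-in for the RadixTree: hash -> location ID
	blocks  map[u32][]u8   // stand-in for OurDB: location ID -> data
	next_id u32
}

// store hashes the data and writes a new block only for unseen content.
fn (mut s SketchStore) store(data []u8) !string {
	if data.len > max_value_size {
		return error('value size exceeds limit of ${max_value_size} bytes')
	}
	hash := blake2b.sum160(data).hex()
	if hash in s.index {
		return hash // duplicate content: nothing new is written
	}
	id := s.next_id
	s.next_id++
	s.blocks[id] = data.clone()
	s.index[hash] = id
	return hash
}

// get resolves hash -> location ID -> data.
fn (s SketchStore) get(hash string) ![]u8 {
	id := s.index[hash] or { return error('unknown hash') }
	data := s.blocks[id] or { return error('missing block for id ${id}') }
	return data
}

fn main() {
	mut s := SketchStore{}
	h1 := s.store('Hello, World!'.bytes())!
	h2 := s.store('Hello, World!'.bytes())!
	println(h1 == h2) // both calls yield the same hash
	println(s.blocks.len) // only one block is held for the duplicated content
	println(s.get(h1)!.bytestr())
}
```

Keeping the hash index separate from the block storage means a duplicate is detected with a single index lookup, before any data is written.
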
## Size Limits
- Maximum value size: 1MB
- Attempting to store larger values will result in an error
## The Reference Field
In the dedupestor system, the Reference struct is defined with two fields:
```v
pub struct Reference {
pub:
	owner u16
	id    u32
}
```
Each field has a distinct role:
- `owner` (`u16`): Identifies which entity or system component "owns" or is referencing the data. This could represent different applications, users, or subsystems that are using the dedupestor.
- `id` (`u32`): A unique identifier within that owner's domain. This allows each owner to have their own independent numbering system for referencing stored data.

Together, a combination such as `{owner: 1, id: 100}` creates a unique reference that:
- Tracks which entities are referencing a particular piece of data
- Allows the system to know when data can be safely deleted (when no references remain)
- Provides a way for different components to maintain their own ID systems without conflicts

The dedupestor uses these references to implement a reference counting mechanism. When data is stored, a reference is attached to it. When all references to a piece of data are removed (via the delete method), the actual data can be safely deleted from storage.

This design allows for efficient deduplication: if the same data is stored multiple times with different references, it is physically stored only once, but the system keeps track of all the references to it.
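
The reference counting described above can be sketched as follows. This is a hypothetical model, not the module's code: `Entry`, `RefSketch`, and the string key (standing in for a content hash) are assumptions made for illustration, and only the `Reference` struct mirrors the definition shown earlier.

```v
// Reference mirrors the struct shown in this README.
struct Reference {
	owner u16
	id    u32
}

// Entry is a hypothetical record: one stored block plus every reference to it.
struct Entry {
mut:
	data []u8
	refs []Reference
}

// RefSketch is a hypothetical in-memory model keyed by content hash.
struct RefSketch {
mut:
	entries map[string]Entry
}

// store attaches a reference; the data is kept once per key.
fn (mut s RefSketch) store(key string, data []u8, reference Reference) {
	mut e := s.entries[key] or { Entry{
		data: data.clone()
	} }
	if reference !in e.refs {
		e.refs << reference
	}
	s.entries[key] = e
}

// delete removes one reference and reports whether the data itself was dropped.
fn (mut s RefSketch) delete(key string, reference Reference) bool {
	mut e := s.entries[key] or { return false }
	e.refs = e.refs.filter(it != reference)
	if e.refs.len == 0 {
		s.entries.delete(key)
		return true
	}
	s.entries[key] = e
	return false
}

fn main() {
	mut s := RefSketch{}
	first := Reference{
		owner: 1
		id:    100
	}
	second := Reference{
		owner: 2
		id:    7
	}
	s.store('hash-a', 'same bytes'.bytes(), first)
	s.store('hash-a', 'same bytes'.bytes(), second)
	println(s.delete('hash-a', first)) // data kept: the second reference remains
	println(s.delete('hash-a', second)) // last reference gone: data dropped
}
```

Because each owner numbers its own `id` values, two owners can use the same `id` without colliding: the pair is what identifies a reference.
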
## Testing
The module includes comprehensive tests covering:
- Basic store/retrieve operations
- Deduplication functionality
- Size limit enforcement
- Edge cases

Run tests with:
```bash
v test lib/data/dedupestor/