# DedupeStore

DedupeStore is a content-addressable key-value store with built-in deduplication. It uses blake2b-160 content hashing to identify and deduplicate data, making it ideal for storing files or data blocks where the same content might appear multiple times.

## Features

- Content-based deduplication using blake2b-160 hashing
- Efficient storage using RadixTree for hash lookups
- Persistent storage using OurDB
- Maximum value size limit of 1MB
- Fast retrieval of data using content hash
- Automatic deduplication of identical content

## Usage

```v
import incubaid.herolib.data.dedupestor

// Create a new dedupestore
mut ds := dedupestor.new(
	path: 'path/to/store'
	reset: false // set to true to reset existing data
)!

// Store some data
data := 'Hello, World!'.bytes()
hash := ds.store(data)!
println('Stored data with hash: ${hash}')

// Retrieve the data using its hash
retrieved := ds.get(hash)!
println('Retrieved data: ${retrieved.bytestr()}')

// Check if the data exists
exists := ds.exists(hash)
println('Data exists: ${exists}')

// Storing the same data again returns the same hash
same_hash := ds.store(data)!
assert hash == same_hash // true: the data was deduplicated
```

## Implementation Details

DedupeStore uses two main components for storage:

1. **RadixTree**: Stores mappings from content hashes to data location IDs
2. **OurDB**: Stores the actual data blocks

When storing data:

1. The data is hashed using blake2b-160.
2. If the hash already exists in the RadixTree, the content has been stored before; the existing entry is reused and the hash is returned without writing anything new.
3. If the hash is new:
   - Data is stored in OurDB, which returns a new location ID
   - The hash -> ID mapping is stored in the RadixTree
   - The hash is returned

When retrieving data:

1. The RadixTree is queried with the hash to get the data location ID.
2. The data is retrieved from OurDB using that ID.

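The flow above can be illustrated with a small, self-contained sketch. This is only a conceptual illustration, not the module's implementation: a plain V map stands in for the RadixTree index, an array stands in for OurDB, and sha256 from V's standard library stands in for blake2b-160 so the example runs without extra dependencies.

```v
import crypto.sha256

// Toy stand-ins: `index` plays the role of the RadixTree (hash -> location ID),
// `blocks` plays the role of OurDB (location ID -> data block).
struct ToyDedupe {
mut:
	index  map[string]u32
	blocks [][]u8
}

// store hashes the data; identical content is only written once.
fn (mut d ToyDedupe) store(data []u8) string {
	hash := sha256.sum(data).hex()
	if hash in d.index {
		return hash // already stored: reuse the existing block
	}
	d.blocks << data.clone()
	d.index[hash] = u32(d.blocks.len - 1)
	return hash
}

// get looks up the location ID for the hash and returns the stored block.
fn (d ToyDedupe) get(hash string) ![]u8 {
	id := d.index[hash] or { return error('unknown hash') }
	return d.blocks[int(id)]
}

fn main() {
	mut d := ToyDedupe{}
	h1 := d.store('Hello, World!'.bytes())
	h2 := d.store('Hello, World!'.bytes())
	assert h1 == h2 // deduplicated: only one block was written
	content := d.get(h1) or { panic(err) }
	println(content.bytestr())
}
```
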
## Size Limits

- Maximum value size: 1MB
- Attempting to store a larger value results in an error

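As a sketch of how a caller might handle that error (reusing the constructor parameters shown in the Usage section; the store path and printed message below are placeholders, not part of the module):

```v
import incubaid.herolib.data.dedupestor

mut ds := dedupestor.new(
	path: '/tmp/dedupe_demo' // placeholder path
	reset: true
)!

// 1MB plus one byte: just over the documented limit
too_big := []u8{len: 1024 * 1024 + 1, init: u8(0)}
big_hash := ds.store(too_big) or {
	println('store rejected the oversized value: ${err}')
	exit(0)
}
println('unexpectedly stored: ${big_hash}')
```
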
## The Reference Field

In the dedupestor system, the `Reference` struct is defined with two fields:

```v
pub struct Reference {
pub:
	owner u16
	id    u32
}
```

The purpose of the `id` field in this context is to serve as an identifier within a specific owner's domain. Here is what each field represents:

- `owner` (u16): identifies which entity or system component "owns" or is referencing the data. This could represent different applications, users, or subsystems that use the dedupestor.
- `id` (u32): a unique identifier within that owner's domain, so each owner can keep its own independent numbering scheme for referencing stored data.

Together, a combination such as `{owner: 1, id: 100}` forms a unique reference that:

- tracks which entities are referencing a particular piece of data
- lets the system know when data can be safely deleted (when no references remain)
- gives different components a way to maintain their own ID systems without conflicts

The dedupestor uses these references to implement reference counting. When data is stored, a reference is attached to it. When all references to a piece of data have been removed (via the delete method), the actual data can be safely deleted from storage.

This design allows for efficient deduplication: if the same data is stored multiple times with different references, it is physically stored only once, while the system keeps track of all the references to it.

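A minimal sketch of the reference-counting idea follows. It is not the module's API (this document does not show the actual store/delete signatures); the `Ref`, `RefCounter`, `attach`, and `detach` names are hypothetical and only illustrate when data becomes safe to delete.

```v
// Conceptual sketch only: track which (owner, id) pairs reference each hash,
// and report when a hash has no references left and its data may be deleted.

struct Ref {
	owner u16
	id    u32
}

struct RefCounter {
mut:
	refs map[string][]Ref // hash -> references currently attached to that data
}

// attach records a reference, ignoring duplicates.
fn (mut rc RefCounter) attach(hash string, r Ref) {
	mut lst := rc.refs[hash] or { []Ref{} }
	if r !in lst {
		lst << r
	}
	rc.refs[hash] = lst
}

// detach removes a reference and reports whether the data is now unreferenced.
fn (mut rc RefCounter) detach(hash string, r Ref) bool {
	current := rc.refs[hash] or { []Ref{} }
	remaining := current.filter(it != r)
	rc.refs[hash] = remaining
	return remaining.len == 0
}

fn main() {
	mut rc := RefCounter{}
	h := 'abc123' // stands in for a blake2b-160 content hash
	rc.attach(h, Ref{owner: 1, id: 100})
	rc.attach(h, Ref{owner: 2, id: 7})
	println(rc.detach(h, Ref{owner: 1, id: 100})) // false: owner 2 still references the data
	println(rc.detach(h, Ref{owner: 2, id: 7})) // true: no references remain, data may be deleted
}
```
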
## Testing

The module includes comprehensive tests covering:

- Basic store/retrieve operations
- Deduplication functionality
- Size limit enforcement
- Edge cases

Run the tests with:

```bash
v test lib/data/dedupestor/
```