herolib/libarchive/dedupestor/dedupe_ourdb/README.md

# DedupeStore

DedupeStore is a content-addressable key-value store with built-in deduplication. It uses blake2b-160 content hashing to identify and deduplicate data, making it ideal for storing files or data blocks where the same content might appear multiple times.

## Features

- Content-based deduplication using blake2b-160 hashing
- Efficient storage using RadixTree for hash lookups
- Persistent storage using OurDB
- Maximum value size limit of 1MB
- Fast retrieval of data using content hash
- Automatic deduplication of identical content

## Usage

```v
import incubaid.herolib.data.dedupestor

// Create a new dedupestore
mut ds := dedupestor.new(
    path: 'path/to/store'
    reset: false // Set to true to reset existing data
)!

// Store some data
data := 'Hello, World!'.bytes()
hash := ds.store(data)!
println('Stored data with hash: ${hash}')

// Retrieve data using hash
retrieved := ds.get(hash)!
println('Retrieved data: ${retrieved.bytestr()}')

// Check if data exists
exists := ds.exists(hash)
println('Data exists: ${exists}')

// Attempting to store the same data again returns the same hash
same_hash := ds.store(data)!
assert hash == same_hash // True, data was deduplicated

```

## Implementation Details

DedupeStore uses two main components for storage:

1. **RadixTree**: Stores mappings from content hashes to data location IDs
2. **OurDB**: Stores the actual data blocks

When storing data:

1. The data is hashed using blake2b-160
2. If the hash exists in the RadixTree, the existing data location is returned
3. If the hash is new:
   - Data is stored in OurDB, getting a new location ID
   - Hash -> ID mapping is stored in RadixTree
   - The hash is returned

When retrieving data:

1. The RadixTree is queried with the hash to get the data location ID
2. The data is retrieved from OurDB using the ID

## Size Limits

- Maximum value size: 1MB
- Attempting to store larger values will result in an error

## the reference field

In the dedupestor system, the Reference struct is defined with two fields:

```v
pub struct Reference {
pub:
 owner u16
 id u32
}
```

The purpose of the id field in this context is to serve as an identifier within a specific owner's domain. Here's what each field represents:

owner (u16): Identifies which entity or system component "owns" or is referencing the data. This could represent different applications, users, or subsystems that are using the dedupestor.
id (u32): A unique identifier within that owner's domain. This allows each owner to have their own independent numbering system for referencing stored data.
Together, the {owner: 1, id: 100} combination creates a unique reference that:

Tracks which entities are referencing a particular piece of data
Allows the system to know when data can be safely deleted (when no references remain)
Provides a way for different components to maintain their own ID systems without conflicts
The dedupestor uses these references to implement a reference counting mechanism. When data is stored, a reference is attached to it. When all references to a piece of data are removed (via the delete method), the actual data can be safely deleted from storage.

This design allows for efficient deduplication - if the same data is stored multiple times with different references, it's only physically stored once, but the system keeps track of all the references to it.

## Testing

The module includes comprehensive tests covering:

- Basic store/retrieve operations
- Deduplication functionality
- Size limit enforcement
- Edge cases

Run tests with:

```bash
v test lib/data/dedupestor/