Content Processing Pipeline

This document describes the pipeline for processing raw content files (markdown and images) from collection directories before they are made available on IPFS and referenced in the website's metadata. This process ensures content integrity, enables encryption, and prepares files for decentralized storage.

Pipeline Steps

The content processing pipeline involves the following steps for each file found within a collection directory (identified by the presence of a .collection file):

File Discovery: Identify all files directly within a collection directory. Subdirectories and their contents are ignored.
Filename Normalization: The filename is converted to lowercase and snake_case. This normalized filename is used for addressing the content within the website (as part of the page path).
Original Content Hashing (Blake): The original, unencrypted content of the file is hashed with a Blake-family hash function (the exact variant is not specified here). This Blake hash serves two purposes:
- It acts as the symmetric encryption key for the file content.
- It is stored in the metadata as part of the key to retrieve and decrypt the content.
Content Encryption: The original file content is encrypted with a symmetric encryption scheme, using the Blake hash calculated in the previous step as the key. This document does not fix the algorithm; it should be a well-reviewed, standard scheme rather than a custom construction.
Encrypted Content Upload to IPFS: The encrypted content is uploaded to IPFS, which returns a Content Identifier (CID) derived from a cryptographic hash of the encrypted content. The CID is the address from which the encrypted file can later be retrieved.
Metadata Key Generation: A key is generated for the metadata by concatenating the Blake hash (the encryption key) and the IPFS hash (CID) of the encrypted content, separated by a space. This key is stored as a single string in the pages list metadata, associated with the normalized page path (collection_name/normalized_filename_without_extension).
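
The non-cryptographic parts of the steps above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the project's implementation: BLAKE2b with a 32-byte digest stands in for the unspecified Blake variant, the snake_case rule is a guess at what "normalized" means, the .collection marker is assumed not to count as content, and all function names are hypothetical. Encryption and the IPFS upload are left out, so the CID is simply passed in.

```python
import hashlib
import re
from pathlib import Path

MARKER = ".collection"  # marker file named in the text


def discover_files(collection_dir: Path) -> list[Path]:
    """Return files directly inside a collection directory (no recursion)."""
    if not (collection_dir / MARKER).is_file():
        raise ValueError(f"{collection_dir} is not a collection directory")
    # Assumption: the marker file itself is not treated as content.
    return sorted(
        p for p in collection_dir.iterdir()
        if p.is_file() and p.name != MARKER
    )


def normalize_filename(name: str) -> str:
    """Lowercase snake_case form of a filename without its extension."""
    stem = Path(name).stem
    # Assumed rule: split camelCase, then collapse other characters to "_".
    stem = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", stem)
    stem = re.sub(r"[^A-Za-z0-9]+", "_", stem)
    return stem.strip("_").lower()


def blake_hash(content: bytes) -> str:
    """Hex digest of the original content (assumed variant: BLAKE2b-256)."""
    return hashlib.blake2b(content, digest_size=32).hexdigest()


def metadata_entry(collection: str, filename: str,
                   content: bytes, cid: str) -> tuple[str, str]:
    """Build (page_path, key) where key is '<blake_hash> <cid>'."""
    page_path = f"{collection}/{normalize_filename(filename)}"
    return page_path, f"{blake_hash(content)} {cid}"
```

For example, `metadata_entry("blog", "My First Post.md", data, cid)` would produce the page path `blog/my_first_post` and a key consisting of a 64-character hex hash, a space, and the given CID.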

Metadata Key Format

As described in specs/metadata_structure.md, the key stored in the pages list metadata for each page will be a single string with the following format:

blake_hash encrypted_ipfs_hash

For example:

b2b6d... Qm...

The Blake hash has a fixed length that depends on the Blake variant used. Together with the space delimiter, this fixed length lets a parser split the key string reliably and reject malformed keys.

This format allows the browser-hosted website to:

Parse the key string to extract the blake_hash and encrypted_ipfs_hash.
Retrieve the encrypted content from IPFS using the encrypted_ipfs_hash.
Decrypt the retrieved content using the blake_hash as the encryption key.
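
The parsing step in the list above might look like the following sketch. It is written in Python only to illustrate the logic; the browser-hosted site would do the equivalent in its own language. The 64-character hex length is an assumption (a 32-byte Blake digest), and `PageKey` and `parse_page_key` are hypothetical names.

```python
import string
from typing import NamedTuple

BLAKE_HEX_LEN = 64  # assumption: 32-byte digest, hex-encoded


class PageKey(NamedTuple):
    blake_hash: str
    encrypted_ipfs_hash: str


def parse_page_key(key: str) -> PageKey:
    """Split '<blake_hash> <encrypted_ipfs_hash>' and validate both parts."""
    blake_hash, sep, cid = key.partition(" ")
    if not sep or not cid or " " in cid:
        raise ValueError("expected '<blake_hash> <encrypted_ipfs_hash>'")
    # The fixed hash length doubles as a sanity check on the split.
    if len(blake_hash) != BLAKE_HEX_LEN:
        raise ValueError("unexpected Blake hash length")
    if not all(c in string.hexdigits for c in blake_hash):
        raise ValueError("Blake hash is not hexadecimal")
    return PageKey(blake_hash, cid)
```

Retrieval and decryption would then use `encrypted_ipfs_hash` and `blake_hash` respectively; both depend on choices (IPFS client, cipher) that this document leaves open.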

Security Implications

Confidentiality: Encrypting the content ensures that only someone with the correct Blake hash (the key) can decrypt and view the original content.
Integrity: The IPFS hash of the encrypted content guarantees that the retrieved encrypted data has not been tampered with. The Blake hash of the original content provides an additional integrity check after decryption.
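Key Management: The security of this system relies heavily on how the Blake hash (the encryption key) is managed and distributed. Because the hash is stored in the public metadata, anyone who can read the metadata can decrypt the corresponding content; confidentiality holds only against parties who see the encrypted data on IPFS without the metadata. The design also assumes that the original content cannot easily be guessed or brute-forced from its Blake hash alone, and that the encryption scheme is strong.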
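
The additional integrity check mentioned under Integrity could be done by re-hashing the decrypted bytes and comparing the result with the hash from the metadata key. The sketch below again assumes BLAKE2b with a 32-byte digest as the Blake variant; `verify_decrypted` is a hypothetical helper name.

```python
import hashlib
import hmac


def verify_decrypted(plaintext: bytes, expected_blake_hex: str) -> bool:
    """Return True if the decrypted content hashes to the expected value."""
    # Assumed variant: BLAKE2b with a 32-byte digest, hex-encoded.
    actual = hashlib.blake2b(plaintext, digest_size=32).hexdigest()
    # compare_digest avoids leaking the mismatch position through timing.
    return hmac.compare_digest(actual, expected_blake_hex.lower())
```

A mismatch would indicate either a wrong key, corrupted data, or a metadata entry that does not belong to the retrieved content.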

This processing pipeline ensures that content is securely stored and retrieved, leveraging the decentralized and immutable nature of IPFS while adding a layer of confidentiality through encryption.
