info_tfgrid/collections/technology/qsss/manual/qsfs_setup.md

298 lines
11 KiB
Markdown

# QSFS getting started on ubuntu setup
## Get components
The following steps can be followed to set up a qsfs instance on a fresh
ubuntu instance.
- Install the fuse kernel module (`apt-get update && apt-get install fuse3`)
- Install the individual components, by downloading the latest release from the
respective release pages:
- 0-db-fs: https://github.com/threefoldtech/0-db-fs/releases
- 0-db: https://github.com/threefoldtech/0-db, if multiple binaries
are available in the assets, choose the one ending in `static`
- 0-stor: https://github.com/threefoldtech/0-stor_v2/releases, if
multiple binaries are available in the assets, choose the one
ending in `musl`
- Make sure all binaries are executable (`chmod +x $binary`)
## Setup and run 0-stor
There are instructions below for a local 0-stor configuration. You can also deploy an eVDC and use the [provided 0-stor configuration](evdc_storage) for a simple cloud hosted solution.
We will run 6 0-db instances as backends for 0-stor. 4 are used for the
metadata, 2 are used for the actual data. The metadata always consists
of 4 nodes. The data backends can be increased. You can choose to either
run 7 separate 0-db processes, or a single process with 7 namespaces.
For the purpose of this setup, we will start 7 separate processes, as
such:
> This assumes you have moved the download 0-db binary to `/tmp/0-db`
```bash
/tmp/0-db --background --mode user --port 9990 --data /tmp/zdb-meta/zdb0/data --index /tmp/zdb-meta/zdb0/index
/tmp/0-db --background --mode user --port 9991 --data /tmp/zdb-meta/zdb1/data --index /tmp/zdb-meta/zdb1/index
/tmp/0-db --background --mode user --port 9992 --data /tmp/zdb-meta/zdb2/data --index /tmp/zdb-meta/zdb2/index
/tmp/0-db --background --mode user --port 9993 --data /tmp/zdb-meta/zdb3/data --index /tmp/zdb-meta/zdb3/index
/tmp/0-db --background --mode seq --port 9980 --data /tmp/zdb-data/zdb0/data --index /tmp/zdb-data/zdb0/index
/tmp/0-db --background --mode seq --port 9981 --data /tmp/zdb-data/zdb1/data --index /tmp/zdb-data/zdb1/index
/tmp/0-db --background --mode seq --port 9982 --data /tmp/zdb-data/zdb2/data --index /tmp/zdb-data/zdb2/index
```
Now that the data storage is running, we can create the config file for
0-stor. The (minimal) config for this example setup will look as follows:
```toml
minimal_shards = 2
expected_shards = 3
redundant_groups = 0
redundant_nodes = 0
socket = "/tmp/zstor.sock"
prometheus_port = 9100
zdb_data_dir_path = "/tmp/zdbfs/data/zdbfs-data"
max_zdb_data_dir_size = 25600
[encryption]
algorithm = "AES"
key = "000001200000000001000300000004000a000f00b00000000000000000000000"
[compression]
algorithm = "snappy"
[meta]
type = "zdb"
[meta.config]
prefix = "someprefix"
[meta.config.encryption]
algorithm = "AES"
key = "0101010101010101010101010101010101010101010101010101010101010101"
[[meta.config.backends]]
address = "[::1]:9990"
[[meta.config.backends]]
address = "[::1]:9991"
[[meta.config.backends]]
address = "[::1]:9992"
[[meta.config.backends]]
address = "[::1]:9993"
[[groups]]
[[groups.backends]]
address = "[::1]:9980"
[[groups.backends]]
address = "[::1]:9981"
[[groups.backends]]
address = "[::1]:9982"
```
> A full explanation of all options can be found in the 0-stor readme:
https://github.com/threefoldtech/0-stor_v2/#config-file-explanation
This guide assumes the config file is saved as `/tmp/zstor_config.toml`.
Now `zstor` can be started. Assuming the downloaded binary was saved as
`/tmp/zstor`:
`/tmp/zstor -c /tmp/zstor_config.toml monitor`. If you don't want the
process to block your terminal, you can start it in the background:
`nohup /tmp/zstor -c /tmp/zstor_config.toml monitor &`.
## Setup and run 0-db
First we will get the hook script. The hook script can be found in the
[quantum_storage repo on github](https://github.com/threefoldtech/quantum-storage).
A slightly modified version is found here:
```bash
#!/usr/bin/env bash
set -ex
action="$1"
instance="$2"
zstorconf="/tmp/zstor_config.toml"
zstorbin="/tmp/zstor"
if [ "$action" == "ready" ]; then
${zstorbin} -c ${zstorconf} test
exit $?
fi
if [ "$action" == "jump-index" ]; then
namespace=$(basename $(dirname $3))
if [ "${namespace}" == "zdbfs-temp" ]; then
# skipping temporary namespace
exit 0
fi
tmpdir=$(mktemp -p /tmp -d zdb.hook.XXXXXXXX.tmp)
dirbase=$(dirname $3)
# upload dirty index files
for dirty in $5; do
file=$(printf "i%d" $dirty)
cp ${dirbase}/${file} ${tmpdir}/
done
${zstorbin} -c ${zstorconf} store -s -d -f ${tmpdir} -k ${dirbase} &
exit 0
fi
if [ "$action" == "jump-data" ]; then
namespace=$(basename $(dirname $3))
if [ "${namespace}" == "zdbfs-temp" ]; then
# skipping temporary namespace
exit 0
fi
# backup data file
${zstorbin} -c ${zstorconf} store -s --file "$3"
exit 0
fi
if [ "$action" == "missing-data" ]; then
# restore missing data file
${zstorbin} -c ${zstorconf} retrieve --file "$3"
exit $?
fi
# unknown action
exit 1
```
> This guide assumes the file is saved as `/tmp/zdbfs/zdb-hook.sh. Make sure the
> file is executable, i.e. chmod +x /tmp/zdbfs/zdb-hook.sh`
The local 0-db which is used by 0-db-fs can be started as follows:
```bash
/tmp/0-db \
--index /tmp/zdbfs/index \
--data /tmp/zdbfs/data \
--datasize 67108864 \
--mode seq \
--hook /tmp/zdbfs/zdb-hook.sh \
--background
```
## Setup and run 0-db-fs
Finally, we will start 0-db-fs. This guides opts to mount the fuse
filesystem in `/mnt`. Again, assuming the 0-db-fs binary was saved as
`/tmp/0-db-fs`:
```bash
/tmp/0-db-fs /mnt -o autons -o background
```
You should now have the qsfs filesystem mounted at `/mnt`. As you write
data, it will save it in the local 0-db, and it's data containers will
be periodically encoded and uploaded to the backend data storage 0-db's.
The data files in the local 0-db will never occupy more than 25GiB of
space (as configured in the 0-stor config file). If a data container is
removed due to space constraints, and data inside of it needs to be
accessed by the filesystem (e.g. a file is being read), then the data
container is recovered from the backend storage 0-db's by 0-stor, and
0-db can subsequently serve this data to 0-db-fs.
### 0-db-fs limitation
Any workload should be supported on this filesystem, with some exceptions:
- Opening a file in 'always append mode' will not have the expected behavior
- There is no support of O_TMPFILE by fuse layer, which is a feature required by
overlayfs, thus this is not supported. Overlayfs is used by Docker for example.
## docker setup
It is possible to run the zstor in a docker container. First, create a data directory
on your host. Then, save the config file in the data directory as `zstor.toml`. Ensure
the storage 0-db's are running as desribed above. Then, run the docker container
as such:
```
docker run -ti --privileged --rm --network host --name fstest -v /path/to/data:/data -v /mnt:/mnt:shared azmy/qsfs
```
The filesystem is now available in `/mnt`.
## Autorepair
Autorepair automatically repairs object stored in the backend when one or more shards
are not reachable anymore. It does this by periodically checking if all the backends
are still reachable. If it detects that one or more of the backends used by an encoded
object are not reachable, the healthy shards are downloaded, the object is restored
and encoded again (possibly with a new config, if it has since changed), and uploaded
again.
Autorepair does not validate the integrity of individual shards. This is protectected
against by having multiple spare (redundant) shards for an object. Corrupt shards
are detected when the object is rebuild, and removed before attempting to rebuild.
Autorepair also does not repair the metadata of objects.
## Monitoring, alerting and statistics
0-stor collects metrics about the system. It can be configured with a 0-db-fs mountpoint,
which will trigger 0-stor to collect 0-db-fs statistics, next to some 0-db statistics
which are always collected. If the `prometheus_port` config option is set, 0-stor
will serve metrics on this port for scraping by prometheus. You can then set up
graphs and alerts in grafana. Some examples include: disk space used vs available
per 0-db backend, total entries in 0-db backends, which backends are tracked, ...
When 0-db-fs monitoring is enabled, statistics are also exported about the filesystem
itself, such as read/write speeds, syscalls, and internal metrics
For a full overview of all available stats, you can set up a prometheus scraper against
a running instance, and use the embedded promQl to see everything available.
## Data safety
As explained in the auto repair section, data is periodically checked and rebuild if
0-db backends become unreachable. This ensures that data, once stored, remains available,
as long as the metadata is still present. When needed, the system can be expanded with more
0-db backends, and the encoding config can be changed if needed (e.g. to change encryption keys).
## Performance
Qsfs is not a high speed filesystem, nor is it a distributed filesystem. It is intended to
be used for archive purposes. For this reason, the qsfs stack focusses on data safety first.
Where needed, reliability is chosen over availability (i.e. we won't write data if we can't
guarantee all the conditions in the required storage profile is met).
With that being said, there are currently 2 limiting factors in the setup:
- speed of the disk on which the local 0-db is running
- network
The first is the speed of the disk for the local 0-db. This imposes a hard limit on
the throughput of the filesystem. Performance testing has shown that write speeds
on the filesystem reach performance of roughly 1/3rd of the raw performance of the
disk for writing, and 1/2nd of the read performance. Note that in the case of _very_
fast disks (mostly NVMe SSD's), the cpu might become a bottleneck if it is old and
has a low clock speed. Though this should not be a problem.
The network is more of a soft cap. All 0-db data files will be encoded and distributed
over the network. This means that the upload speed of the node needs to be able to
handle this data througput. In the case of random data (which is not compressable),
the required upload speed would be the write speed of the 0-db-fs, increased by the
overhead generated by the storage policy. There is no feedback to 0-db-fs if the upload
of data is lagging behind. This means that in cases where a sustained high speed write
load is applied, the local 0-db might eventually grow bigger than the configured size limit
until the upload managed to catch up. If this happens for prolonged periods of time, it
is technically possible to run out of space on the disk. For this reason, you should
always have some extra space available on the disk to account for temprorary cache
excess.
When encoded data needs to be recovered from backend nodes (if it is not in cache),
the read speed will be equal to the connection speed of the slowest backend, as all
shards are recovered before the data is build. This means that recovery of historical
data will generally be a slow process. Since we primarily focus on archive storage,
we do not consider this a priority.