productize CI/CD on TFGrid #28

Open
opened 2024-03-20 13:08:42 +00:00 by sashaastiadi · 10 comments
Owner
  • CI/CD can use lots of our capacity and it's perfect for our deployment
  • supports Gitea & GitHub
  • for now we only support the GitHub Actions format, but eventually we want to support other formats, e.g. Woodpecker (which in my opinion is much nicer)

todo

  • define the product
  • make a quick website using our framework
  • create a manual specific for this use case
  • find a new name for it, and then at the bottom say "powered by TF"
  • do a competitive review, see where we excel

VMs needed

  • VM: Act Runner (Gitea Actions runner): see https://docs.gitea.com/usage/actions/overview, has Mycelium inside
  • VM: GitHub Actions runners: see https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/about-self-hosted-runners

Requirements

  • mycelium inside
  • integrated in tfrobot
  • users need to be able to specify the nodes they want to have (as in tfrobot for easy deploy); based on this selection and the minimum required nodes, it launches the VMs and keeps them up and running
  • users can specify a min and max number of VMs

optional

  • tfrobot monitors CPU utilization of the VMs; if it's too high (meaning too many deployments), it will extend and add more VMs. How do we communicate with GitHub / Gitea? (a rough sketch of this scaling logic follows below)
  • support Woodpecker
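
A rough sketch of how the optional autoscaling could look, in Python. The get_cpu_percent, deploy_vm and remove_vm helpers are hypothetical placeholders for whatever tfrobot / grid calls we end up using; only the scale-up / scale-down logic itself is meant to be illustrative.

```
# Hypothetical autoscaler pass: the helper functions are placeholders, not real
# tfrobot APIs. Only the scaling decision logic is shown.

MIN_VMS = 2       # user-specified minimum number of runner VMs
MAX_VMS = 10      # user-specified maximum number of runner VMs
CPU_HIGH = 80.0   # average CPU % above which we add a VM
CPU_LOW = 20.0    # average CPU % below which we remove a VM

def autoscale(vms, get_cpu_percent, deploy_vm, remove_vm):
    """One pass of the scaling loop; call this e.g. every minute."""
    if not vms:
        return
    avg_cpu = sum(get_cpu_percent(vm) for vm in vms) / len(vms)
    if avg_cpu > CPU_HIGH and len(vms) < MAX_VMS:
        vms.append(deploy_vm())      # too many jobs running: extend with one more VM
    elif avg_cpu < CPU_LOW and len(vms) > MIN_VMS:
        remove_vm(vms.pop())         # idle capacity: scale back down toward the minimum
```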

questions

  • can we do multi-architecture builds (arm64/amd64), i.e. ARM on AMD through QEMU? I think Cloud Hypervisor can do it
  • can it run on Mycelium + NAT only? I hope we don't need an IPv4 address

remark

  • will only use TFT

ideas

  • we will support multiple types of runners in future
Author
Owner

should we consider zeroCI? It was working natively on the grid with a nice interface.

Author
Owner

of course we should make it part of the package; people can select which one they deploy. Let's do a demo again

Author
Owner

@despiegk I think this is amazing. One question concerning this:

"create manual specific for this usecase"

Do you want a whole new mdbook, or do we simply add a section to the TF Manual? (info_grid)

For now, we could quickly add a new section to the manual on zeroCI (with the demo). What do you think?

Author
Owner

questions

  • can we do multi-architecture builds (arm64/amd64), i.e. ARM on AMD through QEMU? I think Cloud Hypervisor can do it

The usual way of doing this is using Docker's buildx to handle QEMU and do the cross compiling inside a container. This should just work inside our VMs, with no need to emulate an entire VM.
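
As a concrete illustration (not tested on the grid yet), a job step could drive buildx from Python roughly like this. The registry/tag is a placeholder, and the commands follow Docker's multi-platform build documentation:

```
# Illustrative multi-arch build via buildx + QEMU from inside an amd64 VM.
import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Register QEMU emulators so arm64 build stages can run on the amd64 host.
run(["docker", "run", "--privileged", "--rm", "tonistiigi/binfmt", "--install", "all"])

# Create and select a builder instance that supports multi-platform builds.
run(["docker", "buildx", "create", "--use"])

# Cross-build for both architectures and push the result to a registry.
run(["docker", "buildx", "build",
     "--platform", "linux/amd64,linux/arm64",
     "-t", "registry.example.com/myapp:latest",
     "--push", "."])
```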

  • can it run on Mycelium + NAT only? I hope we don't need an IPv4 address

For GitHub Actions, the runners (VMs) only need to be able to connect to GitHub via an outbound HTTPS connection (see https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/about-self-hosted-runners#communication-between-self-hosted-runners-and-github). Probably Gitea Actions works the same.

It's worth noting though that GitHub does allocate IP addresses to their runner VMs, and optionally static ones (https://docs.github.com/en/actions/using-github-hosted-runners/about-larger-runners/about-larger-runners#networking-for-larger-runners). I guess these are there for management of the runners as needed.

do competitive review, see where we excel

The obvious place we can compete is on price. Github's runners are marked up significantly as a convenience product.

Here's a post that not only covers price but a couple other potential benefits and does some comparison against hosting runners on AWS: https://www.linkedin.com/pulse/how-we-saved-15k-month-github-actions-part-1-trunkio

The figures given in the post for AWS machines are for spot pricing, not on demand, but the difference is still significant (price per minute for the base level machine of 2vcpu and 8gb ram):

GitHub hosted runner:         $.008
AWS m5.large on-demand:       $.0016
Grid pricing (no discount):   $.0003

I didn't find any services yet specifically offering just hosted Actions runners for price comparison. This might be part of the offering of some CI/CD services though.
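
To make the difference concrete, here's a quick back-of-the-envelope calculation with the per-minute prices above, assuming a team that uses 10,000 build minutes per month:

```
# Monthly cost at the per-minute prices listed above, for 10,000 build minutes.
PRICE_PER_MINUTE = {
    "GitHub hosted runner": 0.008,
    "AWS m5.large": 0.0016,
    "Grid (no discount)": 0.0003,
}

minutes = 10_000
for name, price in PRICE_PER_MINUTE.items():
    print(f"{name:22s} ${price * minutes:7.2f} / month")

# GitHub hosted runner   $  80.00 / month
# AWS m5.large           $  16.00 / month
# Grid (no discount)     $   3.00 / month
```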

Author
Owner

I did some research and testing of deploying Github and Gitea actions runners in Grid VMs.

For Github, dealing with runner tokens is somewhat complicated and leads to some limitations I describe toward the end.

The situation for Gitea seems better however. As of a recent change (https://github.com/go-gitea/gitea/pull/27304), runner registration tokens are now reusable on Gitea until revoked.

Here are my summarized findings about how runners work (some aspects are Github specific, but most probably applies to Gitea as well):

  • A runner is a process that can execute a single workflow job at a time. Multiple runners can exist on a single machine, sharing all available resources of the machine
  • Runners belong to one of three levels: repo, org, or enterprise
  • Jobs specify what type of runner they must use via a set of labels given in the runs-on field of the job. All self-hosted runners are labeled self-hosted and other labels can be added (runners must match all labels of a job)
  • When a job is triggered, it will be assigned to any open runner meeting the criteria. If no runner is available, it will be queued for up to 24 hours
  • Github provides an autoscaling service for runners based on Kubernetes (probably not the approach for us, and we would still need to autoscale the cluster itself)
  • There are webhooks (https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/autoscaling-with-self-hosted-runners#using-webhooks-for-autoscaling) that deliver info about job status and the needed machine type. I think we could ingest these webhooks through our gateways, but I'm not sure (a rough handler sketch follows this list). There are also REST endpoints for querying workflow run data, but I don't see any that provide the same data in a simple form like the webhooks
  • It's recommended to use "ephemeral" runners, which execute at most one job, for self-implemented autoscaling solutions (this is also a way to provide a fresh environment to each job)
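
To illustrate the webhook idea, here is a minimal, untested handler for GitHub's workflow_job events. It assumes the webhook can actually reach us (e.g. through a grid gateway) and leaves out the signature verification (X-Hub-Signature-256) that a real deployment would need:

```
# Minimal sketch: ingest workflow_job webhooks and react to queued/completed jobs.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        event = self.headers.get("X-GitHub-Event", "")
        payload = json.loads(body or b"{}")

        if event == "workflow_job":
            action = payload.get("action")          # queued / in_progress / completed
            labels = payload.get("workflow_job", {}).get("labels", [])
            if action == "queued":
                print("job queued, wants labels:", labels)   # -> spin up a matching VM
            elif action == "completed":
                print("job completed")                        # -> tear the VM down

        self.send_response(204)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), WebhookHandler).serve_forever()
```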

Some Gitea specific notes

There are currently no API endpoints or webhooks for managing actions or runners in Gitea.

Some work is ongoing though:

https://github.com/go-gitea/gitea/issues/23796
https://github.com/go-gitea/gitea/issues/25572
https://github.com/go-gitea/gitea/issues/26370

There also seems to be no option for one time use or ephemeral runners. So scaling down runners while being sure not to interrupt any ongoing jobs could be a challenge on Gitea.

Given all that, here's a possible approach:

  1. User deploys a manager service we develop that wraps or extends tfcmd/tfrobot.
    • Required config:
      • Runner registration token (or API token, see caveats below)
      • Seed phrase with funded twin to deploy workers
    • Maybe optional config:
      • Which farms or nodes to run workers on
      • Max run time for job before terminating
      • Max worker count (in case of issues accidentally spawning a lot of jobs, maybe not needed)
  2. When a workflow is triggered from the user's repo, the manager detects the queued jobs and spins up the appropriate worker VMs with an ephemeral runner inside (a rough sketch follows this list). Once each worker finishes its job (or potentially after a time limit has passed in case of "stuck" jobs), the VM is destroyed
  3. Worker VMs can be of varying capacity levels, with a reasonable default (Github default runner is 4cpu 16gb ram, which might be a bit big). The manager checks the label and deploys the correct capacity VM:
    • Some standard labels (small, medium, large, etc)
    • Maybe also parse labels for custom capacity types (4cpu8gbram50gbssd for example, where the user can adjust the quantities freely. This is nice because no additional config needs to be passed to the manager up front: it can respond dynamically to whatever capacity requests the user adds to their jobs)
  4. The worker images should include at least Ubuntu (there's a list of supported distros for the runner app at https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/about-self-hosted-runners#linux). These should probably ship with Docker to support Docker features in workflows
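
To make steps 2 and 3 more concrete, here's an untested sketch of the polling side of such a manager: it lists queued workflow runs and their jobs through the REST API, maps job labels to a VM size (including the custom 4cpu8gbram50gbssd style), and hands off to a hypothetical deploy function that would wrap tfcmd/tfrobot. The size presets are just example values, and pagination and error handling are omitted:

```
# Sketch only: endpoint paths follow the GitHub REST docs; deploy function is hypothetical.
import re
import requests

API = "https://api.github.com"
STANDARD_SIZES = {          # example presets, not an established convention
    "small":  {"cpu": 1, "ram_gb": 2,  "ssd_gb": 25},
    "medium": {"cpu": 2, "ram_gb": 4,  "ssd_gb": 50},
    "large":  {"cpu": 4, "ram_gb": 16, "ssd_gb": 100},
}
CUSTOM = re.compile(r"^(\d+)cpu(\d+)gbram(\d+)gbssd$")   # e.g. 4cpu8gbram50gbssd

def capacity_from_labels(labels, default=STANDARD_SIZES["medium"]):
    for label in labels:
        if label in STANDARD_SIZES:
            return STANDARD_SIZES[label]
        m = CUSTOM.match(label)
        if m:
            cpu, ram, ssd = map(int, m.groups())
            return {"cpu": cpu, "ram_gb": ram, "ssd_gb": ssd}
    return default

def queued_jobs(owner, repo, token):
    headers = {"Authorization": f"Bearer {token}"}
    runs = requests.get(f"{API}/repos/{owner}/{repo}/actions/runs",
                        params={"status": "queued"}, headers=headers).json()
    for run in runs.get("workflow_runs", []):
        jobs = requests.get(f"{API}/repos/{owner}/{repo}/actions/runs/{run['id']}/jobs",
                            headers=headers).json()
        for job in jobs.get("jobs", []):
            if job.get("status") == "queued":
                yield job

def reconcile(owner, repo, token, deploy_ephemeral_runner_vm):
    # deploy_ephemeral_runner_vm(capacity) is hypothetical: it would deploy a VM of the
    # requested size with an ephemeral runner inside, which is destroyed after its one job.
    for job in queued_jobs(owner, repo, token):
        deploy_ephemeral_runner_vm(capacity_from_labels(job.get("labels", [])))
```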

Caveats (Github only)

  • Runner registration tokens can be obtained in two ways, one is through the UI and the other is via the API. Both are reusable but expire in one hour
  • Generating runner tokens via the API requires an OAuth token scoped with "repo" (full access, read/write/delete/etc) for repos (https://docs.github.com/en/rest/actions/self-hosted-runners?apiVersion=2022-11-28#create-configuration-for-a-just-in-time-runner-for-a-repository) or admin access for orgs
  • But, for orgs there's also a "fine grained" token permission (https://docs.github.com/en/rest/authentication/permissions-required-for-fine-grained-personal-access-tokens?apiVersion=2022-11-28#organization-permissions-for-self-hosted-runners) that only gives access to runner related endpoints
  • The API token used to generate the runner registration tokens will have an expiration date. This can be set to a custom date years in the future, but if the token does expire then no more runners can be created

This means that repo specific runners need a level of access that we probably shouldn't be asking users for. Org level on the other hand is doable. This isn't a huge limitation, but overall it requires additional config from the user and additional code in the manager app to refresh the runner registration tokens every hour.
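
For reference, the hourly refresh could be as simple as the following (untested) call against the documented org-level endpoint, using a fine-grained PAT with the self-hosted runners permission. The returned token is what each new runner VM uses to register itself:

```
# Sketch of fetching a fresh org-level runner registration token.
import requests

def fetch_registration_token(org: str, api_token: str) -> str:
    resp = requests.post(
        f"https://api.github.com/orgs/{org}/actions/runners/registration-token",
        headers={
            "Authorization": f"Bearer {api_token}",
            "Accept": "application/vnd.github+json",
        },
    )
    resp.raise_for_status()
    # Registration tokens expire after roughly an hour, so the manager would
    # re-fetch one shortly before booting each new runner VM (or on a timer).
    return resp.json()["token"]
```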

See also this discussion: https://github.com/orgs/community/discussions/53361

Author
Owner

Here's a project offering GitHub runners. It's self-hostable under the GNU Affero license:

https://www.ubicloud.com/use-cases/github-actions

Their managed service starts at $.0008, so 10x less than Github but still more than the Grid.

Source code repo too:

https://github.com/ubicloud/ubicloud

Maybe we can learn something by studying it.

Here are a few others:

https://depot.dev/
https://buildjet.com/
https://www.warpbuild.com/

Author
Owner

Here's another open source solution, for autoscaling runners on AWS: https://github.com/philips-labs/terraform-aws-github-runner

Some other notes and thoughts from my research:

  • Github runners are free for public repos, which means there's no market in open source projects on Github
  • Enterprises with private repos are going to be security conscious, and building their code in environments without some SOC (https://en.wikipedia.org/wiki/System_and_Organization_Controls) might be a hard sell
  • I noticed that some bigger open source projects are using Gitlab on their own subdomain (like https://gitlab.alpinelinux.org/), but it's not immediately obvious if these are actually self hosted instances or not

So I think we need to consider our positioning (Zos may be secure, but any residual data on farmers' disks is not necessarily) and whether we can reach a sufficient market to make this worthwhile.

Author
Owner

I think we can increase the appeal of Grid hosted runners by providing two optional features to protect user data:

  1. Encryption of any data written to disk in the course of executing runner jobs (encrypted with ephemeral keys generated by each runner VM after it boots and only stored in RAM)
  2. Use of RAM disk for storing user data

These approaches both have the advantage that any sensitive data that might be processed during runner jobs is inaccessible after the node is powered down. That means there's no chance of anyone harvesting user data from the node disks.

For encryption there are a number of options, with these looking most promising:

  1. gocryptfs - this is a user space option that can be deployed on top of any filesystem, meaning it works in our micro VMs out of the box. Performance seems good (if their own benchmarks are to be believed)
  2. fscrypt - this is using the native encryption capabilities of ext4, for example. Won't work in a micro VM because the kernel we supply doesn't have proper support
  3. dm-crypt - perhaps the best performance, but works on a full block device so is a bit less flexible (I didn't test it yet)

As for RAM disks, there are also a few options. There's a comparison at https://unix.stackexchange.com/a/491900. The brd kernel module again seems not to be an option in our micro VMs, but tmpfs should be fine I think (assuming no swap). A setup sketch for both approaches follows below.

The remaining challenge would be to ensure that each type of runner is only placing user data into the encrypted store, while ideally avoiding use of the encrypted area for non sensitive data.
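
As a rough illustration of how both options could be wired up at boot inside a runner VM (paths and sizes are arbitrary, and it assumes gocryptfs is present in the image):

```
# Illustrative boot-time setup: an ephemeral key that only lives in RAM
# (/dev/shm is tmpfs), a gocryptfs mount for job data, and a plain tmpfs
# mount as the RAM-disk alternative.
import secrets
import subprocess

KEY_FILE = "/dev/shm/runner.key"    # tmpfs: the key never touches disk
CIPHER_DIR = "/data/cipher"         # encrypted blobs stored here on disk
PLAIN_DIR = "/data/workspace"       # cleartext view the runner works in

def sh(cmd):
    subprocess.run(cmd, check=True)

def setup_encrypted_workspace():
    # Fresh random key on every boot; once the VM powers down, the on-disk
    # ciphertext is unrecoverable.
    with open(KEY_FILE, "w") as f:
        f.write(secrets.token_hex(32))
    sh(["mkdir", "-p", CIPHER_DIR, PLAIN_DIR])
    sh(["gocryptfs", "-init", "-q", "-passfile", KEY_FILE, CIPHER_DIR])
    sh(["gocryptfs", "-q", "-passfile", KEY_FILE, CIPHER_DIR, PLAIN_DIR])

def setup_ramdisk(path="/mnt/ramdisk", size="4G"):
    # tmpfs variant: data lives purely in RAM (assuming no swap is configured).
    sh(["mkdir", "-p", path])
    sh(["mount", "-t", "tmpfs", "-o", f"size={size}", "tmpfs", path])
```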

Author
Owner

Investigating the possibility of using encryption for runner VMs, the following points become apparent:

  • Docker has best performance when its underlying storage is ext4, btrfs, zfs, or xfs (there are some tradeoffs among these, but in general they are all better than the alternatives, which are the universally compatible fuse-overlayfs and vfs drivers)
  • Many runner jobs don't use Docker (the VM is the disposable unit instead), but it's an important use case and best to support it well
  • Using block device encryption (dm-crypt) or native filesystem encryption (fscrypt) is thus essential to providing good Docker performance with encryption
  • So far our micro VM kernel doesn't include either of the necessary drivers

With this in mind, I already have an example full VM image, built entirely in Docker, that can support kernel based encryption schemes. It's also configured with a solution to write all changes to the root filesystem into an encrypted overlay.

This solution raises a few more points:

  • Encrypted overlay on root is a nice way to eliminate any chance of data leaking to disk in an unencrypted format, but it doesn't solve the Docker problem
  • So an additional encrypted disk is needed for Docker, and ideally this is also where non-Docker runner data goes such that it's all in one pool (for GitHub Actions it goes under GITHUB_WORKSPACE by default, see https://docs.github.com/en/actions/learn-github-actions/variables); a sketch of this setup follows at the end of this comment
  • An interesting potential arises from this: since all data is encrypted ephemerally and the VM becomes fresh on each boot, these runners can be simply rebooted to come back as a new runner, reducing the need for workload management
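
For illustration, the extra encrypted Docker disk could be prepared at boot roughly like this in a full VM (device names are placeholders). Plain-mode dm-crypt keyed straight from /dev/urandom means there is no key material to manage, and the contents become unreadable after a reboot, which matches the "reboot to get a fresh runner" idea above:

```
# Sketch of an ephemeral encrypted disk for Docker, for a full VM where the
# dm-crypt kernel module is available.
import subprocess

DISK = "/dev/vdb"                    # the additional disk attached to the VM
MAPPER_NAME = "docker_crypt"
MAPPER_DEV = f"/dev/mapper/{MAPPER_NAME}"
MOUNTPOINT = "/var/lib/docker"       # Docker's default data-root

def sh(cmd):
    subprocess.run(cmd, check=True)

def setup_encrypted_docker_disk():
    # Plain dm-crypt mapping with a throwaway random key (never stored anywhere).
    sh(["cryptsetup", "open", "--type", "plain",
        "--key-file", "/dev/urandom", DISK, MAPPER_NAME])
    # ext4 on top keeps Docker on one of its preferred storage backends (overlay2).
    sh(["mkfs.ext4", "-q", MAPPER_DEV])
    sh(["mkdir", "-p", MOUNTPOINT])
    sh(["mount", MAPPER_DEV, MOUNTPOINT])
    # The runner's work directory (GITHUB_WORKSPACE) can also be pointed here
    # so that non-Docker job data lands in the same encrypted pool.
```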
despiegk added the
Roadmap
label 2024-03-21 07:25:51 +00:00
despiegk added the
Story
label 2024-03-21 07:44:41 +00:00
Owner

we will implement this; actually we are working on it for our own purposes. It will be using hero with, indeed, act_runner, gitea, ...
we will focus on Gitea first

despiegk added this to the next milestone 2024-06-02 04:52:07 +00:00
despiegk removed the
Roadmap
label 2024-07-28 07:53:38 +00:00
despiegk removed this from the next milestone 2024-07-28 07:53:41 +00:00
despiegk added this to the tfgrid_3_16 project 2024-07-28 07:53:45 +00:00
despiegk removed the
Story
label 2024-07-28 07:57:25 +00:00

Reference: tfgrid/circle_engineering#28