Kubernetes Workers
We provide a Kubernetes operator for managing Spacelift worker pools. This operator allows you to define WorkerPool
resources in your cluster, and allows you to scale these pools up and down using standard Kubernetes functionality.
Info
Previously we provided a Helm chart for deploying worker pools to Kubernetes using Docker-in-Docker. This approach is no longer recommended, and you should use the Kubernetes operator instead. Please see the section on migrating from Docker-in-Docker for more information.
A WorkerPool defines the number of Workers registered with Spacelift via the poolSize parameter. The Spacelift operator will automatically create and register a number of Worker resources in Kubernetes depending on your poolSize.
Info
Worker resources do not use up any cluster resources other than an entry in the Kubernetes API when they are idle. Pods are created on demand for Workers when scheduling messages are received from Spacelift. This means that in an idle state no additional resources are being used in your cluster other than what is required to run the controller component of the Spacelift operator.
Kubernetes version compatibility
The Spacelift controller is compatible with Kubernetes v1.26 and later. It may also work with older versions, but we do not guarantee compatibility or provide support for unmaintained Kubernetes versions.
Installation
Controller setup
To install the worker pool controller along with its CRDs, run the following command:
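As a minimal sketch, the install step can be wrapped in a small shell helper. The manifest URL is the one given in this guide; the helper name install_controller is hypothetical, and the command applies to whichever cluster your current kubectl context points at.

```shell
# Manifest URL as given in this guide.
MANIFESTS_URL="https://downloads.spacelift.io/kube-workerpool-controller/latest/manifests.yaml"

# install_controller is a hypothetical helper name, not part of any Spacelift tooling.
# It applies the controller Deployment and its CRDs to the current kubectl context.
install_controller() {
  kubectl apply -f "${MANIFESTS_URL}"
}
```

Run install_controller once your kubectl context points at the target cluster.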
Tip
You can download the manifests yourself from https://downloads.spacelift.io/kube-workerpool-controller/latest/manifests.yaml if you would like to inspect them or alter the Deployment configuration for the controller.
You can install the controller using the official spacelift-workerpool-controller Helm chart.
You can open values.yaml from the Helm chart repo for more customization options.
Warning
Helm does not currently support upgrading or deleting CRDs, so this needs to be done manually through Kubernetes. The latest CRDs can be found in this link.
Prometheus metrics
The controller also has a subchart for our prometheus-exporter project, which exposes metrics in the OpenMetrics format. This is useful for scaling workers based on queue length in Spacelift (the spacelift_worker_pool_runs_pending metric).
To install the controller with the prometheus-exporter subchart, use the following command:
See the values.yaml file for the subchart for more configuration options.
Create a Secret
Next, create a Secret containing the private key and token for your worker pool, generated earlier in this guide.
First, export the token and private key as base64-encoded strings:
Mac
Linux
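The following is a hedged sketch of one portable way to produce the encoded values, rather than the exact platform-specific commands. It pipes the base64 output through tr to strip any line wrapping, so the same helper should work on both macOS and Linux. The helper name encode_file is hypothetical.

```shell
# encode_file is a hypothetical helper: it base64-encodes a file onto a single line.
# Some base64 implementations wrap their output at a fixed width, so the newlines
# are stripped with tr to keep the value usable inside a Secret manifest.
encode_file() {
  base64 < "$1" | tr -d '\n'
}
```

For example, you could export the result of encode_file for your worker pool token file and private key file into two environment variables of your choosing, and reference those when creating the Secret.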
Then, create the secret.
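As an illustration only, a Secret manifest could be rendered with a small helper like the one below. The helper name render_secret, the Secret name, and the data keys token and privateKey are assumptions; the keys only need to match whatever your WorkerPool resource references.

```shell
# render_secret is a hypothetical helper that prints a Secret manifest to stdout.
# Arguments: <secret-name> <base64-token> <base64-private-key>
# The data keys "token" and "privateKey" are assumptions; they must match the
# keys referenced by your WorkerPool resource.
render_secret() {
  cat <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: $1
type: Opaque
data:
  token: $2
  privateKey: $3
EOF
}

# Example (placeholder names for the Secret and the environment variables):
#   render_secret my-workerpool "$SPACELIFT_WP_TOKEN" "$SPACELIFT_WP_PRIVATE_KEY" | kubectl apply -f -
```

Rendering to stdout first makes it easy to inspect the manifest before piping it to kubectl.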
Create a WorkerPool
Finally, create a WorkerPool resource using the following command:
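A minimal sketch of a WorkerPool manifest is shown below, written to a local file so it can be reviewed before it is applied. The apiVersion, the resource and Secret names, and the token/privateKey secretKeyRef layout are assumptions made for illustration; check them against the CRDs installed in your cluster (for example with kubectl explain workerpool.spec).

```shell
# Write a minimal WorkerPool manifest to a local file.
# Assumptions: apiVersion, names, and the secretKeyRef layout are illustrative
# and should be verified against the CRDs installed in your cluster.
cat > workerpool.yaml <<'EOF'
apiVersion: workers.spacelift.io/v1beta1
kind: WorkerPool
metadata:
  name: my-workerpool
spec:
  # Number of Workers to register with Spacelift.
  poolSize: 2
  token:
    secretKeyRef:
      name: my-workerpool
      key: token
  privateKey:
    secretKeyRef:
      name: my-workerpool
      key: privateKey
EOF
```

Once the file looks right, it can be applied with kubectl apply -f workerpool.yaml.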
Grant access to the Launcher image
During your Self-Hosted installation process, the Spacelift launcher image is uploaded to a private ECR repository in the AWS account that your Self-Hosted instance is installed into. This repository is called spacelift-launcher.
The launcher image is used during runs on Kubernetes workers to prepare the workspace for the run, and the Kubernetes cluster that you want to run your workers on needs to be able to pull that image for runs to succeed.
Some options for this include:
- If your Kubernetes cluster is running inside AWS, you can add a policy to your ECR to allow pulls from your cluster nodes.
- You can use one of the methods listed in the ECR private registry authentication guide.
- You can copy the image to a registry accessible by your cluster, and then set the spec.pod.launcherImage configuration option on your WorkerPool resource to point at it.
Info
You can deploy the controller globally (the default option) to monitor all namespaces, allowing worker pools in multiple namespaces, or restrict it to specific namespaces using the namespaces option in the Helm chart values. The namespace of the controller and workers themselves doesn't impact functionality.
That's it - the workers in your pool should connect to Spacelift, and you should be able to trigger runs!
Upgrade
Usually, upgrading the controller requires no special steps.
Some releases of the controller may include backward-incompatible changes. Instructions for upgrading to those specific versions are provided below.
Upgrading to 0.0.17
This release changes the way the controller exposes metrics by removing usage of the kube-rbac-proxy container.
You can find more context about the reason for this change in the Kubebuilder repository.
If the controller was installed from the compiled Kubernetes manifests using kubectl apply -f ..., you should first uninstall the current release before deploying the new one.
Warning
The command below will remove the CRDs, and therefore also your WorkerPool resources, from the cluster.
Before running it, make sure that you will be able to recreate them after the upgrade.
Then you can install the new controller version with the following command.
The CRDs have been updated in this version, and Helm does not update CRDs for us. Before upgrading to the latest version of the chart, run the following commands to upgrade the CRDs.
Once done, you can upgrade the chart as usual with helm upgrade.
Run Containers
When a run assigned to a Kubernetes worker is scheduled by Spacelift, the worker pool controller creates a new Pod to process the run. This Pod consists of the following containers:
- An init container called init, responsible for populating the workspace for the run.
- A launcher-grpc container that runs a gRPC server used by the worker for certain tasks like uploading the workspace between run stages, and notifying the worker when a user has requested that the run be stopped.
- A worker container that executes your run.
The init and launcher-grpc containers use the public.ecr.aws/spacelift/launcher:<version> container image published by Spacelift. By default, the Spacelift backend sends the correct value for <version> through to the controller for each run, guaranteeing that the run is pinned to a specific image version that is compatible with the Spacelift backend.
The worker container uses the runner image specified by your Spacelift stack.
Warning
You can use the spec.pod.launcherImage configuration option to pin the init and launcher-grpc containers to a specific version, but we do not typically recommend doing this because it means that your run Pods could become incompatible with the Spacelift backend as new versions are released.
Resource Usage
Kubernetes Controller
During normal operations, the worker pool controller's CPU and memory usage should be fairly stable. The main operation that can be resource intensive is scaling out a worker pool. Scaling out involves generating an RSA keypair for each worker, and is CPU-bound. If you notice performance issues when scaling out, it's worth giving the controller more CPU.
Run Pods
Resource requests and limits for the init, launcher-grpc and worker containers can be set via your WorkerPool definitions, like in the following example:
You can use the values above as a baseline to get started, but the exact values you need for your pool will depend on your individual circumstances. You should use monitoring tools to adjust these to values that make sense.
Warning
In general, we don't suggest setting very low CPU or memory limits for the init, grpc or worker containers, since doing so could affect the performance of runs, or even cause runs to fail. In particular, the resource usage of the worker container depends heavily on your workloads. For example, stacks with large numbers of Terraform resources may use more memory than smaller stacks.
Volumes
There are two volumes that are always attached to your run Pods:
- The workspace volume.
- The binaries cache volume.
Both of these volumes default to using emptyDir storage with no size limit. Spacelift workers will function correctly without using a custom configuration for these volumes, but there may be situations where you wish to change this default, for example:
- To prevent Kubernetes evicting your run Pods due to disk pressure (and therefore causing runs to fail).
- To support caching tool binaries (for example Terraform or OpenTofu) between runs.
Workspace Volume
The workspace volume is used to store the temporary workspace data needed for processing a run. This includes metadata about the run, along with your source code. The workspace volume does not need to be shared or persisted between runs, and for that reason we recommend using an Ephemeral Volume so that the volume is bound to the lifetime of the run, and will be destroyed when the run Pod is deleted.
The workspace volume can be configured via the spec.pod.workspaceVolume property, which accepts a standard Kubernetes volume definition. Here's an example of using an ephemeral AWS GP2 volume for storage:
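As a hedged sketch, the fragment below shows how a generic ephemeral volume backed by a gp2 storage class could be expressed under spec.pod.workspaceVolume. It uses the standard Kubernetes ephemeral volume syntax; the volume name, the requested size, and the availability of a storage class named gp2 in your cluster are assumptions.

```shell
# Write a WorkerPool spec fragment to a local file; merge it into your own
# WorkerPool definition rather than applying it on its own.
# Assumptions: volume name, storage size, and a storage class called "gp2".
cat > workspace-volume-fragment.yaml <<'EOF'
spec:
  pod:
    workspaceVolume:
      name: workspace
      # Generic ephemeral volume: the PVC is created with the run Pod and
      # deleted together with it.
      ephemeral:
        volumeClaimTemplate:
          spec:
            accessModes: ["ReadWriteOnce"]
            storageClassName: gp2
            resources:
              requests:
                storage: 1Gi
EOF
```

Because the claim is tied to the Pod's lifetime, no cleanup of leftover volumes should be needed after a run finishes.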
Binaries Cache Volume
The binaries cache volume is used to cache binaries (e.g. terraform and kubectl) across multiple runs. You can use an ephemeral volume for the binaries cache like with the workspace volume, but doing so will not result in any caching benefits. To be able to share the binaries cache with multiple run pods, you need to use a volume type that supports ReadWriteMany, for example AWS EFS.
To configure the binaries cache volume, you can use exactly the same approach as with the workspace volume. The only difference is that you should use the spec.pod.binariesCacheVolume property instead of spec.pod.workspaceVolume.
Custom Volumes
See the section on configuration for more details on how to configure these two volumes along with any additional volumes you require.
Configuration
The following example shows all the configurable options for a WorkerPool:
Configure a Docker daemon as a sidecar container
If you need a Docker daemon running as a sidecar, you can follow the example below.
Timeouts
There are two types of timeouts that you can set:
- The run timeout: this causes the run to fail if its duration exceeds a defined duration.
- The log output timeout: this causes the run to fail if no logs have been generated for a defined duration.
To configure the run timeout you need to configure two items: the activeDeadlineSeconds for the Pod, as well as the SPACELIFT_LAUNCHER_RUN_TIMEOUT for the worker container:
To configure the logs timeout you just need to add a single environment variable to the worker container:
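As a hedged sketch, a WorkerPool fragment that sets both timeouts could look like the following. The environment variable names and activeDeadlineSeconds come from this section; the spec.pod.workerContainer.env path, the duration format, and the values themselves are assumptions chosen for illustration.

```shell
# Write a WorkerPool spec fragment that sets both timeouts; merge it into your
# own WorkerPool definition.
# Assumptions: the workerContainer.env path and the "3600s"/"600s" duration
# format are illustrative, and the values are arbitrary examples.
cat > workerpool-timeouts.yaml <<'EOF'
spec:
  pod:
    # Kubernetes stops the run Pod once it has been active for this many seconds.
    activeDeadlineSeconds: 3600
    workerContainer:
      env:
        # Run timeout, kept in line with activeDeadlineSeconds above.
        - name: SPACELIFT_LAUNCHER_RUN_TIMEOUT
          value: "3600s"
        # Log output timeout.
        - name: SPACELIFT_LAUNCHER_LOGS_TIMEOUT
          value: "600s"
EOF
```

Keeping the Pod deadline and the run timeout at the same duration avoids one of them cutting a run short before the other applies.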
Network Configuration
Your cluster configuration needs to be set up to allow the controller and the scheduled pods to reach the internet. This is required to listen for new jobs from the Spacelift backend and report back status and run logs.
You can find the necessary endpoints to allow in the Network Security section.
Initialization Policies
Using an initialization policy is simple and requires three steps:
- Create a ConfigMap containing your policy.
- Attach the ConfigMap as a volume in the pod specification for your pool.
- Add an environment variable to the init container, telling it where to read the policy from.
First, create your policy:
Next, create a WorkerPool definition, configuring the ConfigMap as a volume, and setting the custom env var:
Using VCS Agents with Kubernetes Workers
Using VCS Agents with Kubernetes workers is simple, and uses exactly the same approach outlined in the VCS Agents section. To configure your VCS Agent environment variables in a Kubernetes WorkerPool, add them to the spec.pod.initContainer.env section, like in the following example:
Controller metrics
The workerpool controller does not expose any metrics by default.
You can set the --metrics-bind-address=:8443 flag to enable them and activate the Prometheus endpoint.
Once enabled, the controller exposes metrics using HTTPS and a self-signed certificate by default.
This endpoint is also protected using RBAC. If you use the Helm chart to deploy the controller, you can use the built-in metrics reader role to grant access.
You may also want to use a valid certificate for production workloads. You can mount your cert in the container to the following paths:
It's also possible to fully disable TLS on the metrics endpoint and have the controller export metrics over HTTP. To do so, set the --metrics-secure=false flag.
More information about metrics authentication and TLS configuration can be found in the Kubebuilder docs.
More information about the exposed metrics can be found by scraping the metrics endpoint; see an example below:
Helm
If you are using our Helm chart to deploy the controller, you can configure metrics by switching some boolean flags in values.yaml.
You can check the links in the comments below about how to secure your metrics endpoint.
Scaling a pool
To scale your WorkerPool, you can either edit the resource in Kubernetes, or use the kubectl scale command:
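A minimal sketch of the kubectl scale approach, wrapped in a small helper. The helper name scale_pool and the pool name in the example are hypothetical; the command sets the desired number of workers for the named WorkerPool in the current namespace.

```shell
# scale_pool is a hypothetical helper name.
# Arguments: <workerpool-name> <desired-pool-size>
scale_pool() {
  kubectl scale workerpools "$1" --replicas="$2"
}
```

For example, scale_pool my-workerpool 5 would request five workers for a pool named my-workerpool.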
Billing for Kubernetes Workers
Kubernetes workers are billed based on the number of provisioned workers that you have, exactly the same as for any of our other ways of running workers. What this means in practice is that you will be billed based on the number of workers defined by the poolSize of your WorkerPool, even when those workers are idle and not processing any runs.
Migrating from Docker-in-Docker
If you currently use our Docker-in-Docker Helm chart to run your worker pools, we recommend that you switch to our worker pool operator. For full details of how to install the operator and set up a worker pool, please see the installation section.
The rest of this section provides useful information to be aware of when switching over from the Docker-in-Docker approach to the operator.
Why migrate
There are a number of improvements with the Kubernetes operator over the previous Docker-in-Docker approach, including:
- The operator does not require privileged pods, unlike the Docker-in-Docker approach.
- The operator creates standard Kubernetes pods to handle runs. This provides advantages including Kubernetes being aware of the run workloads that are executing as well as the ability to use built-in Kubernetes functionality like service accounts and affinity.
- The operator only creates pods when runs are scheduled. This means that while your workers are idle, they are not running pods that are using up resources in your cluster.
- The operator can safely handle scaling down the number of workers in a pool while making sure that in-progress runs are not killed.
Deploying workers
One major difference between the Docker-in-Docker Helm chart and the new operator is that the new chart only deploys the operator, and not any workers. To deploy workers you need to create WorkerPool resources after the operator has been deployed. See the section on creating a worker pool for more details.
Testing both alongside each other
You can run the new operator alongside your existing Docker-in-Docker workers. In fact, you can even connect both to the same Spacelift worker pool. This allows you to test the operator to make sure everything is working before switching over.
Customizing timeouts
If you are currently using SPACELIFT_LAUNCHER_RUN_TIMEOUT or SPACELIFT_LAUNCHER_LOGS_TIMEOUT, please see the section on timeouts to find out how to achieve this with the operator.
Storage configuration
If you are using custom storage volumes, you can configure these via the spec.pod section of the WorkerPool resource. Please see the section on volumes for more information.
Pool size
In the Docker-in-Docker approach, the number of workers is controlled by the replicaCount value of the chart, which controls the number of replicas in the Deployment. In the operator approach, the pool size is configured by the spec.poolSize property. Please see the section on scaling for information about how to scale your pool up or down.
Troubleshooting
Listing WorkerPools and Workers
To list all of your WorkerPools, you can use the following command:
To list all of your Workers, use the following command:
To list the Workers for a specific pool, use the following command (replace <worker-pool-id> with the ID of the pool from Spacelift):
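The three listing operations above can be sketched as small helpers. The resource names workerpools and workers are used elsewhere in this guide; the helper names are hypothetical, and the label key workers.spacelift.io/workerpool used to filter by pool is an assumption that you should confirm against the labels on your own Worker resources (for example with kubectl get workers --show-labels).

```shell
# Hypothetical helper names; each wraps a single kubectl query.
list_pools() {
  kubectl get workerpools
}

list_workers() {
  kubectl get workers
}

# Argument: <worker-pool-id> (the ID of the pool from Spacelift).
# The label key below is an assumption; verify it with --show-labels.
list_pool_workers() {
  kubectl get workers -l "workers.spacelift.io/workerpool=$1"
}
```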
Listing run pods
When a run is scheduled, a new pod is created to process that run. It's important to note that a single worker can only process a single run at a time, making it easy to find pods by run or worker IDs.
To list the pod for a specific run, use the following command (replacing <run-id> with the ID of the run):
To find the pod for a particular worker, use the following command (replacing <worker-id> with the ID of the worker):
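As a hedged sketch, both lookups can be expressed as label selectors. The workers.spacelift.io/run-id label is the one used in the support commands later in this guide; the workers.spacelift.io/worker label key and the helper names are assumptions.

```shell
# Hypothetical helper names.
# Argument: <run-id>
pod_for_run() {
  kubectl get pods -l "workers.spacelift.io/run-id=$1"
}

# Argument: <worker-id>
# The label key below is an assumption; check the labels on a run pod to confirm it.
pod_for_worker() {
  kubectl get pods -l "workers.spacelift.io/worker=$1"
}
```

Add a --namespace flag to these commands if your worker pool does not live in your current namespace.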
Workers not connecting to Spacelift
If you have created a WorkerPool in Kubernetes but no workers have shown up in Spacelift, use kubectl get workerpools to view your pool:
If the actual pool size for your pool is not populated, it typically indicates an issue with your pool credentials. The first thing to do is to use kubectl describe to inspect your pool and check for any events indicating errors:
In the example above, we can see that the private key for the pool is invalid.
If the WorkerPool events don't provide any useful information, another option is to take a look at the logs for the controller pod using kubectl logs, for example:
For example, if your token is invalid, you may find a log entry similar to the following:
Another common reason that can cause workers to fail to connect with Spacelift is network or firewall rules blocking connections to AWS IoT Core. Please see our network security section for more details on the networking requirements for workers.
Run not starting
If a run is scheduled to a worker but it gets stuck in the preparing phase for a long time, it may be caused by various issues like CPU or memory limits that are too low, or not being able to pull the stack's runner image. The best option in this scenario is to find the run pod and describe it to find out what's happening.
For example, in the following scenario, we can use kubectl get pods to discover that the run pod is stuck in ImagePullBackOff, meaning that it is unable to pull one of its container images:
If we describe that pod, we can get more details about the failure:
In this case, we can see that the problem is that the someone/non-existent-image:1234 container image cannot be pulled, meaning that the run can't start. In this situation the fix would be to add the correct authentication to allow your Kubernetes cluster to pull the image, or to adjust your stack settings to refer to the correct image if it is wrong.
Similarly, if the memory limits you specify for one of the containers in the run pod are too low, Kubernetes may end up killing it. You can find this out in exactly the same way:
Getting help with run issues
If you're having trouble understanding why a run isn't starting, is failing, or is hanging, and want to reach out for support, please include the output of the following commands (replacing the relevant IDs/names as well as specifying the namespace of your worker pool):
kubectl get pods --namespace <worker-pool-namespace> -l "workers.spacelift.io/run-id=<run-id>"
kubectl describe pods --namespace <worker-pool-namespace> -l "workers.spacelift.io/run-id=<run-id>"
kubectl logs --namespace <worker-pool-namespace> -l "workers.spacelift.io/run-id=<run-id>" --all-containers --prefix --timestamps
kubectl events --namespace <worker-pool-namespace> workers/<worker-name> -o json
Please also include your controller logs from 10 minutes before the run started. You can do this using the --since-time flag, like in the following example:
kubectl logs -n spacelift-worker-controller-system spacelift-worker-controllercontroller-manager-6f974d9b6d-kx566 --since-time="2024-04-02T09:00:00Z" --all-containers --prefix --timestamps
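To make the run-specific output easier to gather, the four commands listed above can be bundled into one helper that saves each output to a file. This is a minimal sketch: the helper name collect_run_diagnostics, the output directory, and the file names are hypothetical, and the kubectl invocations simply mirror the ones in this section. Controller logs are not included, because the controller pod name differs per installation.

```shell
# collect_run_diagnostics is a hypothetical helper that runs the run-specific
# commands from this section and writes each output to its own file.
# Arguments: <namespace> <run-id> <worker-name> [output-dir]
collect_run_diagnostics() {
  local ns="$1" run_id="$2" worker="$3" out="${4:-./spacelift-diagnostics}"
  local selector="workers.spacelift.io/run-id=${run_id}"
  mkdir -p "$out"
  kubectl get pods --namespace "$ns" -l "$selector" > "$out/pods.txt"
  kubectl describe pods --namespace "$ns" -l "$selector" > "$out/describe.txt"
  kubectl logs --namespace "$ns" -l "$selector" \
    --all-containers --prefix --timestamps > "$out/logs.txt"
  kubectl events --namespace "$ns" "workers/${worker}" -o json > "$out/events.json"
}
```

The resulting directory can then be attached to a support request together with the controller logs.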
Custom runner images
Please note that if you are using a custom runner image for your stack, it must include a Spacelift user with a UID of 1983. If your image does not include this user, it can cause permission issues during runs, for example while trying to write out configuration files while preparing the run.
Please see our instructions on customizing the runner image for more information.
Inspecting successful run pods
By default, the operator deletes the pods for successful runs as soon as they complete. If you need to inspect a pod after the run has completed successfully for debugging purposes, you can enable spec.keepSuccessfulPods:
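As an alternative to editing the manifest, the option can be toggled on an existing pool with a merge patch. This is a sketch under assumptions: the helper name keep_successful_pods is hypothetical, and it relies only on the spec.keepSuccessfulPods field named above.

```shell
# keep_successful_pods is a hypothetical helper name.
# Arguments: <workerpool-name> <true|false>
# It applies a JSON merge patch that sets spec.keepSuccessfulPods on the pool.
keep_successful_pods() {
  kubectl patch workerpool "$1" --type merge -p "{\"spec\":{\"keepSuccessfulPods\":$2}}"
}
```

Passing false as the second argument restores the default behaviour once you have finished debugging.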
Networking issues caused by Pod identity
When a run is assigned to a worker, the controller creates a new Pod to process that run. The Pod has labels indicating the worker and run ID, and looks something like this:
Because the set of labels is unique for each run being processed, this can cause problems with systems like Cilium that use Pod labels to determine the identity of each Pod, leading to your runs having networking issues. If you are using a system like this, you may want to exclude the workers.spacelift.io/* labels from being used to determine network identity.