GitOps#
Subsystem Goal#
This subsystem provides the ability for tenants to define their own workloads using a GitOps-based approach.
Components in Use#
While working on this subsystem, we will introduce the following components:
- Flux - provides GitOps tooling to apply manifests defined in Git repositories into specified namespaces
- flux-gitlab-syncer - a custom service that watches for changes in GitRepository and Receiver objects to automate the sync of SSH keys and webhooks onto tenant repos
Background#
Why GitOps?#
When defining state in Kubernetes, there are two basic approaches that can be taken:
- Push-based - tooling outside of the cluster pushes changes into the cluster. This includes CI/CD pipelines, manual changes using kubectl, or a variety of other tools. Credentials are shared that provide this access and it is up to the individual operator to determine how to both maintain their desired state and when changes are applied.
- Pull-based - tooling inside the cluster applies the desired state that might be defined somewhere else, such as a git repository. In this model, the workflow is dictated by the tooling and credentials remain within the cluster.
It is this latter approach that GitOps takes - agents running in the cluster watch specified Git repositories and apply the manifests contained therein. This provides quite a few benefits, including:
- Automatic versioning - by leveraging git repositories, the desired state of a tenant's space is automatically versioned.
- Understood source of truth - by watching the git repositories, the agents treat the repos as the source of truth. If a cluster rebuild ever needs to occur, the same state can be redeclared.
- No credentials required - by simply using git repos, we don't have to figure out how to create, properly scope, and share credentials that provide write-access to the cluster.
- Support many team workflows - by simply indicating a place to drop manifests, we can support teams that need additional change management (code review), teams that want to use CI to push changes, or teams that want to update their manifests manually (as well as other workflows).
How Flux works#
Flux provides the ability for you to specify various sources and reconcilers. Simply put, a source is a location Flux should watch for manifests while the reconcilers define how those manifests should be applied. Flux splits the workload across various components.
As a simple example, the following GitRepository will tell Flux to fetch materials from our docs-getting-started repo. It'll do so once every 30 minutes (more on that soon) and use SSH credentials found in a secret named flux-ssh-credentials (more on that later too).
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: GitRepository
metadata:
  name: docs-getting-started
  namespace: platform-flux-tenant-config
spec:
  interval: 30m
  url: ssh://git@code.vt.edu/it-common-platform/tenants/aws-prod/docs-getting-started.git
  secretRef:
    name: flux-ssh-credentials
  ref:
    branch: master
Once that's applied, Flux will start fetching the manifests in the repo. But it needs to know how to apply those. That's where the Kustomization object comes in.
The following Kustomization will apply the manifests found at the root of the repo into the docs-getting-started namespace using a ServiceAccount named flux once per hour (more on that soon).
apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
kind: Kustomization
metadata:
  name: docs-getting-started
  namespace: docs-getting-started
spec:
  interval: 1h
  path: ./
  prune: true
  serviceAccountName: flux
  targetNamespace: docs-getting-started
  sourceRef:
    kind: GitRepository
    name: docs-getting-started
    namespace: platform-flux-tenant-config
Once that's applied, Flux will start applying the manifests it finds.
Explaining Intervals
Although the Kustomization says it'll run every hour, it will automatically run anytime its spec.sourceRef is updated. As new manifests are found, they are quickly applied.
Then what's the interval for? It means that manifests are re-applied once per hour, even if no changes were made to the source. This ensures that any manual changes or drift are automatically reverted within an hour.
Responding to Changes More Quickly#
Now that we understand how Flux observes and applies changes, how can we respond more quickly? While we could simply lower the interval, that would add quite a bit of pressure to GitLab and the K8s API. Instead, Flux provides the ability for us to define webhooks. We do so by defining a Receiver object.
The following Receiver indicates the desire for us to have a webhook that knows how to handle "push" and "tag push" events from GitLab. When this webhook is notified, we want it to trigger a sync event on the GitRepository named docs-getting-started.
apiVersion: notification.toolkit.fluxcd.io/v1beta2
kind: Receiver
metadata:
  name: docs-getting-started
  namespace: platform-flux-tenant-config
spec:
  type: gitlab
  events:
    - "Push Hook"
    - "Tag Push Hook"
  secretRef:
    name: flux-webhook-token
  resources:
    - apiVersion: source.toolkit.fluxcd.io/v1
      kind: GitRepository
      name: docs-getting-started
After this is defined, the Flux receiver controller will process the request and register an endpoint on its service. After doing so, it'll set the URL as part of the Receiver's status.
> kubectl describe receiver docs-getting-started
Name:         docs-getting-started
Namespace:    platform-flux-tenant-config
...
Status:
  Conditions:
    ...
  URL:  /hook/bbbc2310a34e...
With this URL configured, we can hit the endpoint using the hostname of the webhook-receiver Service in the platform-flux-system namespace. By configuring this on the manifest repo in GitLab, any changes made to the repo will notify the Receiver and cause the GitRepository to sync, find updated manifests, and deploy the changes using the Kustomization configuration. Cool, huh?
Automating the Configuration#
Since we want to make it easy to both add and remove tenants, we wanted to ease the process of configuring the SSH key and webhooks on the git repos. That's where the flux-gitlab-syncer comes in. Its job is two-fold:
- Watch for new/deleted GitRepository objects - as objects are added, it looks up the SSH key being used and adds it to the referenced git repo
- Watch for updated Receiver objects - once the Receiver has a URL configured, it'll look up the repo and add the webhook. If a Receiver is removed, it'll automatically remove the webhook URL
To give this service GitLab API access, we created a GitLab bot account that's a member of the tenants group in GitLab, letting it have the access it needs.
What's defined where?#
In order for the flux-gitlab-syncer to automate the config setup, it needs access to the SSH secret. That leaves us two options: give it cluster-wide access to secrets (yikes!) or put all of the GitRepository objects in the same namespace as the service.
We also wanted to ensure tenants have access to the Kustomization object, as it conveys any errors that occurred during application. Since the Kustomization objects can reference a GitRepository in another namespace, placing the GitRepository objects in the service's namespace and the Kustomization objects in the tenant namespaces felt like a good approach.
Securing the Deployment Process#
We did quite a bit of testing to figure out how to lock down the tenant deployment process. A few notes:
- Each Kustomization uses a scoped ServiceAccount - the SA used by the reconciler has permissions to only make changes within the tenant's namespace.
- Additional policy enforcement - we wrote policies that ensure the Kustomization has an explicit ServiceAccount defined (it defaults to the one used by the controller) and has a targetNamespace of its own namespace
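As a rough illustration of the first point, a scoped ServiceAccount might be paired with a namespaced RoleBinding along these lines. This is a sketch, not the platform's actual configuration: the docs-getting-started namespace and the flux ServiceAccount name are borrowed from the earlier example, while the RoleBinding name flux-tenant-admin and the choice of the built-in admin ClusterRole are assumptions made for illustration.

```yaml
# ServiceAccount that the Kustomization's serviceAccountName refers to.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: flux
  namespace: docs-getting-started
---
# A RoleBinding (rather than a ClusterRoleBinding) limits the grant
# to the namespace the binding itself lives in.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: flux-tenant-admin          # hypothetical name
  namespace: docs-getting-started
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: admin                      # assumption: the built-in admin role
subjects:
  - kind: ServiceAccount
    name: flux
    namespace: docs-getting-started
```

Because the binding is namespaced, the permissions in the referenced ClusterRole apply only to objects inside that one namespace, even though the role itself is defined cluster-wide.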
Deploying it Yourself#
To run this subsystem, we need to deploy Flux and its various components, define sources and reconcilers, and configure a webhook receiver.
Webhook Receivers on Local Clusters
Because we are running our tutorial environment on local clusters, GitLab won't be able to reach out to notify our webhook receivers. Despite that, we'll invoke the endpoints manually to demonstrate it working.
Deploying Flux#
Let's deploy Flux!
- Create the namespace the Flux components will run in. On the actual platform, we use the namespace platform-flux-system, but we had to manage the templates ourselves to support this. For simplicity's sake, we're going to simply use what Flux wants to use.
- Install Flux by using the command below. Unfortunately, they don't have a Helm chart to deploy the services.
- Validate that you see the pods start up on your machine. You should see output that looks like the following:
NAME                                          READY   STATUS    RESTARTS   AGE
helm-controller-7bc48949c4-kdbr2              1/1     Running   0          43s
image-automation-controller-54774798d-4pn4v   1/1     Running   0          43s
image-reflector-controller-8589c7df7d-d7ztb   1/1     Running   0          43s
kustomize-controller-55b457d666-nzkr6         1/1     Running   0          43s
notification-controller-6d5d78654-vqx58       1/1     Running   0          43s
source-controller-579fc5dfb-4grss             1/1     Running   0          43s
Creating and Configuring a Tenant Repo#
Now that Flux is deployed, let's set up the config namespace and define the config necessary to pull in manifests from a mock tenant repo.
- First, let's create the config namespace and a namespace for the tenant.
- Go to code.vt.edu and create a blank project in your personal namespace simply called sample-tenant. Make it private and initialize it with a README.
- In order for Flux to pull the manifests, we need to create an SSH key and configure it on the repo. We'll also perform a key scan that's needed to support SSH-based cloning. Run the following command:
docker run --rm -tiv "$(pwd):/tmp" alpine sh -c "apk add openssh && ssh-keygen -t rsa -f /tmp/tenant-repo-key -N '' && ssh-keyscan code.vt.edu > /tmp/known_hosts"
  This will create three files in your current directory, named tenant-repo-key, tenant-repo-key.pub, and known_hosts.
- Now, copy the contents of the tenant-repo-key.pub file and configure it as a deploy key on the repo (found at Settings -> Repository -> Deploy keys). You can give the key any name and it doesn't need write access.
- Before we can create the GitRepository object, we need to create a secret that contains the SSH key's private key, public key, and known_hosts config. Run the following command to create the secret:
- Now that all of that is set up, we can create the GitRepository object! First, define an env var (PID) that we can use for substitution for the repo path. Now, run the following command:
cat <<EOF | kubectl apply -f -
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: sample-tenant
  namespace: platform-flux-tenant-config
spec:
  interval: 1m
  url: ssh://git@code.vt.edu/${PID}/sample-tenant
  secretRef:
    name: flux-ssh-credentials
  ref:
    branch: main
EOF
  If we look at the GitRepository, we should see that it successfully fetched the resources. We have a very short interval for now, but can adjust that later. And you should see output similar to this:
NAME            URL                                           READY   STATUS                                                            AGE
sample-tenant   ssh://git@code.vt.edu/mikesir/sample-tenant   True    Fetched revision: main/2ca07ce1cf1b5bb08a0f164a54d8d9742cc0128c   96s
Flux can fetch and pull your resources!
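For reference, the two namespaces and the SSH secret from the steps above could also be expressed declaratively. The following is a sketch under assumptions: the namespace and secret names are taken from the GitRepository manifest above, the identity, identity.pub, and known_hosts keys are the ones Flux's source controller reads for SSH authentication, and the values shown are placeholders to be replaced with the contents of the generated files.

```yaml
# Namespace holding the GitRepository objects and their credentials.
apiVersion: v1
kind: Namespace
metadata:
  name: platform-flux-tenant-config
---
# Namespace the tenant's workloads are deployed into.
apiVersion: v1
kind: Namespace
metadata:
  name: sample-tenant
---
# SSH credentials referenced by the GitRepository's spec.secretRef.
apiVersion: v1
kind: Secret
metadata:
  name: flux-ssh-credentials
  namespace: platform-flux-tenant-config
type: Opaque
stringData:
  identity: |
    <contents of tenant-repo-key>
  identity.pub: |
    <contents of tenant-repo-key.pub>
  known_hosts: |
    <contents of known_hosts>
```

Since this manifest would contain a private key, it is better applied directly to the cluster than committed to a repository.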
Applying the Manifests#
Now that Flux can pull the manifests, let's configure it with a reconciler, which defines where and how to apply the manifests.
- Before defining the Kustomization object, we want to create a ServiceAccount for Flux to use. This will ensure the manifests can only be applied within the tenant namespace. Run the following command to create the ServiceAccount:
- Now that we have the ServiceAccount created, we need to give it the proper permissions. For simplicity, we're going to give it the admin role (defined as a ClusterRole), but limit it to the sample-tenant namespace. We do this by using a RoleBinding. Run the following command to create the RoleBinding:
- We're now ready to define the Kustomization itself! Run the following command to define it:
cat <<EOF | kubectl apply -f -
apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
kind: Kustomization
metadata:
  name: sample-tenant
  namespace: sample-tenant
spec:
  interval: 1h
  path: ./
  prune: true
  serviceAccountName: flux
  targetNamespace: sample-tenant
  sourceRef:
    kind: GitRepository
    name: sample-tenant
    namespace: platform-flux-tenant-config
EOF
  Let's check the Kustomization to see if it worked! You should see output that looks similar to the following:
NAME            READY   STATUS                                                            AGE
sample-tenant   True    Applied revision: main/2ca07ce1cf1b5bb08a0f164a54d8d9742cc0128c   6s
  It worked!
- Now, let's define a simple manifest in our tenant repo. Create a file named sample-app.yaml with the following contents (you can use the GitLab web interface to do so). This will define an entire application:
apiVersion: v1
kind: Service
metadata:
  name: sample-app
spec:
  selector:
    app: sample-app
  ports:
    - port: 3000
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-app
spec:
  selector:
    matchLabels:
      app: sample-app
  template:
    metadata:
      labels:
        app: sample-app
    spec:
      containers:
        - name: sample-app
          image: code.vt.edu:5005/it-common-platform/tenant-support/images/pilotfest-2021/first-image:latest
          ports:
            - name: http
              containerPort: 3000
          resources:
            requests:
              memory: 32Mi
              cpu: 50m
            limits:
              memory: 128Mi
              cpu: 500m
          livenessProbe:
            httpGet:
              path: /
              port: 3000
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: sample-app
spec:
  commonName: sample-app.localhost
  dnsNames:
    - sample-app.localhost
  secretName: sample-app-tls-cert
  issuerRef:
    kind: ClusterIssuer
    name: platform-internal-ca
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: sample-app
spec:
  rules:
    - host: sample-app.localhost
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: sample-app
                port:
                  number: 3000
  tls:
    - hosts:
        - sample-app.localhost
      secretName: sample-app-tls-cert
  Once you've committed this change, it might take up to a minute to apply, as our source is syncing only once per minute.
  Eventually, you should see a pod running in the sample-tenant namespace: You should see output similar to the following:
Configuring a Webhook Endpoint#
Now that we have a tenant workflow up and running, let's look at how to make things a little smoother. You might have noticed that we have the spec.interval on the GitRepository set to 1m, which is very tight and consumes quite a bit of resources if we're watching dozens (or even hundreds) of repos. By leveraging webhooks, we can poll far less often and instead notify Flux when it needs to fetch new changes.
- First, let's lengthen the interval on the GitRepository to once per hour. Run the following command to do that:
- To configure Flux to support webhook notifications, we have to define a Receiver. When this object is defined, the Flux notification system will provision a URL and update the object with the URL.
  The webhook endpoint needs a secret that defines a token. For GitLab webhooks, this can be provided to help "authenticate" the request. For simplicity here, we're going to use the generic type of webhook. Even though no token validation will be done, the secret is still required.
  Run the following command to define the Receiver:
TOKEN=$(head -c 12 /dev/urandom | shasum | cut -d ' ' -f1)
kubectl create secret generic -n platform-flux-tenant-config webhook-token --from-literal=token=$TOKEN
cat <<EOF | kubectl apply -f -
apiVersion: notification.toolkit.fluxcd.io/v1beta2
kind: Receiver
metadata:
  name: sample-tenant
  namespace: platform-flux-tenant-config
spec:
  type: generic
  secretRef:
    name: webhook-token
  resources:
    - apiVersion: source.toolkit.fluxcd.io/v1beta2
      kind: GitRepository
      name: sample-tenant
EOF
  We should now be able to see the configured Receiver: And you should see output similar to the following:
- Now, the question is... how do we actually hit the endpoint? All we need to do is define an Ingress for the webhook-receiver Service in the flux-system namespace. Then, we can hit the endpoint using our browser!
  Run the following command to define an Ingress. We're going to use the hostname flux-webhooks.localhost.
- Now, make a change to the manifest in your tenant repo. After you do that, trigger the webhook manually by running the following command:
WEBHOOK_ENDPOINT=$(kubectl get receiver -n platform-flux-tenant-config sample-tenant -o=jsonpath="{.status.url}")
curl http://flux-webhooks.localhost${WEBHOOK_ENDPOINT} --resolve flux-webhooks.localhost:80:127.0.0.1
You should see the changes roll out immediately! Cool, huh?
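For reference, the interval change and the Ingress from the steps above might look roughly like the following manifests. This is a sketch under assumptions: it presumes the webhook-receiver Service listens on port 80 (the default in Flux's install manifests) and that the cluster has a default ingress class. The Ingress name flux-webhooks is made up for illustration, and the repo path is a placeholder.

```yaml
# Same GitRepository as before, now polling hourly instead of every minute.
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: sample-tenant
  namespace: platform-flux-tenant-config
spec:
  interval: 1h
  url: ssh://git@code.vt.edu/<your-namespace>/sample-tenant   # placeholder path
  secretRef:
    name: flux-ssh-credentials
  ref:
    branch: main
---
# Exposes the notification controller's webhook-receiver Service.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: flux-webhooks              # hypothetical name
  namespace: flux-system
spec:
  rules:
    - host: flux-webhooks.localhost
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: webhook-receiver
                port:
                  number: 80       # assumption: default Service port
```

With the hourly interval in place, the webhook becomes the main trigger for fetching new commits, while the periodic poll acts as a fallback if a notification is missed.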
Wrapping Up#
Now that you've played with Flux a little bit, you can see how the various objects are used to configure the subsystem. When we go over the landlord, we'll see how it provides the ability to template out the resources.
Looping back to the flux-gitlab-syncer we mentioned earlier, its sole job is to sync the SSH keys and webhook URLs onto the manifest repos automatically for us. That way, it's one less step we have to perform!
What's next?#
Now that we can support tenant workloads for simple applications, how can we ensure one tenant doesn't perform actions that affect another? To answer that, we'll talk about policy enforcement!
Go to the Policy Enforcement subsystem now!
Common Troubleshooting Tips#
Why are the tenant's manifests not being applied?
- Are they actually not being applied or simply failing to apply? Look at the events on the Kustomization to find out. If the manifests are failing to apply, you'll see an error indicating why.
- Validate the GitRepository has pulled the latest commit from the repo.
- If not, validate the flux-gitlab-syncer is running and doesn't have any errors in the logs. If it's running, you should also see the SSH key and webhook configured on the manifest repo.