GitOps#

Subsystem Goal#

This subsystem provides the ability for tenants to define their own workloads using a GitOps-based approach.

Components in Use#

While working on this subsystem, we will introduce the following components:

Flux - provides GitOps tooling to apply manifests defined in Git repositories into specified namespaces
flux-gitlab-syncer - a custom service that watches for changes in GitRepository and Receiver objects to automate the sync of SSH keys and webhooks onto tenant repos

Background#

Why GitOps?#

When defining state in Kubernetes, there are two basic approaches that can be taken:

Push-based - tooling outside of the cluster pushes changes into the cluster. This includes CI/CD pipelines, manual changes using kubectl, or a variety of other tools. Credentials are shared that provide this access and it is up to the individual operator to determine how to both maintain their desired state and when changes are applied.
Pull-based - tooling inside the cluster applies the desired state that might be defined somewhere else, such as a git repository. In this model, the workflow is dictated by the tooling and credentials remain within the cluster.

It is this latter approach that GitOps takes: agents running in the cluster watch specified Git repositories and apply the manifests contained therein. This provides quite a few benefits, including:

Automatic versioning - by leveraging git repositories, the desired state of a tenant's space is automatically versioned.
Understood source of truth - by watching the git repositories, the agents treat the repos as the source of truth. If a cluster rebuild ever needs to occur, the same state can be redeclared.
No credentials required - by simply using git repos, we don't have to figure out how to create, properly scope, and share credentials that provide write-access to the cluster.
Support many team workflows - by simply indicating a place to drop manifests, we can support teams that need additional change management (code review), teams that want to use CI to push changes, or teams that want to update their manifests manually (as well as other workflows).

How Flux works#

Flux provides the ability for you to specify various sources and reconcilers. Simply put, a source is a location Flux should watch for manifests while the reconcilers define how those manifests should be applied. Flux splits the workload across various components.

Figure: The Flux system. Sources and Kustomizations are defined using the Kubernetes API, after which various controllers fetch the source and apply the located manifests.

As a simple example, the following GitRepository will tell Flux to fetch materials from our docs-getting-started repo. It'll do so once every 30 minutes (more on that soon) and use SSH credentials found in a secret named flux-ssh-credentials (more on that later too).

apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: docs-getting-started
  namespace: platform-flux-tenant-config
spec:
  interval: 30m
  url: ssh://git@code.vt.edu/it-common-platform/tenants/aws-prod/docs-getting-started.git
  secretRef:
    name: flux-ssh-credentials
  ref:
    branch: master

Once that's applied, Flux will start fetching the manifests in the repo. But, it needs to know how to apply those. That's where the Kustomization object comes in. The following Kustomization will apply the manifests found at the root of the repo into the docs-getting-started namespace using a ServiceAccount named flux once per hour (more on that soon).

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: docs-getting-started
  namespace: docs-getting-started
spec:
  interval: 1h
  path: ./
  prune: true
  serviceAccountName: flux
  targetNamespace: docs-getting-started
  sourceRef:
    kind: GitRepository
    name: docs-getting-started
    namespace: platform-flux-tenant-config

Once that's applied, Flux will start applying the manifests it finds.

Explaining Intervals

Although the Kustomization says it'll run every hour, it will also run whenever the source named in its spec.sourceRef produces a new revision. As new manifests are found, they are quickly applied.

Then what's the interval for? It means that manifests are re-applied once per hour, even if no changes were made to the source. This ensures that any manual changes or drift are automatically reverted within an hour.
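
If you don't want to wait for the interval, Flux controllers also react to a reconcile request annotation on the object. The helper below is a minimal sketch of that idea; the function name request_reconcile is our own invention, and it assumes kubectl is pointed at a cluster running Flux.

```shell
# Ask Flux to reconcile an object now instead of waiting for its interval.
# Usage: request_reconcile <kind> <name> <namespace>
request_reconcile() {
  local kind="$1" name="$2" namespace="$3"
  # Flux controllers watch this annotation; a new value requests a reconcile.
  kubectl annotate --overwrite "$kind/$name" -n "$namespace" \
    "reconcile.fluxcd.io/requestedAt=$(date +%s)"
}

# Example: re-apply the tenant's manifests right away.
# request_reconcile kustomization docs-getting-started docs-getting-started
```

This is handy while experimenting, but the interval remains the safety net that reverts drift when nobody is watching.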

Responding to Changes More Quickly#

Now that we understand how Flux observes and applies changes, how can we respond more quickly? While we could simply lower the interval, that would add quite a bit of pressure to GitLab and the K8s API. Instead, Flux provides the ability for us to define webhooks. We do so by defining a Receiver object.

The following Receiver indicates the desire for us to have a webhook that knows how to handle "push" and "tag push" events from GitLab. When this webhook is notified, we want it to trigger a sync event on the GitRepository named docs-getting-started.

apiVersion: notification.toolkit.fluxcd.io/v1
kind: Receiver
metadata:
  name: docs-getting-started
  namespace: platform-flux-tenant-config
spec:
  type: gitlab
  events:
    - "Push Hook"
    - "Tag Push Hook"
  secretRef:
    name: flux-webhook-token
  resources:
    - apiVersion: source.toolkit.fluxcd.io/v1
      kind: GitRepository
      name: docs-getting-started

After this is defined, the Flux notification controller will process the Receiver and register an endpoint on its service. After doing so, it'll set the URL as part of the Receiver's status.

> kubectl describe receiver docs-getting-started -n platform-flux-tenant-config
Name:         docs-getting-started
Namespace:    platform-flux-tenant-config
...
Status:
  Conditions:
    ...
  URL:  /hook/bbbc2310a34e...

With this URL configured, we can hit the endpoint using the hostname of the webhook-receiver Service in the platform-flux-system namespace. By configuring this on the manifest repo in GitLab, any changes made to the repo will notify the Receiver and cause the GitRepository to sync, find updated manifests, and deploy the changes using the Kustomization configuration. Cool, huh?
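
As a rough sketch of how the full webhook address is put together, the helper below reads the generated path from the Receiver and appends it to a base URL. The function name and the example hostname are hypothetical. It also assumes the path is published under status.webhookPath, as in the v1 Receiver API; older API versions exposed it as status.url.

```shell
# Print the full webhook URL for a Receiver.
# Usage: receiver_webhook_url <base-url> <receiver-name> [namespace]
receiver_webhook_url() {
  local base_url="${1%/}"   # drop a single trailing slash, if present
  local name="$2"
  local namespace="${3:-platform-flux-tenant-config}"
  local path
  path="$(kubectl get receiver "$name" -n "$namespace" \
    -o jsonpath='{.status.webhookPath}')"
  if [ -z "$path" ]; then
    echo "receiver $namespace/$name has no webhook path yet" >&2
    return 1
  fi
  printf '%s%s\n' "$base_url" "$path"
}

# Example (hypothetical hostname):
# receiver_webhook_url https://flux-webhooks.example.test docs-getting-started
```

The resulting URL is what gets configured as the webhook on the manifest repo in GitLab.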

Automating the Configuration#

Since we want to make it easy to both add and remove tenants, we wanted to ease the process of configuring the SSH key and webhooks on the git repos. That's where the flux-gitlab-syncer comes in. Its job is two-fold:

Watch for new/deleted GitRepository objects - as objects are added, it looks up the SSH key being used and adds it to the referenced git repo
Watch for updated Receiver objects - once the Receiver has a URL configured, it'll look up the repo and add the webhook. If a Receiver is removed, it'll automatically remove the webhook URL

To give this service GitLab API access, we created a GitLab bot account that's a member of the tenants group in GitLab, letting it have the access it needs.
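
To make the two jobs more concrete, here is a sketch of the kind of GitLab REST API calls involved. This is not the actual flux-gitlab-syncer implementation. The function names, the GITLAB_TOKEN variable, and the assumption that the API lives at https://code.vt.edu/api/v4 are ours; the deploy key and project hook endpoints are standard GitLab API.

```shell
# Base URL of the GitLab API (assumed location for code.vt.edu).
GITLAB_API="${GITLAB_API:-https://code.vt.edu/api/v4}"

# URL-encode a project path such as "group/sub/repo" for use as a project ID.
gitlab_project_id() {
  printf '%s' "$1" | sed 's|/|%2F|g'
}

# Register a read-only deploy key on a project.
# Usage: gitlab_add_deploy_key <project-path> <title> <public-key-file>
gitlab_add_deploy_key() {
  curl --fail --silent --show-error --request POST \
    --header "PRIVATE-TOKEN: ${GITLAB_TOKEN}" \
    --data-urlencode "title=$2" \
    --data-urlencode "key=$(cat "$3")" \
    --data "can_push=false" \
    "${GITLAB_API}/projects/$(gitlab_project_id "$1")/deploy_keys"
}

# Register a webhook that fires on push and tag push events.
# Usage: gitlab_add_webhook <project-path> <webhook-url> <secret-token>
gitlab_add_webhook() {
  curl --fail --silent --show-error --request POST \
    --header "PRIVATE-TOKEN: ${GITLAB_TOKEN}" \
    --data-urlencode "url=$2" \
    --data-urlencode "token=$3" \
    --data "push_events=true" \
    --data "tag_push_events=true" \
    "${GITLAB_API}/projects/$(gitlab_project_id "$1")/hooks"
}
```

A real syncer also has to handle removal and avoid adding duplicates, which this sketch leaves out.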

What's defined where?#

In order for the flux-gitlab-syncer to automate the config setup, it needs access to the SSH secret. That leaves us two options: give it cluster-wide access to secrets (yikes!) or put all of the GitRepository objects in the same namespace as the service. We chose the second option.

We also wanted to ensure tenants have access to the Kustomization object, as it conveys any errors that occurred during application. Since the Kustomization objects can reference a GitRepository in another namespace, this felt like a good approach.

flowchart LR
  subgraph cluster [Cluster]
    subgraph config [platform-flux-tenant-config Namespace]
      FW[Flux GitLab Syncer]-->G1[GitRepository 1]
      FW[Flux GitLab Syncer]-->G2[GitRepository 2]
    end
    subgraph tenant1 [Tenant 1 Namespace]
      K1[Kustomization]-->G1
      K1-->M1[Manifests]
    end
    subgraph tenant2 [Tenant 2 Namespace]
      K2[Kustomization]-->G2
      K2-->M2[Manifests]
    end
  end
  subgraph gitlab [code.vt.edu]
    G1-->C1[Manifest Repo for Tenant 1]
    G2-->C2[Manifest Repo for Tenant 2]
  end

Securing the Deployment Process#

We did quite a bit of testing to figure out how to lock down the tenant deployment process. A few notes:

Each Kustomization uses a scoped ServiceAccount - the SA used by the reconciler has permissions to only make changes within the tenant's namespace.
Additional policy enforcement - we wrote policies that ensure the Kustomization has an explicit ServiceAccount defined (it defaults to the one used by the controller) and has a targetNamespace of its own namespace
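
To illustrate what such a policy could look like, here is a sketch using Kubernetes' built-in ValidatingAdmissionPolicy (admissionregistration.k8s.io/v1). This is an assumption made for illustration only: the platform's actual policies may be written for a different policy engine, and every name below is hypothetical.

```shell
# Print a ValidatingAdmissionPolicy and binding for Flux Kustomizations.
kustomization_policy_manifest() {
  cat <<'EOF'
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: flux-kustomization-tenant-scope
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
      - apiGroups: ["kustomize.toolkit.fluxcd.io"]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["kustomizations"]
  validations:
    # Reject Kustomizations that would fall back to the controller's identity.
    - expression: "has(object.spec.serviceAccountName) && object.spec.serviceAccountName != ''"
      message: "spec.serviceAccountName must be set explicitly"
    # Reject Kustomizations that target a namespace other than their own.
    - expression: "has(object.spec.targetNamespace) && object.spec.targetNamespace == object.metadata.namespace"
      message: "spec.targetNamespace must match the Kustomization's namespace"
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: flux-kustomization-tenant-scope
spec:
  policyName: flux-kustomization-tenant-scope
  validationActions: ["Deny"]
EOF
}

# Review the manifest, then apply it:
# kustomization_policy_manifest | kubectl apply -f -
```

As written, the binding applies to every namespace. A real deployment would likely scope it to tenant namespaces, for example with spec.matchResources.namespaceSelector on the binding.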

Deploying it Yourself#

To run this subsystem, we need to deploy Flux and its various components, define sources and reconcilers, and configure a webhook receiver.

Webhook Receivers on Local Clusters

Because we are running our tutorial environment on local clusters, GitLab won't be able to reach our webhook receivers. Instead, we'll invoke the endpoints manually to demonstrate how they work.

Deploying Flux#

Let's deploy Flux! Fortunately, it's as easy as applying the install manifest published with each Flux release!

Create the namespace the Flux components will run in. On the actual platform, we use the namespace platform-flux-system. But, we had to manage the templates ourselves to support this. For simplicity's sake, we're going to simply use what Flux wants to use.
```
kubectl create namespace flux-system
```
Install Flux by using the command below. The Flux project publishes its components as a single install manifest rather than a Helm chart, so we apply that manifest directly.
```
kubectl apply -f https://github.com/fluxcd/flux2/releases/latest/download/install.yaml
```

Validate that you see the pods start up on your machine:

kubectl get pods -n flux-system

You should see output that looks like the following:

NAME                                          READY   STATUS    RESTARTS   AGE
helm-controller-7bc48949c4-kdbr2              1/1     Running   0          43s
image-automation-controller-54774798d-4pn4v   1/1     Running   0          43s
image-reflector-controller-8589c7df7d-d7ztb   1/1     Running   0          43s
kustomize-controller-55b457d666-nzkr6         1/1     Running   0          43s
notification-controller-6d5d78654-vqx58       1/1     Running   0          43s
source-controller-579fc5dfb-4grss             1/1     Running   0          43s

Creating and Configuring a Tenant Repo#

Now that Flux is deployed, let's set up the config namespace and define the config necessary to pull in manifests from a mock tenant repo.

First, let's create the config namespace and a namespace for the tenant.

kubectl create namespace platform-flux-tenant-config
kubectl create namespace sample-tenant

Go to code.vt.edu and create a blank project in your personal namespace simply called sample-tenant. Make it private and initialize it with a README.
In order for Flux to pull the manifests, we need to create an SSH key and configure it on the repo. We'll also perform a key scan that's needed to support SSH-based cloning. Run the following command:
```
docker run --rm -tiv "$(pwd):/tmp" alpine sh -c "apk add openssh && ssh-keygen -t rsa -f /tmp/tenant-repo-key -N '' && ssh-keyscan code.vt.edu > /tmp/known_hosts"
```
This will create three files in your current directory, named tenant-repo-key, tenant-repo-key.pub and known_hosts.
Now, copy the contents of the tenant-repo-key.pub file and configure it as a deploy key on the repo (found at Settings -> Repository -> Deploy keys). You can give the key any name and it doesn't need write access.
Before we can create the GitRepository object, we need to create a secret that contains the SSH key's private key, public key, and known_hosts config. Run the following command to create the secret:
```
kubectl create secret generic flux-ssh-credentials -n platform-flux-tenant-config --from-file=identity=tenant-repo-key --from-file=identity.pub=tenant-repo-key.pub --from-file=known_hosts
```
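
Before creating the GitRepository, it can save some debugging to confirm that the deploy key and known_hosts file really can read the repo. The helper below is a small sketch of such a check; the function name is hypothetical, and it assumes git and ssh are installed locally and that the file paths contain no spaces.

```shell
# Confirm the deploy key can read the repo before handing it to Flux.
# Usage: check_deploy_key <repo-url> [key-file] [known-hosts-file]
check_deploy_key() {
  local repo_url="$1"
  local key_file="${2:-tenant-repo-key}"
  local known_hosts="${3:-known_hosts}"
  # Use only the generated key and the scanned host keys for this one call.
  GIT_SSH_COMMAND="ssh -i $key_file -o IdentitiesOnly=yes -o UserKnownHostsFile=$known_hosts" \
    git ls-remote --heads "$repo_url"
}

# Example, run from the directory that holds the generated files:
# check_deploy_key "ssh://git@code.vt.edu/<your-pid>/sample-tenant"
```

If the key is configured correctly, the command lists the repo's branches. A permission or host key error here would show up later as a failing GitRepository.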

Now that all of that is set up, we can create the GitRepository object! First, define an env var that we can use for substitution for the repo path:

export PID=<your-pid>

Now, run the following command:

cat <<EOF | kubectl apply -f -
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: sample-tenant
  namespace: platform-flux-tenant-config
spec:
  interval: 1m
  url: ssh://git@code.vt.edu/${PID}/sample-tenant
  secretRef:
    name: flux-ssh-credentials
  ref:
    branch: main
EOF

If we look at the GitRepository, we should see that it successfully fetched the resources. We have a very short interval for now, but can adjust that later.

kubectl get gitrepositories -n platform-flux-tenant-config

And you should see output similar to this:

NAME            URL                                           READY   STATUS                                                            AGE
sample-tenant   ssh://git@code.vt.edu/mikesir/sample-tenant   True    Fetched revision: main/2ca07ce1cf1b5bb08a0f164a54d8d9742cc0128c   96s

Flux can fetch and pull your resources!

Applying the Manifests#

Now that Flux can pull the manifests, let's configure it with a reconciler, which defines where and how to apply the manifests.

Before defining the Kustomization object, we want to create a ServiceAccount for Flux to use. This will ensure the manifests can only be applied within the tenant namespace.

Run the following command to create the ServiceAccount:
```
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
  name: flux
  namespace: sample-tenant
EOF
```

Now that we have the ServiceAccount created, we need to give it the proper permissions. For simplicity, we're going to give it the admin role (defined as a ClusterRole), but limit it to the sample-tenant namespace. We do this by using a RoleBinding.

Run the following command to create the RoleBinding:

cat <<EOF | kubectl apply -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: flux-rb
  namespace: sample-tenant
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: admin
subjects:
  - name: flux
    namespace: sample-tenant
    kind: ServiceAccount
EOF
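
To check that the binding scopes the ServiceAccount the way we intend, we can ask the API server directly by impersonating it with kubectl auth can-i. The wrapper below is a small sketch; the function name is hypothetical, and your own user needs permission to impersonate ServiceAccounts (cluster admins on a local cluster typically have it).

```shell
# Ask the API server whether the tenant's flux ServiceAccount may do something.
# Usage: flux_sa_can_i <verb> <resource> <namespace>
flux_sa_can_i() {
  kubectl auth can-i "$1" "$2" \
    --as="system:serviceaccount:sample-tenant:flux" \
    --namespace="$3"
}

# Inside the tenant namespace, the admin role should allow this:
# flux_sa_can_i create deployments sample-tenant
# Outside of it, the same request should be denied:
# flux_sa_can_i create deployments default
```

kubectl prints yes or no for each question, which makes it easy to spot a binding that is broader than expected.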

We're now ready to define the Kustomization itself! Run the following command to define it:

cat <<EOF | kubectl apply -f -
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: sample-tenant
  namespace: sample-tenant
spec:
  interval: 1h
  path: ./
  prune: true
  serviceAccountName: flux
  targetNamespace: sample-tenant
  sourceRef:
    kind: GitRepository
    name: sample-tenant
    namespace: platform-flux-tenant-config
EOF

Let's check the Kustomization to see if it worked!

kubectl get kustomizations -n sample-tenant

You should see output that looks similar to the following:

NAME            READY   STATUS                                                            AGE
sample-tenant   True    Applied revision: main/2ca07ce1cf1b5bb08a0f164a54d8d9742cc0128c   6s

It worked!

Now, let's define a simple manifest in our tenant repo. Create a file named sample-app.yaml with the following contents (you can use the GitLab web interface to do so). This defines an entire application: a Service, a Deployment, a Certificate, and an Ingress.

apiVersion: v1
kind: Service
metadata:
  name: sample-app
spec:
  selector:
    app: sample-app
  ports:
    - port: 3000
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-app
spec:
  selector:
    matchLabels:
      app: sample-app
  template:
    metadata:
      labels:
        app: sample-app
    spec:
      containers:
        - name: sample-app
          image: code.vt.edu:5005/it-common-platform/tenant-support/images/pilotfest-2021/first-image:latest
          ports:
            - name: http
              containerPort: 3000
          resources:
            requests:
              memory: 32Mi
              cpu: 50m
            limits:
              memory: 128Mi
              cpu: 500m
          livenessProbe:
            httpGet:
              path: /
              port: 3000
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: sample-app
spec:
  commonName: sample-app.localhost
  dnsNames:
    - sample-app.localhost
  secretName: sample-app-tls-cert
  issuerRef:
    kind: ClusterIssuer
    name: platform-internal-ca
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: sample-app
spec:
  rules:
    - host: sample-app.localhost
      http:
        paths:
        - path: /
          pathType: Prefix
          backend:
            service:
              name: sample-app
              port: 
                number: 3000
  tls:
  - hosts:
      - sample-app.localhost
    secretName: sample-app-tls-cert

Once you've committed this change, it might take up to a minute to be applied, as our source syncs only once per minute.

Eventually, you should see a pod running in the sample-tenant namespace:

kubectl get pods -n sample-tenant

You should see output similar to the following:

NAME                          READY   STATUS    RESTARTS   AGE
sample-app-777cf7f5cd-jb7cp   1/1     Running   0          110s

Configuring a Webhook Endpoint#

Now that we have a tenant workflow up and running, let's look at how to make things a little smoother. You might have noticed that we have the spec.interval on the GitRepository set to 1m, which is very tight and takes quite a bit of resources if we're watching dozens (or even hundreds) of repos. By leveraging webhooks, we can poll far less often and instead notify Flux when it needs to fetch new changes.

First, let's turn down the interval on the GitRepository to once per hour. Run the following command to do that:

kubectl patch -n platform-flux-tenant-config gitrepository sample-tenant -p '{"spec":{"interval":"1h"}}' --type=merge

To configure Flux to support webhook notifications, we have to define a Receiver. When this object is defined, the Flux notification system will provision a URL and update the object with the URL.

The webhook endpoint needs a secret that defines a token. For GitLab webhooks, this can be provided to help "authenticate" the request. For simplicity here, we're going to use the generic type of webhook. Even though no token validation will be done, the secret is still required.
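
For comparison, here is a sketch of what a request to a Receiver of type gitlab looks like. GitLab sends the configured secret in the X-Gitlab-Token header and names the event in X-Gitlab-Event, and Flux checks both against the Receiver. The function name is hypothetical, and the empty JSON body stands in for the much larger payload GitLab really sends.

```shell
# Send a request shaped like a GitLab push webhook to a gitlab-type Receiver.
# Usage: send_gitlab_push <webhook-url> <token>
send_gitlab_push() {
  curl --fail --silent --show-error --request POST \
    --header "X-Gitlab-Event: Push Hook" \
    --header "X-Gitlab-Token: $2" \
    --header "Content-Type: application/json" \
    --data '{}' \
    "$1"
}

# Example (hypothetical hostname, path elided):
# send_gitlab_push "https://flux-webhooks.example.test/hook/..." "$TOKEN"
```

A request with the wrong token is rejected, which is what keeps arbitrary callers from triggering syncs on the real platform.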

Run the following command to define the Receiver:

TOKEN=$(head -c 12 /dev/urandom | shasum | cut -d ' ' -f1)
kubectl create secret generic -n platform-flux-tenant-config webhook-token --from-literal=token=$TOKEN

cat <<EOF | kubectl apply -f -
apiVersion: notification.toolkit.fluxcd.io/v1
kind: Receiver
metadata:
  name: sample-tenant
  namespace: platform-flux-tenant-config
spec:
  type: generic
  secretRef:
    name: webhook-token
  resources:
    - apiVersion: source.toolkit.fluxcd.io/v1
      kind: GitRepository
      name: sample-tenant
EOF

We should now be able to see the configured Receiver:

kubectl get receivers -n platform-flux-tenant-config

And you should see output similar to the following:

NAME            READY   STATUS                                                                                                  AGE
sample-tenant   True    Receiver initialized with URL: /hook/14c781aab174583908ff3a1509d2fc2e2f940ec481e60f78d61e5c06b3cd9d9e   56s

Now, the question is... how do we actually hit the endpoint? All we need to do is define an Ingress for the webhook-receiver Service in the flux-system namespace. Then, we can hit the endpoint using our browser!

Run the following command to define an Ingress. We're going to use the hostname flux-webhooks.localhost.

cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: webhook-receiver
  namespace: flux-system
spec:
  rules:
    - host: flux-webhooks.localhost
      http:
        paths:
        - path: /
          pathType: Prefix
          backend:
            service:
              name: webhook-receiver
              port: 
                name: http
EOF

Now, make a change to the manifest in your tenant repo. After you do that, trigger the webhook manually by running the following command:

WEBHOOK_ENDPOINT=$(kubectl get receiver -n platform-flux-tenant-config sample-tenant -o=jsonpath="{.status.webhookPath}")
curl http://flux-webhooks.localhost${WEBHOOK_ENDPOINT} --resolve flux-webhooks.localhost:80:127.0.0.1

You should see the changes roll out immediately! Cool, huh?

Wrapping Up#

Now that you've played with Flux a little bit, you can see how the various objects are used to configure the subsystem. When we go over the landlord, we'll see how it provides the ability to template out the resources.

Looping back to the flux-gitlab-syncer we mentioned earlier, its sole job is to sync the SSH keys and webhook URLs onto the manifest repos automatically for us. That way, it's one less step we have to perform!

What's next?#

Now that we can support tenant workloads for simple applications, how can we ensure one tenant doesn't perform actions that affect another? To answer that, we'll talk about policy enforcement!

Go to the Policy Enforcement subsystem now!

Common Troubleshooting Tips#

Why are the tenant's manifests not being applied?

Are they actually not being applied or simply failing to apply? Look at the events on the Kustomization to find out.
```
kubectl describe kustomization -n <tenant-id> <tenant-id>
```
If the manifests are failing to apply, you'll see an error indicating why.

Validate the GitRepository has pulled the latest commit from the repo.

kubectl get gitrepositories -n platform-flux-tenant-config <tenant-id>

If not, validate the flux-gitlab-syncer is running and doesn't have any errors in the logs. If it's running, you should also see the SSH key and webhook configured on the manifest repo.
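
The checks above can be gathered into one helper. This is only a convenience sketch: the function name and the FLUX_NAMESPACE variable are hypothetical, and it assumes the tenant's Kustomization and GitRepository are both named after the tenant, as in the examples on this page.

```shell
# Namespace the Flux controllers run in (platform-flux-system on the platform).
FLUX_NAMESPACE="${FLUX_NAMESPACE:-flux-system}"

# Print the Flux status, recent events, and controller logs for one tenant.
# Usage: flux_tenant_debug <tenant-id>
flux_tenant_debug() {
  local tenant="$1"
  # Has the source fetched the latest commit?
  kubectl get gitrepository "$tenant" -n platform-flux-tenant-config
  # Did the last apply succeed?
  kubectl get kustomization "$tenant" -n "$tenant"
  # Events recorded against the Kustomization, including apply errors.
  kubectl get events -n "$tenant" \
    --field-selector "involvedObject.kind=Kustomization,involvedObject.name=$tenant"
  # Recent controller log lines that mention the tenant (none is not an error).
  kubectl logs -n "$FLUX_NAMESPACE" deployment/kustomize-controller --since=15m \
    | grep "$tenant" || true
}

# Example:
# flux_tenant_debug sample-tenant
```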