Node Pool Management#
Subsystem Goal#
This subsystem is responsible for ensuring:
- There are enough machine resources to complete the required work
- Workloads are grouped by team/group/org to add an additional security boundary and support cost accounting
Hands-on Capabilities#
This subsystem is designed for spinning up and tearing down nodes using AWS APIs, making it largely dependent on AWS environments. Therefore, it has limited functionality on local clusters. For details on how we manage on-prem clusters, see the section titled Running Node Pools on AWS vs. On-Prem below.
Components in Use#
- Karpenter - provides the ability to define node provisioners using Kubernetes objects to support direct scaling
- Cluster Autoscaler - provides the ability to scale up and down machines leveraging AWS auto-scaling groups
- Gatekeeper - by using Gatekeeper's mutation support, we can force tenant pods into their respective node pool
Background#
In Kubernetes, nodes are simply machines that can run workloads, and those nodes can come and go. It is important to note that there is no native concept of a node pool in Kubernetes. However, we can create the idea of node pools using other Kubernetes primitives.
Understanding Taints and Tolerations#
For our purposes, a node pool is essentially a collection of nodes designated to run specific workloads. A node pool might exist for a specific team, for running a team's CI workloads, or for anything else we can think of. The goal is that node pools should be easy to define and flexible.
To create a node pool in Kubernetes, we will use a combination of taints and tolerations with node affinity and labels. The idea can be stated as follows:
- Taints are put on a node to ensure pods aren't accidentally scheduled onto it
- Tolerations are placed on pods to indicate that they can tolerate the taint
- Node affinity will require the pod to be scheduled on nodes with the specified labels
To play with taints/tolerations, try the following:
Get the name of your local node:
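kubectl get nodes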
You should see output like the following (the exact age and version will vary; on Docker Desktop the node is named docker-desktop):
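NAME             STATUS   ROLES           AGE   VERSION
docker-desktop   Ready    control-plane   3d    v1.27.2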
Let's add a taint to the node to prevent pods from accidentally being scheduled on this node. We'll use a node-pool=sample taint with the NoSchedule effect (the same taint that appears in the scheduling events later in this walkthrough). Run this command, replacing docker-desktop with the name of your node:
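kubectl taint node docker-desktop node-pool=sample:NoSchedule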
Now, try launching a new pod.
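Any simple pod will do; here we start an nginx pod named taint-experiment (the image is just a placeholder, but the name matches the output below):

kubectl run taint-experiment --image=nginx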
You should see that the pod was created successfully, but it will sit in a Pending state.
If we describe the pod, we should see why it doesn't start:
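kubectl describe pod taint-experiment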
And output:
Name:         taint-experiment
Namespace:    default
...
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  79s   default-scheduler  0/1 nodes are available: 1 node(s) had taint {node-pool: sample}, that the pod didn't tolerate.
  Warning  FailedScheduling  7s    default-scheduler  0/1 nodes are available: 1 node(s) had taint {node-pool: sample}, that the pod didn't tolerate.
It's the taint that is preventing the pod from being scheduled on the node!
Let's add a toleration to the pod and see if it starts up. We're going to add the following yaml, saved to a file such as toleration-patch.yaml (any filename will do):
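# Tolerate the node-pool=sample taint we added earlier
spec:
  tolerations:
    - key: "node-pool"
      operator: "Equal"
      value: "sample"
      effect: "NoSchedule"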
Adding tolerations is one of the few in-place updates Kubernetes allows on a running pod, so we can patch our pod directly. The command to do so is:
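kubectl patch pod taint-experiment --patch-file toleration-patch.yaml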
Examine the pod now and you should see that it's successfully running!
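kubectl get pod taint-experiment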
And you should see output similar to the following:
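NAME               READY   STATUS    RESTARTS   AGE
taint-experiment   1/1     Running   0          2m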
Before we go too much farther, let's remove the taint and our test pod. And yes... the syntax to remove a taint looks a little odd:
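# The trailing "-" is what tells kubectl to remove the taint
kubectl taint node docker-desktop node-pool=sample:NoSchedule-
kubectl delete pod taint-experiment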
Defining our Node Pools with Karpenter#
Now that we understand how taints and tolerations work, let's look at how we actually manage the various node pools. Using Karpenter, we can simply define a Provisioner, which provides configuration on how to spin up nodes. The following Provisioner will be able to create nodes using the taint approach we used before and also adds labels that can be used for node affinity. We'll talk about the cost code pieces in a moment.
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: platform-docs
spec:
  # Put taints on the nodes to prevent accidental scheduling
  taints:
    - key: platform.it.vt.edu/node-pool
      value: platform-docs
      effect: NoSchedule
  # Scale down nodes after they have been empty for this many seconds (here, 5 minutes)
  ttlSecondsAfterEmpty: 300
  # Kubernetes labels to be applied to the nodes
  labels:
    platform.it.vt.edu/cost-code: platform
    platform.it.vt.edu/node-pool: platform-docs
  provider:
    instanceProfile: karpenter-profile
    # Tags used to discover the subnets/security groups to use
    securityGroupSelector:
      Name: "*eks_worker_sg"
      kubernetes.io/cluster/vt-common-platform-prod-cluster: owned
    subnetSelector:
      karpenter.sh/discovery: "*"
    # Tags to be applied to the EC2 nodes themselves
    tags:
      CostCode: platform
      Project: platform-docs
      NodePool: platform-docs
Now, when a pod is defined but unable to be scheduled (for example, due to a lack of available resources), Karpenter will try to fix the issue. Karpenter will use this Provisioner if the pod spec has the matching node affinity (or node selector) and tolerations, as in the sketch below.
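As a sketch, a pod targeting this node pool would need something like the following node selector (or equivalent node affinity) and toleration; the pod name and image here are only placeholders:

apiVersion: v1
kind: Pod
metadata:
  name: docs-builder   # hypothetical pod name
spec:
  # Node selector matching the labels the Provisioner applies to its nodes
  nodeSelector:
    platform.it.vt.edu/node-pool: platform-docs
  # Toleration matching the taint the Provisioner puts on its nodes
  tolerations:
    - key: platform.it.vt.edu/node-pool
      operator: "Equal"
      value: platform-docs
      effect: NoSchedule
  containers:
    - name: app
      image: nginx   # placeholder image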
Supporting Cost Accounting#
One of the original goals of the platform was to gain an understanding of how much each team was spending on machine resources (acknowledging that there are other costs associated with running a platform beyond just machines). To support this, we are leveraging AWS Cost Allocation tags on the machine resources themselves. Both the CostCode and Project tags are cost allocation tags, allowing us to track spending accurately at both the team and project levels.
- CostCode - represents the higher-level team/organization
- Project - the individual node pool
By separating these tags, we can have separate node pools for different functions (such as dev, CI, or production), yet roll the costs into a higher-level team/org cost.
In addition to AWS tags, we are also utilizing Kubecost, a Kubernetes cost monitoring and management tool. Kubecost provides real-time cost visibility and insights into our Kubernetes clusters, enabling us to allocate costs down to the level of individual namespaces, workloads, and even labels. This ensures that we not only understand the cost at a broader level but can also monitor and optimize expenses for specific Kubernetes resources across the platform.
Forcing Tenants into their Node Pools#
Just as we did with log forwarding, we can leverage Gatekeeper's mutation support to mutate pods and add the correct toleration and node selector. Doing this, the idea of node pools can be mostly invisible to the tenants themselves.
The following will add a nodeSelector to all pods in the sample-tenant namespace and force all pods to run on nodes with a label of platform.it.vt.edu/node-pool=sample-pool.
apiVersion: mutations.gatekeeper.sh/v1beta1
kind: Assign
metadata:
  name: sample-tenant-nodepool-selector
  namespace: gatekeeper-system
spec:
  applyTo:
    - groups: [""]
      kinds: ["Pod"]
      versions: ["v1"]
  match:
    scope: Namespaced
    kinds:
      - apiGroups: ["*"]
        kinds: ["Pod"]
    namespaces: ["sample-tenant"]
  location: "spec.nodeSelector"
  parameters:
    assign:
      value:
        platform.it.vt.edu/node-pool: "sample-pool"
And then the following mutation will add a toleration, which will allow the pod to actually run on the nodes with the node pool taints.
apiVersion: mutations.gatekeeper.sh/v1beta1
kind: Assign
metadata:
  name: sample-tenant-nodepool-toleration
  namespace: gatekeeper-system
spec:
  applyTo:
    - groups: [""]
      kinds: ["Pod"]
      versions: ["v1"]
  match:
    scope: Namespaced
    kinds:
      - apiGroups: ["*"]
        kinds: ["Pod"]
    namespaces: ["sample-tenant"]
  location: "spec.tolerations"
  parameters:
    assign:
      value:
        - key: platform.it.vt.edu/node-pool
          operator: "Equal"
          value: "sample-pool"
The landlord chart makes it possible for us to define the node pools themselves (which creates the Provisioner objects) and define the necessary mutations needed for tenant workloads to run on the correct nodes.
Running Node Pools on AWS vs. On-Prem#
In our platform, node pool management varies depending on the environment. We use Karpenter for AWS-based clusters and EKS Anywhere (EKSA) for on-prem clusters. EKSA brings the power and flexibility of AWS EKS to on-premises environments, allowing us to maintain a consistent Kubernetes experience across both cloud and on-prem setups.
Karpenter is tightly integrated with AWS services and is specifically designed to manage node pools within AWS. For our on-prem clusters, which are built on VMware, we configure node pools using EKSA to suit the specific needs of our on-prem infrastructure.
Here’s a simplified example of a node pool configuration in an on-prem environment using EKSA:
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: VSphereMachineConfig
metadata:
  name: example-nodepool
  namespace: default
spec:
  datastore: global-datastore
  diskGiB: 50
  memoryMiB: 8192
  numCPUs: 4
  osFamily: bottlerocket
  resourcePool: /AISB-Common-Platform/host/plat-isb-cluster/Resources
This configuration defines the resource allocation for a specific node pool in a VMware-based cluster. It specifies the datastore, disk size, memory, CPU count, and operating system for the nodes, ensuring that each node pool is optimized for its intended workload.
By leveraging EKSA’s integration with VMware, we achieve smooth operations within our existing IT infrastructure while maintaining consistency across our cloud and on-prem environments. This approach ensures that our resources are utilized efficiently and securely, regardless of where the clusters are running.