# Node and Pod Sizing
Node and pod sizing are critical aspects of Kubernetes resource management that directly impact application performance, cluster efficiency, and operational costs. This guide covers sizing configurations across both on-premise (EKS Anywhere) and AWS (EKS with Karpenter) environments in the IT Common Platform.
## On-Premise Pod Definitions (EKS Anywhere)
The platform's on-premise infrastructure runs on Amazon EKS Anywhere (EKS-A) deployed on VMware vSphere. This provides a consistent Kubernetes experience across data center environments.
### Node Size Specifications
The platform defines standardized node sizes for on-premise deployments in https://code.vt.edu/it-common-platform/infrastructure/eksa-vsphere/-/blob/main/nodesize.yaml?ref_type=heads:
| Size | Description |
|---|---|
| tiny | Minimal resources for lightweight workloads |
| small | Standard workloads with modest requirements |
| medium | Applications with moderate resource needs |
| large | Resource-intensive applications |
| xlarge | High-performance computing workloads |
For current specifications, refer to the nodesize.yaml configuration file.
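As a rough illustration of what that file encodes, the inline size comments in the environment configurations below imply a mapping like this (the `sizes` key and field names here are hypothetical; the actual schema is whatever nodesize.yaml defines):

```yaml
# Hypothetical sketch only -- consult nodesize.yaml for the real schema and values.
# Figures mirror the inline comments in the environment files below.
sizes:
  tiny:
    numCPUs: 1
    memoryMiB: 4096 # 4 GB
  small:
    numCPUs: 2
    memoryMiB: 8192 # 8 GB
  xlarge:
    numCPUs: 8
    memoryMiB: 65536 # 64 GB
```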
### Environment-Specific Node Pools
Each environment maintains distinct node pool configurations optimized for workload requirements:
#### Development Environment
Configuration file: https://code.vt.edu/it-common-platform/infrastructure/eksa-vsphere/-/blob/main/dvlp.yaml?ref_type=heads
```yaml
workerNodeGroupConfigurations:
  - name: es
    count: 2
    machineRef: small # 2 CPU, 8GB RAM
  - name: platform
    count: 4
    machineRef: small
  - name: core
    count: 3
    autoscalingConfiguration:
      minCount: 3
      maxCount: 5
    machineRef: small
```
#### Production Environment
Configuration file: https://code.vt.edu/it-common-platform/infrastructure/eksa-vsphere/-/blob/main/prod.yaml?ref_type=heads
```yaml
workerNodeGroupConfigurations:
  - name: es
    count: 6
    autoscalingConfiguration:
      minCount: 6
      maxCount: 9
    machineRef: xlarge # 8 CPU, 64GB RAM
  - name: sis
    count: 2
    machineRef: tiny # 1 CPU, 4GB RAM
  - name: platform
    count: 2
    machineRef: small # 2 CPU, 8GB RAM
  - name: core
    count: 3
    autoscalingConfiguration:
      minCount: 3
      maxCount: 8
    machineRef: small
```
### Control Plane Sizing
Control plane components use consistent sizing across environments for stability:
```yaml
controlPlaneConfiguration:
  machineSpec:
    numCPUs: 2
    memoryMiB: 32768
    diskGiB: 26
externalEtcdConfiguration:
  machineSpec:
    numCPUs: 2
    memoryMiB: 4096
    diskGiB: 26
```
### Applying On-Premise Configuration Changes
To modify node pool sizes in an on-premise cluster:
1. Edit the appropriate environment file (`dvlp.yaml`, `pprd.yaml`, or `prod.yaml`).
2. Update the node pool configuration (see the example below).
3. Apply the changes using the pipeline.
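For example, raising the autoscaling ceiling on the development `core` pool is a one-line change in `dvlp.yaml` (the new `maxCount` value is illustrative):

```yaml
workerNodeGroupConfigurations:
  - name: core
    count: 3
    autoscalingConfiguration:
      minCount: 3
      maxCount: 6 # raised from 5; illustrative value
    machineRef: small
```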
## Karpenter Autoscaling (AWS EKS)
For AWS-based deployments, the platform uses Karpenter for intelligent node provisioning and autoscaling. Karpenter automatically provisions right-sized compute resources based on pod requirements.
### Karpenter Configuration
Base configuration: https://code.vt.edu/it-common-platform/infrastructure/eks-cluster/-/blob/main/cluster-bootstrap/environments/aws/karpenter.tf?ref_type=heads
The Karpenter installation includes:
- Namespace: `platform-karpenter`
- Controller resource requests and limits sized for the cluster (see the sketch below)
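The exact controller sizing lives in `karpenter.tf`; as a sketch, the upstream Karpenter Helm chart exposes it under `controller.resources`, so the managed values take roughly this shape (figures are illustrative):

```yaml
# Illustrative values for the Karpenter Helm chart's controller resources;
# the platform's actual figures are set in karpenter.tf.
controller:
  resources:
    requests:
      cpu: 500m
      memory: 512Mi
    limits:
      cpu: "1"
      memory: 1Gi
```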
### Node Pool Templates
Karpenter node pools are managed through the landlord system, which provides tenant-specific node pools with resource isolation.
Template location: https://code.vt.edu/it-common-platform/platform-support/helm-charts/landlord/-/blob/main/templates/nodepool.yaml?ref_type=heads
### Example Node Pool Configuration
```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: platform
spec:
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 300s # Scale down empty nodes after 5 minutes
  limits:
    cpu: 1000
    memory: 1000Gi
  template:
    spec:
      expireAfter: 604800s # Node TTL: 7 days
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
            - t3a.medium
            - t3a.large
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: platform
      taints:
        - key: platform.it.vt.edu/node-pool
          value: platform
          effect: NoSchedule
```
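Because of that taint, workloads must opt in to the pool: the pod spec (or a controller's pod template) needs a matching toleration, for example:

```yaml
# Pod spec excerpt: tolerate the platform pool's taint so pods can schedule there
tolerations:
  - key: platform.it.vt.edu/node-pool
    operator: Equal
    value: platform
    effect: NoSchedule
```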
### Production Node Pool Examples
Configuration file: https://code.vt.edu/it-common-platform/tenants/it-common-platform-landlord/-/blob/main/prod/tenant-config-0.yaml?ref_type=heads
| Pool Name | Purpose | Instance Strategy |
|---|---|---|
| es | Enterprise Services workloads | Mixed instance types for flexibility |
| harbor | Container registry | Optimized for storage and network |
| nis-apps | NIS application workloads | Scaled for application requirements |
| platform | Platform services | Balanced compute and memory |
| sis | Student Information Systems | Sized for transactional workloads |
For current configurations, refer to the appropriate tenant configuration files.
### Configuring Karpenter Node Pools
To create or modify a Karpenter node pool:
1. Edit the tenant configuration for the target environment.
2. Add or modify a node pool definition (see the sketch below).
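The exact tenant-config keys are defined by the landlord chart, so consult it (and the existing pools in tenant-config-0.yaml) for the schema; whatever you set there renders into a NodePool like the example above. Broadening a pool's allowed instance types, for instance, would surface in the rendered NodePool as follows (the added type is illustrative):

```yaml
requirements:
  - key: node.kubernetes.io/instance-type
    operator: In
    values:
      - t3a.medium
      - t3a.large
      - t3a.xlarge # illustrative addition
```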
## High Availability (HA) Considerations
High availability is achieved through multiple strategies across the platform:
### Replica Management
Applications should define appropriate replica counts for availability:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3 # minimum 2 for HA
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0 # no pods become unavailable during a rolling update
```
### Pod Disruption Budgets
Protect applications during cluster maintenance with PDBs:
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 1 # or use maxUnavailable
  selector:
    matchLabels:
      app: my-app
```
### Anti-Affinity Rules
Distribute pods across nodes for resilience:
```yaml
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - my-app
                topologyKey: kubernetes.io/hostname
```
### Control Plane HA
Both on-premise and AWS environments maintain highly available control planes:
- EKS-A: 3 control plane nodes, 3 etcd nodes
- EKS: AWS-managed control plane with multi-AZ deployment
## Node Efficiency Best Practices

### Resource Requests and Limits
Always define resource requests and limits for predictable scheduling and performance:
Example from: https://code.vt.edu/it-common-platform/tenants/aws-prod/itsl-covervt/-/blob/main/manifest.yaml?ref_type=heads
```yaml
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
        - name: app
          resources:
            requests:
              memory: "32Mi"
              cpu: "50m"
            limits:
              memory: "64Mi"
              cpu: "250m"
```
### Resource Sizing Guidelines
| Workload Type | CPU Request | CPU Limit | Memory Request | Memory Limit |
|---|---|---|---|---|
| Microservice | 50m-100m | 250m-500m | 64Mi-128Mi | 256Mi-512Mi |
| Web App | 100m-250m | 500m-1000m | 256Mi-512Mi | 1Gi-2Gi |
| Database | 500m-1000m | 2000m-4000m | 1Gi-2Gi | 4Gi-8Gi |
| Batch Job | 250m-500m | 1000m-2000m | 512Mi-1Gi | 2Gi-4Gi |
### Quality of Service Classes

Kubernetes assigns a QoS class to each pod based on its resource specifications:

- Guaranteed: every container sets requests equal to limits for both CPU and memory
- Burstable: at least one container sets a request or limit, but the pod does not qualify as Guaranteed
- BestEffort: no requests or limits set (not recommended for production)
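For example, a pod whose containers set requests equal to limits lands in the Guaranteed class (names and figures below are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: qos-demo # illustrative name
spec:
  containers:
    - name: app
      image: registry.example.com/app:latest # placeholder image
      resources:
        requests:
          cpu: 250m
          memory: 256Mi
        limits:
          cpu: 250m # equal to the request -> Guaranteed QoS
          memory: 256Mi
```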
### Node Pool Efficiency

Optimize node pool configurations for cost and performance (a sketch of the relevant Karpenter fields follows this list):

- Right-size instance types: match instance capabilities to workload requirements
- Use spot instances for fault-tolerant workloads
- Configure appropriate TTLs for empty nodes
- Set node expiration for automatic refresh
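The last three practices map onto Karpenter v1 NodePool fields already seen in the example above; this sketch gathers them in one place (the pool name and values are illustrative):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: batch # hypothetical pool for fault-tolerant work
spec:
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 300s # reclaim empty nodes after 5 minutes
  template:
    spec:
      expireAfter: 604800s # expire nodes after 7 days for automatic refresh
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"] # allow spot for fault-tolerant workloads
```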
## Monitoring and Optimization

### Resource Utilization Metrics
Monitor pod and node resource utilization through the platform's monitoring stack:
- Prometheus metrics: CPU and memory usage per pod/node
- Kubecost: Cost analysis and optimization recommendations
- Grafana dashboards: Visualize resource trends
### Optimization Process
1. Analyze current usage with `kubectl top pods` and `kubectl top nodes`, or the Grafana dashboards.
2. Review resource requests against actual usage; Kubecost surfaces this comparison.
3. Adjust resource specifications based on the observed patterns.
4. Implement horizontal pod autoscaling for dynamic scaling (see the sketch below).
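A minimal HorizontalPodAutoscaler for the `my-app` Deployment from the HA examples might look like this (the replica bounds and 70% CPU target are illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2 # keep the HA floor recommended above
  maxReplicas: 6
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```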
## Common Configuration Files
When implementing node sizing, these are the key files to modify:
### On-Premise (EKS-A)
- Node sizes: https://code.vt.edu/it-common-platform/infrastructure/eksa-vsphere/-/blob/main/nodesize.yaml?ref_type=heads
- Environment configs:
  - Development: https://code.vt.edu/it-common-platform/infrastructure/eksa-vsphere/-/blob/main/dvlp.yaml?ref_type=heads
  - Pre-production: https://code.vt.edu/it-common-platform/infrastructure/eksa-vsphere/-/blob/main/pprd.yaml?ref_type=heads
  - Production: https://code.vt.edu/it-common-platform/infrastructure/eksa-vsphere/-/blob/main/prod.yaml?ref_type=heads
### AWS (EKS with Karpenter)
- Karpenter setup: https://code.vt.edu/it-common-platform/infrastructure/eks-cluster/-/blob/main/cluster-bootstrap/environments/aws/karpenter.tf?ref_type=heads
- Node pool templates: https://code.vt.edu/it-common-platform/platform-support/helm-charts/landlord/-/blob/main/templates/nodepool.yaml?ref_type=heads
- Tenant configurations:
  - Production: https://code.vt.edu/it-common-platform/tenants/it-common-platform-landlord/-/blob/main/prod/tenant-config-0.yaml?ref_type=heads
  - Pre-production: https://code.vt.edu/it-common-platform/tenants/it-common-platform-landlord/-/blob/main/pprd/tenant-config-0.yaml?ref_type=heads
## Best Practices Summary
- Always specify resource requests and limits for predictable performance
- Use appropriate node sizes based on workload requirements
- Implement HA patterns (replicas, PDBs, anti-affinity) for critical applications
- Enable autoscaling where appropriate: HPA for pods, and Karpenter (AWS) or the cluster autoscaler (on-premise) for nodes
- Monitor resource utilization and optimize based on actual usage patterns
- Configure node TTLs to balance cost and availability
- Separate workloads using node pools and taints/tolerations for isolation