Node and Pod Sizing#

Pod sizing is a critical aspect of Kubernetes resource management that directly impacts application performance, cluster efficiency, and operational costs. This guide covers pod sizing configurations across both on-premise (EKS Anywhere) and AWS (EKS with Karpenter) environments in the IT Common Platform.

On-Premise Pod Definitions (EKS Anywhere)#

The platform's on-premise infrastructure runs on Amazon EKS Anywhere (EKS-A) deployed on VMware vSphere. This provides a consistent Kubernetes experience across data center environments.

Node Size Specifications#

The platform defines standardized node sizes for on-premise deployments in https://code.vt.edu/it-common-platform/infrastructure/eksa-vsphere/-/blob/main/nodesize.yaml?ref_type=heads:

| Size | Description |
|--------|----------------------------------------------|
| tiny | Minimal resources for lightweight workloads |
| small | Standard workloads with modest requirements |
| medium | Applications with moderate resource needs |
| large | Resource-intensive applications |
| xlarge | High-performance computing workloads |

For current specifications, refer to the nodesize.yaml configuration file.
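
The exact contents of nodesize.yaml should be taken from the file itself, but based on the machineSpec fields used in the control plane configuration below and the CPU/RAM annotations in the environment files, an entry plausibly looks like this (illustrative sketch only; field names and values are assumptions):

```yaml
# Illustrative sketch of a nodesize.yaml entry -- structure and values
# are assumptions; consult the actual file for current specifications.
small:
  machineSpec:
    numCPUs: 2        # matches the "small # 2 CPU, 8GB RAM" annotations below
    memoryMiB: 8192
    diskGiB: 26
```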

Environment-Specific Node Pools#

Each environment maintains distinct node pool configurations optimized for workload requirements:

Development Environment#

Configuration file: https://code.vt.edu/it-common-platform/infrastructure/eksa-vsphere/-/blob/main/dvlp.yaml?ref_type=heads

workerNodeGroupConfigurations:
  - name: es
    count: 2
    machineRef: small  # 2 CPU, 8GB RAM
  - name: platform
    count: 4
    machineRef: small
  - name: core
    count: 3
    autoscalingConfiguration:
      minCount: 3
      maxCount: 5
    machineRef: small

Production Environment#

Configuration file: https://code.vt.edu/it-common-platform/infrastructure/eksa-vsphere/-/blob/main/prod.yaml?ref_type=heads

workerNodeGroupConfigurations:
  - name: es
    count: 6
    autoscalingConfiguration:
      minCount: 6
      maxCount: 9
    machineRef: xlarge  # 8 CPU, 64GB RAM
  - name: sis
    count: 2
    machineRef: tiny    # 1 CPU, 4GB RAM
  - name: platform
    count: 2
    machineRef: small   # 2 CPU, 8GB RAM
  - name: core
    count: 3
    autoscalingConfiguration:
      minCount: 3
      maxCount: 8
    machineRef: small

Control Plane Sizing#

Control plane components use consistent sizing across environments for stability:

controlPlaneConfiguration:
  machineSpec:
    numCPUs: 2
    memoryMiB: 32768
    diskGiB: 26

externalEtcdConfiguration:
  machineSpec:
    numCPUs: 2
    memoryMiB: 4096
    diskGiB: 26

Applying On-Premise Configuration Changes#

To modify node pool sizes in an on-premise cluster:

  1. Edit the appropriate environment file:

    vim it-common-platform/infrastructure/eksa-vsphere/prod.yaml
    

  2. Update the node pool configuration:

    workerNodeGroupConfigurations:
      - name: your-pool
        count: 4
        machineRef: medium  # Change size reference
    

  3. Apply changes using the pipeline

Karpenter Autoscaling (AWS EKS)#

For AWS-based deployments, the platform uses Karpenter for intelligent node provisioning and autoscaling. Karpenter automatically provisions right-sized compute resources based on pod requirements.

Karpenter Configuration#

Base configuration: https://code.vt.edu/it-common-platform/infrastructure/eks-cluster/-/blob/main/cluster-bootstrap/environments/aws/karpenter.tf?ref_type=heads

Karpenter installation includes:

  • Namespace: platform-karpenter
  • Controller resources configured appropriately for the cluster size

Node Pool Templates#

Karpenter node pools are managed through the landlord system, which provides tenant-specific node pools with resource isolation.

Template location: https://code.vt.edu/it-common-platform/platform-support/helm-charts/landlord/-/blob/main/templates/nodepool.yaml?ref_type=heads

Example Node Pool Configuration#

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: platform
spec:
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 300s  # Scale down empty nodes after 5 minutes

  limits:
    cpu: 1000
    memory: 1000Gi

  template:
    spec:
      expireAfter: 604800s  # Node TTL: 7 days

      requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["on-demand"]
      - key: kubernetes.io/arch
        operator: In
        values: ["amd64"]
      - key: node.kubernetes.io/instance-type
        operator: In
        values:
        - t3a.medium
        - t3a.large

      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: platform

      taints:
      - key: platform.it.vt.edu/node-pool
        value: platform
        effect: NoSchedule
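
Pods scheduled onto this tainted pool must carry a matching toleration, typically paired with a node selector. A minimal sketch (the node label key used in nodeSelector is an assumption; the taint key and value match the NodePool definition above):

```yaml
# Pod spec fragment targeting the tainted "platform" pool.
spec:
  tolerations:
  - key: platform.it.vt.edu/node-pool   # matches the NodePool taint above
    operator: Equal
    value: platform
    effect: NoSchedule
  nodeSelector:
    platform.it.vt.edu/node-pool: platform  # assumed node label; verify in your cluster
```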

Production Node Pool Examples#

Configuration file: https://code.vt.edu/it-common-platform/tenants/it-common-platform-landlord/-/blob/main/prod/tenant-config-0.yaml?ref_type=heads

| Pool Name | Purpose | Instance Strategy |
|-----------|---------|-------------------|
| es | Enterprise Services workloads | Mixed instance types for flexibility |
| harbor | Container registry | Optimized for storage and network |
| nis-apps | NIS application workloads | Scaled for application requirements |
| platform | Platform services | Balanced compute and memory |
| sis | Student Information Systems | Sized for transactional workloads |

For current configurations, refer to the appropriate tenant configuration files.

Configuring Karpenter Node Pools#

To create or modify a Karpenter node pool:

  1. Edit the tenant configuration:

    vim .../it-common-platform/tenants/it-common-platform-landlord/prod/tenant-config-0.yaml
    

  2. Add or modify a node pool definition:

    nodePools:
      - name: my-app
        instanceTypes:
          - t3a.medium
          - t3a.large
        limits:
          cpu: 10
          memory: 32Gi
        emptyTtl: 300
        costCode: "CC12345"
        capacityTypes: ["on-demand", "spot"]  # Optional: enable spot instances
    

High Availability (HA) Considerations#

High availability is achieved through multiple strategies across the platform:

Replica Management#

Applications should define appropriate replica counts for availability:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3  # Minimum 2 for HA
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0  # no pods can be unavailable during an update to the deployment

Pod Disruption Budgets#

Protect applications during cluster maintenance with PDBs:

Example from: https://code.vt.edu/it-common-platform/platform-support/helm-charts/simple-reverse-proxy/-/blob/main/templates/poddisruptionbudget.yaml?ref_type=heads

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 1  # Or use maxUnavailable
  selector:
    matchLabels:
      app: my-app
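
The comment above notes maxUnavailable as an alternative to minAvailable. For a deployment with three replicas, an equivalent budget could be expressed as:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  maxUnavailable: 1  # allow at most one pod down during voluntary disruptions
  selector:
    matchLabels:
      app: my-app
```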

Anti-Affinity Rules#

Distribute pods across nodes for resilience:

apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - my-app
              topologyKey: kubernetes.io/hostname

Control Plane HA#

Both on-premise and AWS environments maintain highly available control planes:

  • EKS-A: 3 control plane nodes, 3 etcd nodes
  • EKS: AWS-managed control plane with multi-AZ deployment

Node Efficiency Best Practices#

Resource Requests and Limits#

Always define resource requests and limits for predictable scheduling and performance:

Example from: https://code.vt.edu/it-common-platform/tenants/aws-prod/itsl-covervt/-/blob/main/manifest.yaml?ref_type=heads

apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
      - name: app
        resources:
          requests:
            memory: "32Mi"
            cpu: "50m"
          limits:
            memory: "64Mi"
            cpu: "250m"

Resource Sizing Guidelines#

| Workload Type | CPU Request | CPU Limit | Memory Request | Memory Limit |
|---------------|-------------|-------------|----------------|--------------|
| Microservice | 50m-100m | 250m-500m | 64Mi-128Mi | 256Mi-512Mi |
| Web App | 100m-250m | 500m-1000m | 256Mi-512Mi | 1Gi-2Gi |
| Database | 500m-1000m | 2000m-4000m | 1Gi-2Gi | 4Gi-8Gi |
| Batch Job | 250m-500m | 1000m-2000m | 512Mi-1Gi | 2Gi-4Gi |
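
Applied to a container spec, the lower end of the Web App row above translates to:

```yaml
resources:
  requests:
    cpu: "100m"       # Web App row: 100m-250m request
    memory: "256Mi"   # Web App row: 256Mi-512Mi request
  limits:
    cpu: "500m"       # Web App row: 500m-1000m limit
    memory: "1Gi"     # Web App row: 1Gi-2Gi limit
```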

Quality of Service Classes#

Kubernetes assigns QoS classes based on resource specifications:

  1. Guaranteed: Requests equal limits for all resources

    resources:
      requests:
        cpu: "1"
        memory: "1Gi"
      limits:
        cpu: "1"
        memory: "1Gi"
    

  2. Burstable: At least one container has a CPU or memory request or limit set, but the pod does not meet the Guaranteed criteria

    resources:
      requests:
        cpu: "100m"
        memory: "128Mi"
      limits:
        cpu: "500m"
        memory: "512Mi"
    

  3. BestEffort: No requests or limits (not recommended for production)

Node Pool Efficiency#

Optimize node pool configurations for cost and performance:

  1. Right-size instance types: Match instance capabilities to workload requirements
  2. Use spot instances for fault-tolerant workloads:
    capacityTypes: ["on-demand", "spot"]
    
  3. Configure appropriate TTLs for empty nodes:
    emptyTtl: 300  # 5 minutes for production
    
  4. Set node expiration for automatic refresh:
    expireAfter: 604800s  # 7 days
    
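Points 2 and 3 combine naturally in a single landlord node pool entry (same fields as the tenant configuration example earlier; the pool name, limits, and cost code here are illustrative):

```yaml
nodePools:
  - name: batch-workers                   # illustrative pool name
    instanceTypes:
      - t3a.large
    capacityTypes: ["on-demand", "spot"]  # spot for fault-tolerant workloads
    limits:
      cpu: 20
      memory: 64Gi
    emptyTtl: 300                         # reclaim empty nodes after 5 minutes
    costCode: "CC12345"                   # hypothetical cost code
```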

Monitoring and Optimization#

Resource Utilization Metrics#

Monitor pod and node resource utilization through the platform's monitoring stack:

  1. Prometheus metrics: CPU and memory usage per pod/node
  2. Kubecost: Cost analysis and optimization recommendations
  3. Grafana dashboards: Visualize resource trends

Optimization Process#

  1. Analyze current usage:

    kubectl top pods -n my-namespace
    kubectl top nodes
    

  2. Review resource requests vs actual usage:

    kubectl describe pod my-pod -n my-namespace
    

  3. Adjust resource specifications based on observed patterns

  4. Implement horizontal pod autoscaling for dynamic scaling:

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: my-app-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: my-app
      minReplicas: 2
      maxReplicas: 10
      metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 70
    

Common Configuration Files#

When implementing node sizing, these are the key files to modify:

On-Premise (EKS-A)#

  • nodesize.yaml - standardized node size definitions
  • dvlp.yaml - development environment node pools
  • prod.yaml - production environment node pools

All located in the it-common-platform/infrastructure/eksa-vsphere repository.

AWS (EKS with Karpenter)#

  • cluster-bootstrap/environments/aws/karpenter.tf - base Karpenter installation (eks-cluster repository)
  • templates/nodepool.yaml - node pool template (landlord Helm chart)
  • prod/tenant-config-0.yaml - production tenant node pool definitions (it-common-platform-landlord repository)

Best Practices Summary#

  1. Always specify resource requests and limits for predictable performance
  2. Use appropriate node sizes based on workload requirements
  3. Implement HA patterns (replicas, PDBs, anti-affinity) for critical applications
  4. Enable autoscaling where appropriate (HPA for pods; Karpenter for nodes on AWS, node pool autoscaling on-premise)
  5. Monitor resource utilization and optimize based on actual usage patterns
  6. Configure node TTLs to balance cost and availability
  7. Separate workloads using node pools and taints/tolerations for isolation