Node and Pod Sizing#

Pod sizing is a critical aspect of Kubernetes resource management that directly impacts application performance, cluster efficiency, and operational costs. This guide covers pod sizing configurations across both on-premise (EKS Anywhere) and AWS (EKS with Karpenter) environments in the IT Common Platform.

On-Premise Pod Definitions (EKS Anywhere)#

The platform's on-premise infrastructure runs on Amazon EKS Anywhere (EKS-A) deployed on VMware vSphere. This provides a consistent Kubernetes experience across data center environments.

Node Size Specifications#

The platform defines standardized node sizes for on-premise deployments in https://code.vt.edu/it-common-platform/infrastructure/eksa-vsphere/-/blob/main/nodesize.yaml?ref_type=heads:

| Size | Description |
|--------|----------------------------------------------|
| tiny | Minimal resources for lightweight workloads |
| small | Standard workloads with modest requirements |
| medium | Applications with moderate resource needs |
| large | Resource-intensive applications |
| xlarge | High-performance computing workloads |

For current specifications, refer to the nodesize.yaml configuration file.
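
The exact contents of nodesize.yaml should be taken from the file itself, but based on the machineSpec fields used in the control plane configuration below and the CPU/RAM annotations in the environment files, an entry plausibly looks like this (illustrative sketch only; field names and values are assumptions):

```yaml
# Illustrative sketch of a nodesize.yaml entry -- structure and values
# are assumptions; consult the actual file for current specifications.
small:
  machineSpec:
    numCPUs: 2        # matches the "small # 2 CPU, 8GB RAM" annotations below
    memoryMiB: 8192
    diskGiB: 26
```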

Environment-Specific Node Pools#

Each environment maintains distinct node pool configurations optimized for workload requirements:

Development Environment#

Configuration file: https://code.vt.edu/it-common-platform/infrastructure/eksa-vsphere/-/blob/main/dvlp.yaml?ref_type=heads

workerNodeGroupConfigurations:
  - name: es
    count: 2
    machineRef: small  # 2 CPU, 8GB RAM
  - name: platform
    count: 4
    machineRef: small
  - name: core
    count: 3
    autoscalingConfiguration:
      minCount: 3
      maxCount: 5
    machineRef: small

Production Environment#

Configuration file: https://code.vt.edu/it-common-platform/infrastructure/eksa-vsphere/-/blob/main/prod.yaml?ref_type=heads

workerNodeGroupConfigurations:
  - name: es
    count: 6
    autoscalingConfiguration:
      minCount: 6
      maxCount: 9
    machineRef: xlarge  # 8 CPU, 64GB RAM
  - name: sis
    count: 2
    machineRef: tiny    # 1 CPU, 4GB RAM
  - name: platform
    count: 2
    machineRef: small   # 2 CPU, 8GB RAM
  - name: core
    count: 3
    autoscalingConfiguration:
      minCount: 3
      maxCount: 8
    machineRef: small

Control Plane Sizing#

Control plane components use consistent sizing across environments for stability:

controlPlaneConfiguration:
  machineSpec:
    numCPUs: 2
    memoryMiB: 32768
    diskGiB: 26

externalEtcdConfiguration:
  machineSpec:
    numCPUs: 2
    memoryMiB: 4096
    diskGiB: 26

Applying On-Premise Configuration Changes#

To modify node pool sizes in an on-premise cluster:

  1. Edit the appropriate environment file:

    vim it-common-platform/infrastructure/eksa-vsphere/prod.yaml
    

  2. Update the node pool configuration:

    workerNodeGroupConfigurations:
      - name: your-pool
        count: 4
        machineRef: medium  # Change size reference
    

  3. Apply changes using the pipeline

Karpenter Autoscaling (AWS EKS)#

For AWS-based deployments, the platform uses Karpenter for intelligent node provisioning and autoscaling. Karpenter automatically provisions right-sized compute resources based on pod requirements.

Karpenter Configuration#

Base configuration: https://code.vt.edu/it-common-platform/infrastructure/eks-cluster/-/blob/main/cluster-bootstrap/environments/aws/karpenter.tf?ref_type=heads

Karpenter installation includes:

  • Namespace: platform-karpenter
  • Controller resources configured appropriately for the cluster size

Node Pool Templates#

Karpenter node pools are managed through the landlord system, which provides tenant-specific node pools with resource isolation.

Template location: https://code.vt.edu/it-common-platform/platform-support/helm-charts/landlord/-/blob/main/templates/nodepool.yaml?ref_type=heads

Example Node Pool Configuration#

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: platform
spec:
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 300s  # Scale down empty nodes after 5 minutes

  limits:
    cpu: 1000
    memory: 1000Gi

  template:
    spec:
      expireAfter: 604800s  # Node TTL: 7 days

      requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["on-demand"]
      - key: kubernetes.io/arch
        operator: In
        values: ["amd64"]
      - key: node.kubernetes.io/instance-type
        operator: In
        values:
        - t3a.medium
        - t3a.large

      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: platform

      taints:
      - key: platform.it.vt.edu/node-pool
        value: platform
        effect: NoSchedule
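
Pods scheduled onto this tainted pool must carry a matching toleration, typically paired with a node selector. A minimal sketch (the node label key used in nodeSelector is an assumption; the taint key and value match the NodePool definition above):

```yaml
# Pod spec fragment targeting the tainted "platform" pool.
spec:
  tolerations:
  - key: platform.it.vt.edu/node-pool   # matches the NodePool taint above
    operator: Equal
    value: platform
    effect: NoSchedule
  nodeSelector:
    platform.it.vt.edu/node-pool: platform  # assumed node label; verify in your cluster
```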

Production Node Pool Examples#

Configuration file: https://code.vt.edu/it-common-platform/tenants/it-common-platform-landlord/-/blob/main/prod/tenant-config-0.yaml?ref_type=heads

| Pool Name | Purpose | Instance Strategy |
|-----------|---------|-------------------|
| es | Enterprise Services workloads | Mixed instance types for flexibility |
| harbor | Container registry | Optimized for storage and network |
| nis-apps | NIS application workloads | Scaled for application requirements |
| platform | Platform services | Balanced compute and memory |
| sis | Student Information Systems | Sized for transactional workloads |

For current configurations, refer to the appropriate tenant configuration files.

Configuring Karpenter Node Pools#

To create or modify a Karpenter node pool:

  1. Edit the tenant configuration:

    vim .../it-common-platform/tenants/it-common-platform-landlord/prod/tenant-config-0.yaml
    

  2. Add or modify a node pool definition:

    nodePools:
      - name: my-app
        instanceTypes:
          - t3a.medium
          - t3a.large
        limits:
          cpu: 10
          memory: 32Gi
        emptyTtl: 300
        costCode: "CC12345"
        capacityTypes: ["on-demand", "spot"]  # Optional: enable spot instances
    

High Availability (HA) Considerations#

High availability is achieved through multiple strategies across the platform:

Replica Management#

Applications should define appropriate replica counts for availability:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3  # Minimum 2 for HA
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0  # no pods can be unavailable during an update to the deployment

Pod Disruption Budgets#

Protect applications during cluster maintenance with PDBs:

Example from: https://code.vt.edu/it-common-platform/platform-support/helm-charts/simple-reverse-proxy/-/blob/main/templates/poddisruptionbudget.yaml?ref_type=heads

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 1  # Or use maxUnavailable
  selector:
    matchLabels:
      app: my-app
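
The comment above notes maxUnavailable as an alternative to minAvailable. For a deployment with three replicas, an equivalent budget could be expressed as:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  maxUnavailable: 1  # allow at most one pod down during voluntary disruptions
  selector:
    matchLabels:
      app: my-app
```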

Anti-Affinity Rules#

Distribute pods across nodes for resilience:

apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - my-app
              topologyKey: kubernetes.io/hostname

Control Plane HA#

Both on-premise and AWS environments maintain highly available control planes:

  • EKS-A: 3 control plane nodes, 3 etcd nodes
  • EKS: AWS-managed control plane with multi-AZ deployment

Node Efficiency Best Practices#

Resource Requests and Limits#

Always define resource requests and limits for predictable scheduling and performance:

Example from: https://code.vt.edu/it-common-platform/tenants/aws-prod/itsl-covervt/-/blob/main/manifest.yaml?ref_type=heads

apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
      - name: app
        resources:
          requests:
            memory: "32Mi"
            cpu: "50m"
          limits:
            memory: "64Mi"
            cpu: "250m"

Resource Sizing Guidelines#

| Workload Type | CPU Request | CPU Limit | Memory Request | Memory Limit |
|---------------|-------------|-------------|----------------|--------------|
| Microservice | 50m-100m | 250m-500m | 64Mi-128Mi | 256Mi-512Mi |
| Web App | 100m-250m | 500m-1000m | 256Mi-512Mi | 1Gi-2Gi |
| Database | 500m-1000m | 2000m-4000m | 1Gi-2Gi | 4Gi-8Gi |
| Batch Job | 250m-500m | 1000m-2000m | 512Mi-1Gi | 2Gi-4Gi |
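
Applied to a container spec, the lower end of the Web App row above translates to:

```yaml
resources:
  requests:
    cpu: "100m"       # Web App row: 100m-250m request
    memory: "256Mi"   # Web App row: 256Mi-512Mi request
  limits:
    cpu: "500m"       # Web App row: 500m-1000m limit
    memory: "1Gi"     # Web App row: 1Gi-2Gi limit
```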

Quality of Service Classes#

Kubernetes assigns QoS classes based on resource specifications:

  1. Guaranteed: Requests equal limits for all resources

    resources:
      requests:
        cpu: "1"
        memory: "1Gi"
      limits:
        cpu: "1"
        memory: "1Gi"
    

  2. Burstable: At least one container has a CPU or memory request or limit set, but the pod does not meet the Guaranteed criteria

    resources:
      requests:
        cpu: "100m"
        memory: "128Mi"
      limits:
        cpu: "500m"
        memory: "512Mi"
    

  3. BestEffort: No requests or limits (not recommended for production)

Node Pool Efficiency#

Optimize node pool configurations for cost and performance:

  1. Right-size instance types: Match instance capabilities to workload requirements
  2. Use spot instances for fault-tolerant workloads:
    capacityTypes: ["on-demand", "spot"]
    
  3. Configure appropriate TTLs for empty nodes:
    emptyTtl: 300  # 5 minutes for production
    
  4. Set node expiration for automatic refresh:
    expireAfter: 604800s  # 7 days
    
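Points 2 and 3 combine naturally in a single landlord node pool entry (same fields as the tenant configuration example earlier; the pool name, limits, and cost code here are illustrative):

```yaml
nodePools:
  - name: batch-workers                   # illustrative pool name
    instanceTypes:
      - t3a.large
    capacityTypes: ["on-demand", "spot"]  # spot for fault-tolerant workloads
    limits:
      cpu: 20
      memory: 64Gi
    emptyTtl: 300                         # reclaim empty nodes after 5 minutes
    costCode: "CC12345"                   # hypothetical cost code
```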

Monitoring and Optimization#

Resource Utilization Metrics#

Monitor pod and node resource utilization through the platform's monitoring stack:

  1. Prometheus metrics: CPU and memory usage per pod/node
  2. Kubecost: Cost analysis and optimization recommendations
  3. Grafana dashboards: Visualize resource trends

Optimization Process#

  1. Analyze current usage:

    kubectl top pods -n my-namespace
    kubectl top nodes
    

  2. Review resource requests vs actual usage:

    kubectl describe pod my-pod -n my-namespace
    

  3. Adjust resource specifications based on observed patterns

  4. Implement horizontal pod autoscaling for dynamic scaling:

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: my-app-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: my-app
      minReplicas: 2
      maxReplicas: 10
      metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 70
    

Common Configuration Files#

When implementing node sizing, these are the key files to modify:

On-Premise (EKS-A)#

  • nodesize.yaml - standardized node size definitions
  • dvlp.yaml - development environment node pools
  • prod.yaml - production environment node pools

All located in the it-common-platform/infrastructure/eksa-vsphere repository.

AWS (EKS with Karpenter)#

  • cluster-bootstrap/environments/aws/karpenter.tf - base Karpenter installation (eks-cluster repository)
  • templates/nodepool.yaml - node pool template (landlord Helm chart)
  • prod/tenant-config-0.yaml - production tenant node pool definitions (it-common-platform-landlord repository)

Best Practices Summary#

  1. Always specify resource requests and limits for predictable performance
  2. Use appropriate node sizes based on workload requirements
  3. Implement HA patterns (replicas, PDBs, anti-affinity) for critical applications
  4. Enable autoscaling where appropriate (HPA for pods; Karpenter for nodes on AWS, node pool autoscaling on-premise)
  5. Monitor resource utilization and optimize based on actual usage patterns
  6. Configure node TTLs to balance cost and availability
  7. Separate workloads using node pools and taints/tolerations for isolation