Kubernetes Resource Optimization for AI Workloads

GPU scheduling, resource requests, and limit tuning strategies for running AI inference workloads efficiently on Kubernetes.

February 2026 · 10 min read

The Cost Problem with AI on Kubernetes

Running AI inference on Kubernetes is expensive. A single NVIDIA A100 GPU instance costs upward of $3/hour on major cloud providers. Multiply that by a cluster running multiple models across development, staging, and production, and you're looking at a significant monthly bill. The difference between a well-tuned and a poorly-tuned AI cluster can be tens of thousands of dollars per month.

This article covers practical strategies for squeezing maximum value out of every GPU and CPU cycle in your Kubernetes AI infrastructure.

Understanding GPU Scheduling in Kubernetes

Kubernetes treats GPUs as extended resources via device plugins. Unlike CPU and memory, GPUs cannot be overcommitted — if a pod requests 1 GPU, it gets exclusive access to that physical GPU. This makes right-sizing critical:

```yaml
resources:
  requests:
    nvidia.com/gpu: "1"    # Minimum required
    memory: "16Gi"
    cpu: "4"
  limits:
    nvidia.com/gpu: "1"    # Cannot exceed request for GPUs
    memory: "32Gi"
    cpu: "8"
```

Key rules for GPU scheduling:

  • GPU requests and limits must be equal (no overcommit)
  • GPUs are allocated as whole units — you cannot request 0.5 GPUs natively
  • Pods without GPU requests will never be scheduled on GPU nodes (if taints are configured correctly)
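The parenthetical in the last bullet matters: the scheduler only keeps CPU-only pods off GPU nodes if those nodes carry a taint that GPU pods tolerate. A minimal sketch, using the common `nvidia.com/gpu` taint key (adapt the key to your cluster's convention):

```yaml
# Taint each GPU node, e.g.:
#   kubectl taint nodes <gpu-node> nvidia.com/gpu=present:NoSchedule
#
# GPU pods then declare a matching toleration:
tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
```

If the `ExtendedResourceToleration` admission plugin is enabled, Kubernetes adds this toleration automatically to any pod that requests an extended resource such as `nvidia.com/gpu`.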

GPU Sharing with Time-Slicing and MIG

For workloads that don't need a full GPU, NVIDIA offers two sharing mechanisms:

Time-Slicing allows multiple pods to share a single GPU by rapidly switching between them. Configure it via the NVIDIA device plugin:

```yaml
# nvidia-device-plugin ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin
data:
  config.yaml: |
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4
```

This exposes each physical GPU as 4 virtual GPUs. Ideal for lightweight inference workloads like embedding generation or small model serving.

Multi-Instance GPU (MIG) physically partitions A100/H100 GPUs into isolated instances, each with dedicated memory and compute. This provides stronger isolation than time-slicing but requires MIG-capable hardware.
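When MIG is enabled and the device plugin runs in its "mixed" strategy, each slice profile is exposed as its own resource name. A sketch, assuming a 1g.5gb profile on an A100 (the exact resource names depend on your MIG configuration):

```yaml
resources:
  requests:
    nvidia.com/mig-1g.5gb: "1"   # one MIG slice, not a whole GPU
  limits:
    nvidia.com/mig-1g.5gb: "1"
```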

Right-Sizing CPU and Memory for AI Pods

AI inference pods have unique resource profiles. Common mistakes include:

Over-requesting CPU — Most inference workloads are GPU-bound, not CPU-bound. A model serving pod typically needs 2-4 CPU cores for request handling and pre/post-processing. Requesting 16 cores wastes schedulable resources.

Under-requesting memory — Model weights must fit in RAM before being loaded to GPU VRAM. A 7B parameter model needs roughly 14GB of RAM for loading (at FP16), plus memory for request buffering. Set memory requests based on model size plus a 30% buffer.
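The sizing rule above is simple arithmetic: 2 bytes per parameter at FP16, plus the 30% buffer. A small helper sketching the calculation (the function name and defaults are illustrative, not from any library):

```python
import math

def memory_request_gib(params_billions: float,
                       bytes_per_param: int = 2,   # FP16
                       buffer: float = 0.30) -> int:
    """Suggested pod memory request (GiB) for loading model weights."""
    weight_bytes = params_billions * 1e9 * bytes_per_param
    gib = weight_bytes / 2**30
    return math.ceil(gib * (1 + buffer))

print(memory_request_gib(7))   # 7B model at FP16 -> 17
print(memory_request_gib(13))  # 13B model at FP16 -> 32
```

Round the result up to a value that fits cleanly into your node sizes; a request that leaves awkward remainders strands memory on the node.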

Ignoring ephemeral storage — Model downloads and temporary computation can consume significant ephemeral storage. Set limits to prevent node disk pressure:

```yaml
resources:
  requests:
    ephemeral-storage: "20Gi"
  limits:
    ephemeral-storage: "50Gi"
```

Node Pool Strategy

A multi-pool architecture separates concerns and optimizes cost:

| Pool | Instance Type | Purpose |
| --- | --- | --- |
| system | c6i.xlarge | Control plane, monitoring, ingress |
| cpu-workers | m6i.2xlarge | API servers, preprocessing, queues |
| gpu-inference | g5.2xlarge | Model serving (single GPU) |
| gpu-training | p4d.24xlarge | Fine-tuning jobs (multi-GPU) |

Use node affinity and taints to ensure workloads land on the right pool:

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: node-pool
              operator: In
              values: ["gpu-inference"]
```

Autoscaling AI Workloads

The Horizontal Pod Autoscaler (HPA) alone doesn't work well for GPU workloads because GPU utilization metrics aren't available by default. Use KEDA with custom Prometheus metrics:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: inference-scaler
spec:
  scaleTargetRef:
    name: model-serving
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090  # adjust to your Prometheus endpoint
        query: |
          avg(rate(inference_request_duration_seconds_count[5m]))
        threshold: "50"
```

Combine pod autoscaling with Cluster Autoscaler or Karpenter to automatically provision GPU nodes when demand spikes. Karpenter is particularly effective because it can select the cheapest available GPU instance type that meets your requirements.
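A Karpenter NodePool restricted to GPU instance families illustrates the pattern. This is a sketch against the Karpenter v1 API on AWS; the pool name, instance families, and EC2NodeClass reference are assumptions to adapt:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-inference
spec:
  template:
    spec:
      requirements:
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["g5", "g6"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]   # prefers spot when available
      taints:
        - key: nvidia.com/gpu
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  limits:
    nvidia.com/gpu: 8   # cap total GPUs this pool may provision
```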

Spot Instances for Non-Critical Workloads

Use spot/preemptible GPU instances for batch inference, evaluation jobs, and development environments. Implement graceful shutdown handling:

```python
import signal
import sys

def handle_termination(signum, frame):
    # Finish the current inference request, save a checkpoint if
    # applicable, and drain connections before exiting.
    sys.exit(0)

signal.signal(signal.SIGTERM, handle_termination)
```

With proper interruption handling, spot instances can reduce GPU costs by 60-70%.
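The handler only helps if Kubernetes gives the pod time to run it. On AWS, spot interruptions come with roughly a two-minute warning, so the pod's grace period should fit inside that window (a sketch; tune the value to your longest expected request):

```yaml
spec:
  terminationGracePeriodSeconds: 90  # default is 30s; allow in-flight inference to finish
```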

Monitoring and Alerting

Essential metrics for AI workload optimization:

  • GPU Utilization — Target 70-85%. Below 50% means you're overpaying; above 90% means queuing
  • GPU Memory — Track allocation vs. usage. Alert at 85% to prevent OOM
  • Inference Latency — p99 latency is your SLA metric. Alert on sustained increases
  • Queue Depth — Rising queue depth signals you need to scale out
  • Cost per Inference — Divide total compute cost by inference count for unit economics
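With the NVIDIA DCGM exporter installed, the first two bullets map to straightforward Prometheus alert rules. A sketch using dcgm-exporter metric names and the thresholds above (rule names and label values are illustrative):

```yaml
groups:
  - name: gpu-alerts
    rules:
      - alert: GPUUnderutilized
        expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]) < 50
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "GPU below 50% utilization; consider downsizing or sharing"
      - alert: GPUMemoryPressure
        expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 0.85
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "GPU memory above 85%; risk of OOM"
```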

Conclusion

Kubernetes resource optimization for AI workloads isn't optional — it's the difference between a sustainable AI platform and a money pit. Start with proper node pool segmentation, right-size your GPU and memory requests, implement GPU sharing for lightweight workloads, and build autoscaling around custom inference metrics. At MBB AI Studio, we typically achieve 40-60% cost reduction for clients through these optimizations without sacrificing inference quality or latency.