Why Private LLM Deployments Matter
Organizations handling sensitive data — healthcare records, financial documents, proprietary code — cannot afford to send that information to third-party APIs. Private LLM deployments give you full control over data residency, latency, and cost predictability. With open-source models like Llama 3, Mistral, and Phi-3 reaching impressive quality benchmarks, self-hosting has become a viable production strategy.
Ollama simplifies model management by bundling the inference runtime into a single binary and handling model downloads, storage, and versioning for you, making it a strong fit for Kubernetes-native deployments.
Prerequisites
Before getting started, ensure your cluster has the following:
- A Kubernetes cluster (v1.28+) with at least one GPU-capable node
- The NVIDIA GPU Operator installed (or equivalent AMD ROCm support)
- kubectl and Helm configured
- A container registry accessible from your cluster
- Prometheus and Grafana for observability (optional but recommended)
Setting Up GPU Nodes
GPU scheduling in Kubernetes requires the NVIDIA device plugin. After installing the GPU Operator via Helm, verify that your nodes advertise GPU resources:
```shell
kubectl get nodes -o json | jq '.items[].status.allocatable["nvidia.com/gpu"]'
```

For production workloads, we recommend dedicating GPU nodes with taints and tolerations. This prevents non-GPU workloads from being scheduled on expensive GPU instances:
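In practice, taints and labels are usually applied imperatively rather than by editing Node manifests; a sketch (the node name is illustrative):

```shell
# Taint the GPU node so only pods with a matching toleration land there,
# and label it so workloads can target this GPU type
kubectl taint nodes gpu-node-1 nvidia.com/gpu=true:NoSchedule
kubectl label nodes gpu-node-1 gpu-type=a100
```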
```yaml
apiVersion: v1
kind: Node
metadata:
  labels:
    gpu-type: a100
spec:
  taints:
  - key: nvidia.com/gpu
    value: "true"
    effect: NoSchedule
```

Deploying Ollama on Kubernetes
Create a Deployment that runs Ollama with GPU access. The key is requesting the nvidia.com/gpu resource and mounting a persistent volume for model storage so models survive pod restarts:
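The claim name used below, ollama-models-pvc, is not defined elsewhere in this walkthrough; a minimal sketch, assuming your storage class supports ReadWriteMany (needed because two replicas mount the same claim), might look like:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-models-pvc
  namespace: ai-inference
spec:
  # ReadWriteMany because two replicas share this claim; with a
  # ReadWriteOnce-only storage class, use a single replica or a
  # StatefulSet with volumeClaimTemplates instead
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 200Gi  # size generously; large models run tens of GB each
```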
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ai-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - name: ollama
        image: ollama/ollama:latest
        ports:
        - containerPort: 11434
        resources:
          requests:
            memory: "16Gi"
            cpu: "4"
            nvidia.com/gpu: "1"
          limits:
            memory: "32Gi"
            nvidia.com/gpu: "1"
        volumeMounts:
        - name: model-storage
          mountPath: /root/.ollama
      volumes:
      - name: model-storage
        persistentVolumeClaim:
          claimName: ollama-models-pvc
```

Model Pre-Loading Strategy
In production, you don't want cold starts when the first request arrives. Use an init container or a Job to pre-pull models before traffic hits your pods:
```yaml
initContainers:
- name: model-loader
  image: ollama/ollama:latest
  # "ollama pull" is a client command that talks to the Ollama server,
  # so start the server in the background before pulling
  command: ["sh", "-c", "ollama serve & sleep 5; ollama pull llama3:70b && ollama pull mistral:7b"]
  volumeMounts:
  - name: model-storage
    mountPath: /root/.ollama
```

This ensures that when the main Ollama container starts, the models are already available on disk, eliminating download latency.
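To avoid routing traffic to a pod before the server is actually responsive, you can also add a readiness probe to the main container. Ollama's API exposes GET /api/tags, which lists locally available models; a sketch:

```yaml
readinessProbe:
  httpGet:
    path: /api/tags   # returns 200 once the server is up
    port: 11434
  initialDelaySeconds: 10
  periodSeconds: 15
```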
Service and Ingress Configuration
Expose Ollama via a ClusterIP service for internal access, or use an Ingress if you need external access with TLS termination:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: ollama-service
  namespace: ai-inference
spec:
  selector:
    app: ollama
  ports:
  - port: 11434
    targetPort: 11434
  type: ClusterIP
```

For production, put an API gateway or authentication proxy in front of the service to manage rate limiting and access control.
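If you do need external access, an Ingress with TLS termination might look like the following (the hostname, ingress class, and TLS secret name are placeholders for your environment):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ollama-ingress
  namespace: ai-inference
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - llm.example.com
    secretName: ollama-tls   # created separately, e.g. via cert-manager
  rules:
  - host: llm.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: ollama-service
            port:
              number: 11434
```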
Observability with Prometheus and Grafana
Monitoring GPU utilization and inference latency is critical. Deploy the DCGM exporter alongside your GPU nodes to expose GPU metrics to Prometheus:
```yaml
# ServiceMonitor for DCGM GPU metrics
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  endpoints:
  - port: metrics
    interval: 15s
```

Key metrics to track include GPU utilization percentage, GPU memory usage, inference request latency (p50, p95, p99), model load time, and requests per second per replica.
Build Grafana dashboards that visualize these metrics alongside standard Kubernetes resource usage. Alert on GPU memory approaching limits, as this causes OOM kills that are expensive to recover from.
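As one example, the GPU-memory alert can be expressed as a PrometheusRule using the DCGM exporter's framebuffer metrics (DCGM_FI_DEV_FB_USED and DCGM_FI_DEV_FB_FREE); the 90% threshold is a starting point, not a recommendation:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-memory-alerts
spec:
  groups:
  - name: gpu.rules
    rules:
    - alert: GPUMemoryNearLimit
      # fraction of framebuffer memory in use, per GPU
      expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 0.9
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "GPU {{ $labels.gpu }} memory above 90% for 5 minutes"
```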
Scaling Strategies
For horizontal scaling, use KEDA (Kubernetes Event-Driven Autoscaling) to scale Ollama replicas based on request queue depth or GPU utilization:
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ollama-scaler
spec:
  scaleTargetRef:
    name: ollama
  minReplicaCount: 2
  maxReplicaCount: 8
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus:9090
      metricName: ollama_request_duration_seconds_count
      query: rate(ollama_request_duration_seconds_count[2m])
      threshold: "10"
```

Cost Optimization Tips
GPU instances are expensive. Here are strategies we use at MBB AI Studio to keep costs manageable:
- Right-size your models — A 7B parameter model on a single GPU often outperforms a 70B model for focused tasks with good prompting
- Use spot/preemptible instances for non-critical inference workloads with graceful shutdown handling
- Implement request batching to maximize GPU throughput per dollar
- Schedule scale-down during off-peak hours using CronJobs or KEDA scheduled scalers
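The scheduled scale-down in the last tip can be expressed as a KEDA cron trigger added to the ScaledObject's trigger list (timezone, hours, and replica count are illustrative):

```yaml
triggers:
- type: cron
  metadata:
    timezone: America/New_York
    start: 0 8 * * *        # scale up for business hours
    end: 0 20 * * *         # drop back to minReplicaCount after hours
    desiredReplicas: "4"
```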
Conclusion
Deploying private LLMs on Kubernetes with Ollama is no longer a niche experiment — it's a production-ready pattern. The combination of Kubernetes orchestration, GPU scheduling, and Ollama's simplicity gives engineering teams a powerful, cost-controlled, and fully private AI inference platform. At MBB AI Studio, we've deployed this architecture for clients across healthcare and financial services, and the results speak for themselves: sub-200ms inference latency, zero data leakage, and predictable monthly costs.