Why Private LLM Deployments Matter
Organizations handling sensitive data — healthcare records, financial documents, proprietary code — cannot afford to send that information to third-party APIs. Private LLM deployments give you full control over data residency, latency, and cost predictability. With open-source models like Llama 3, Mistral, and Phi-3 reaching impressive quality benchmarks, self-hosting has become a viable production strategy.
Ollama simplifies model management by bundling the inference runtime into a single binary and handling model downloads, storage, and versioning for you, making it a strong fit for Kubernetes-native deployments.
Prerequisites
Before getting started, ensure your cluster has the following:
- A Kubernetes cluster (v1.28+) with at least one GPU-capable node
- The NVIDIA GPU Operator installed (or equivalent AMD ROCm support)
- kubectl and Helm configured
- A container registry accessible from your cluster
- Prometheus and Grafana for observability (optional but recommended)
Setting Up GPU Nodes
GPU scheduling in Kubernetes requires the NVIDIA device plugin. After installing the GPU Operator via Helm, verify that your nodes advertise GPU resources:
```shell
kubectl get nodes -o json | jq '.items[].status.allocatable["nvidia.com/gpu"]'
```

For production workloads, we recommend dedicating GPU nodes with taints and tolerations. This prevents non-GPU workloads from being scheduled on expensive GPU instances:
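In practice, taints and labels are usually applied imperatively rather than by editing Node manifests; a sketch (the node name is illustrative):

```shell
# Taint the GPU node so only pods with a matching toleration land there,
# and label it so workloads can target this GPU type
kubectl taint nodes gpu-node-1 nvidia.com/gpu=true:NoSchedule
kubectl label nodes gpu-node-1 gpu-type=a100
```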
```yaml
apiVersion: v1
kind: Node
metadata:
  labels:
    gpu-type: a100
spec:
  taints:
  - key: nvidia.com/gpu
    value: "true"
    effect: NoSchedule
```

Deploying Ollama on Kubernetes
Create a Deployment that runs Ollama with GPU access. The key is requesting the nvidia.com/gpu resource and mounting a persistent volume for model storage so models survive pod restarts:
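The claim name used below, ollama-models-pvc, is not defined elsewhere in this walkthrough; a minimal sketch, assuming your storage class supports ReadWriteMany (needed because two replicas mount the same claim), might look like:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-models-pvc
  namespace: ai-inference
spec:
  # ReadWriteMany because two replicas share this claim; with a
  # ReadWriteOnce-only storage class, use a single replica or a
  # StatefulSet with volumeClaimTemplates instead
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 200Gi  # size generously; large models run tens of GB each
```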
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ai-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - name: ollama
        image: ollama/ollama:latest
        ports:
        - containerPort: 11434
        resources:
          requests:
            memory: "16Gi"
            cpu: "4"
            nvidia.com/gpu: "1"
          limits:
            memory: "32Gi"
            nvidia.com/gpu: "1"
        volumeMounts:
        - name: model-storage
          mountPath: /root/.ollama
      volumes:
      - name: model-storage
        persistentVolumeClaim:
          claimName: ollama-models-pvc
```

Model Pre-Loading Strategy
In production, you don't want cold starts when the first request arrives. Use an init container or a Job to pre-pull models before traffic hits your pods:
```yaml
initContainers:
- name: model-loader
  image: ollama/ollama:latest
  # "ollama pull" is a client command that talks to the Ollama server,
  # so start the server in the background before pulling
  command: ["sh", "-c", "ollama serve & sleep 5; ollama pull llama3:70b && ollama pull mistral:7b"]
  volumeMounts:
  - name: model-storage
    mountPath: /root/.ollama
```

This ensures that when the main Ollama container starts, the models are already available on disk, eliminating download latency.
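To avoid routing traffic to a pod before the server is actually responsive, you can also add a readiness probe to the main container. Ollama's API exposes GET /api/tags, which lists locally available models; a sketch:

```yaml
readinessProbe:
  httpGet:
    path: /api/tags   # returns 200 once the server is up
    port: 11434
  initialDelaySeconds: 10
  periodSeconds: 15
```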
Service and Ingress Configuration
Expose Ollama via a ClusterIP service for internal access, or use an Ingress if you need external access with TLS termination:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: ollama-service
  namespace: ai-inference
spec:
  selector:
    app: ollama
  ports:
  - port: 11434
    targetPort: 11434
  type: ClusterIP
```

For production, put an API gateway or authentication proxy in front of the service to manage rate limiting and access control.
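If you do need external access, an Ingress with TLS termination might look like the following (the hostname, ingress class, and TLS secret name are placeholders for your environment):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ollama-ingress
  namespace: ai-inference
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - llm.example.com
    secretName: ollama-tls   # created separately, e.g. via cert-manager
  rules:
  - host: llm.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: ollama-service
            port:
              number: 11434
```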
Observability with Prometheus and Grafana
Monitoring GPU utilization and inference latency is critical. Deploy the DCGM exporter alongside your GPU nodes to expose GPU metrics to Prometheus:
```yaml
# ServiceMonitor for DCGM GPU metrics
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  endpoints:
  - port: metrics
    interval: 15s
```

Key metrics to track include GPU utilization percentage, GPU memory usage, inference request latency (p50, p95, p99), model load time, and requests per second per replica.
Build Grafana dashboards that visualize these metrics alongside standard Kubernetes resource usage. Alert on GPU memory approaching limits, as this causes OOM kills that are expensive to recover from.
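As one example, the GPU-memory alert can be expressed as a PrometheusRule using the DCGM exporter's framebuffer metrics (DCGM_FI_DEV_FB_USED and DCGM_FI_DEV_FB_FREE); the 90% threshold is a starting point, not a recommendation:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-memory-alerts
spec:
  groups:
  - name: gpu.rules
    rules:
    - alert: GPUMemoryNearLimit
      # fraction of framebuffer memory in use, per GPU
      expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 0.9
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "GPU {{ $labels.gpu }} memory above 90% for 5 minutes"
```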
Scaling Strategies
For horizontal scaling, use KEDA (Kubernetes Event-Driven Autoscaling) to scale Ollama replicas based on request queue depth or GPU utilization:
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ollama-scaler
spec:
  scaleTargetRef:
    name: ollama
  minReplicaCount: 2
  maxReplicaCount: 8
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus:9090
      metricName: ollama_request_duration_seconds_count
      query: rate(ollama_request_duration_seconds_count[2m])
      threshold: "10"
```

Cost Optimization Tips
GPU instances are expensive. Here are strategies we use at MBB AI Studio to keep costs manageable:
- Right-size your models — A 7B parameter model on a single GPU often outperforms a 70B model for focused tasks with good prompting
- Use spot/preemptible instances for non-critical inference workloads with graceful shutdown handling
- Implement request batching to maximize GPU throughput per dollar
- Schedule scale-down during off-peak hours using CronJobs or KEDA scheduled scalers
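The scheduled scale-down in the last tip can be expressed as a KEDA cron trigger added to the ScaledObject's trigger list (timezone, hours, and replica count are illustrative):

```yaml
triggers:
- type: cron
  metadata:
    timezone: America/New_York
    start: 0 8 * * *        # scale up for business hours
    end: 0 20 * * *         # drop back to minReplicaCount after hours
    desiredReplicas: "4"
```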
Conclusion
Deploying private LLMs on Kubernetes with Ollama is no longer a niche experiment — it's a production-ready pattern. The combination of Kubernetes orchestration, GPU scheduling, and Ollama's simplicity gives engineering teams a powerful, cost-controlled, and fully private AI inference platform. At MBB AI Studio, we've deployed this architecture for clients across healthcare and financial services, and the results speak for themselves: sub-200ms inference latency, zero data leakage, and predictable monthly costs.