Why GitOps for AI?
AI model deployments have a unique challenge: you're not just deploying code — you're deploying code plus model weights plus configuration plus inference parameters. Traditional CI/CD pipelines struggle with this multi-artifact lifecycle. GitOps, with its declarative and version-controlled approach, provides an elegant solution.
ArgoCD watches your Git repository and continuously reconciles your Kubernetes cluster to match the declared state. When you update a model version in Git, ArgoCD automatically rolls it out. When something breaks, you git revert and ArgoCD rolls it back. Every deployment is auditable, reproducible, and reviewable through standard pull request workflows.
Repository Structure for AI Deployments
We recommend a structured monorepo approach for AI model deployments:
ai-deployments/
  base/
    model-serving/
      deployment.yaml
      service.yaml
      hpa.yaml
      configmap.yaml
    monitoring/
      servicemonitor.yaml
      alerts.yaml
  overlays/
    development/
      kustomization.yaml
      model-config.yaml
    staging/
      kustomization.yaml
      model-config.yaml
    production/
      kustomization.yaml
      model-config.yaml
  models/
    llama3-70b/
      model-config.yaml   # version, parameters, resource requirements
    mistral-7b/
      model-config.yaml

Using Kustomize overlays lets you maintain environment-specific configuration (smaller replica counts in dev, larger resource limits in prod) while sharing the same base templates.
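For illustration, a production overlay's kustomization.yaml might look like the sketch below; the patch filename, Deployment name, and replica count are assumptions, not prescriptions:

```yaml
# overlays/production/kustomization.yaml (illustrative sketch)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base/model-serving
  - ../../base/monitoring
patches:
  - path: model-config.yaml   # environment-specific model settings
replicas:
  - name: model-serving       # scale the base Deployment up for prod
    count: 5
```

The dev and staging overlays would differ only in their patch contents and replica counts, which keeps environment drift visible in a single `git diff`.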
Configuring ArgoCD Applications
Define an ArgoCD Application for each model deployment:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: llama3-production
  namespace: argocd
spec:
  project: ai-models
  source:
    repoURL: https://github.com/your-org/ai-deployments
    targetRevision: main
    path: overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: ai-inference
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
    retry:
      limit: 3
      backoff:
        duration: 30s
        factor: 2

The selfHeal option is important — it reverts any manual kubectl changes, ensuring Git remains the single source of truth.
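The Application above references an ai-models AppProject, which scopes what its Applications may deploy and where. A minimal sketch follows; the allowed repo and namespace mirror the Application above, but the exact restrictions are assumptions you should tighten for your environment:

```yaml
# AppProject constraining the ai-models Applications (illustrative sketch)
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: ai-models
  namespace: argocd
spec:
  sourceRepos:
    - https://github.com/your-org/ai-deployments
  destinations:
    - server: https://kubernetes.default.svc
      namespace: ai-inference
  clusterResourceWhitelist:
    - group: ""
      kind: Namespace   # required because the sync policy uses CreateNamespace=true
```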
Model Version Management
Store model versions and configurations as Kubernetes ConfigMaps versioned in Git:
apiVersion: v1
kind: ConfigMap
metadata:
  name: model-config
data:
  MODEL_NAME: "llama3"
  MODEL_VERSION: "3.1-70b-instruct"
  MODEL_REVISION: "a2431349ba"
  MAX_BATCH_SIZE: "32"
  MAX_SEQUENCE_LENGTH: "4096"
  QUANTIZATION: "awq-int4"
  TEMPERATURE_DEFAULT: "0.7"

When you need to update the model version, you open a PR that changes the MODEL_VERSION and MODEL_REVISION fields. The PR triggers CI tests (smoke tests, latency benchmarks, quality evaluations), and upon merge ArgoCD deploys the new version.
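One way to surface these values to the serving container is envFrom in the base Deployment; the container name and image below are illustrative placeholders:

```yaml
# base/model-serving/deployment.yaml (container excerpt, illustrative)
containers:
  - name: model-serving
    image: your-registry/model-server:latest
    envFrom:
      - configMapRef:
          name: model-config   # MODEL_VERSION etc. become env vars
```

Note that environment variables sourced from a ConfigMap are read only at pod startup, so a config-only change typically needs a rollout restart (or a pod-template checksum annotation) to take effect.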
Progressive Rollouts with Argo Rollouts
For AI models, a bad deployment can silently degrade quality without crashing. Combine ArgoCD with Argo Rollouts for canary deployments that validate model quality:
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: model-serving
spec:
  replicas: 5
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: { duration: 5m }
        - analysis:
            templates:
              - templateName: model-quality-check
        - setWeight: 50
        - pause: { duration: 10m }
        - analysis:
            templates:
              - templateName: model-quality-check
        - setWeight: 100
      canaryMetadata:
        labels:
          role: canary
      stableMetadata:
        labels:
          role: stable

The AnalysisTemplate runs automated quality checks at each stage:
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: model-quality-check
spec:
  metrics:
    - name: inference-latency
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            histogram_quantile(0.99,
              rate(inference_duration_seconds_bucket{role="canary"}[5m]))
      successCondition: result[0] < 0.5
    - name: error-rate
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            rate(inference_errors_total{role="canary"}[5m])
            / rate(inference_requests_total{role="canary"}[5m])
      successCondition: result[0] < 0.01

If p99 latency exceeds 500 ms or the error rate exceeds 1%, the rollout automatically aborts and shifts traffic back to the stable version.
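If you also run offline quality evaluations, Argo Rollouts' job provider lets an AnalysisTemplate metric gate on the result of an evaluation job. A hedged sketch of such a metric follows; the eval image, its arguments, and the score threshold are all assumptions about a hypothetical evaluation harness:

```yaml
# Additional metric entry using the job provider (illustrative sketch)
- name: quality-eval
  provider:
    job:
      spec:
        backoffLimit: 0
        template:
          spec:
            restartPolicy: Never
            containers:
              - name: eval
                image: your-registry/model-eval:latest   # hypothetical eval harness
                args: ["--target", "canary", "--min-score", "0.85"]
```

The metric passes if the Job completes successfully, so the eval container should exit non-zero when the canary's score falls below the threshold.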
Secrets Management
AI deployments often need API keys (for embedding services, monitoring, etc.) and model registry credentials. Never store these in Git. Use External Secrets Operator with ArgoCD:
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: model-registry-creds
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager
    kind: ClusterSecretStore
  target:
    name: model-registry-creds
  data:
    - secretKey: username
      remoteRef:
        key: /ai-platform/model-registry
        property: username
    - secretKey: password
      remoteRef:
        key: /ai-platform/model-registry
        property: password

Multi-Cluster Deployments
For organizations running AI across multiple clusters (edge, regional, cloud), ArgoCD's ApplicationSet controller generates Applications dynamically:
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: model-serving-global
spec:
  generators:
    - clusters:
        selector:
          matchLabels:
            ai-capable: "true"
  template:
    metadata:
      name: 'model-serving-{{name}}'
    spec:
      source:
        repoURL: https://github.com/your-org/ai-deployments
        path: 'overlays/{{metadata.labels.environment}}'
      destination:
        server: '{{server}}'
        namespace: ai-inference

Rollback Strategy
One of GitOps' greatest strengths is rollback simplicity. If a new model version degrades performance:
1. Open a revert PR for the model config change
2. Merge it (or have ArgoCD sync from a previous commit)
3. ArgoCD reconciles, redeploying the previous model version
The entire rollback is tracked in Git history, providing a clear audit trail of what changed, when, and why.
Conclusion
GitOps with ArgoCD transforms AI model deployments from ad-hoc processes into reliable, auditable, and automated workflows. The combination of declarative configuration, progressive rollouts with quality gates, and simple Git-based rollbacks gives teams confidence to deploy model updates frequently. At MBB AI Studio, GitOps is a cornerstone of every AI platform we build — it brings the same rigor to model deployments that software engineering has long applied to application code.