Running Ollama in Docker and Kubernetes

How to Deploy Ollama with Docker and Kubernetes

Running Ollama on your laptop is great for testing. But the moment you want to ship it in a real product, with uptime, scaling, and team access, you need it containerized and orchestrated properly. This guide walks you from a single Docker container to a working Kubernetes deployment, step by step.

⏱️ Time to Complete

Around 20–30 minutes for the full Docker + Kubernetes setup.

🎯 What you'll learn

  • How to run Ollama in a Docker container with GPU access
  • How to persist models with Docker volumes (so you don't re-download them every time the container is recreated)
  • A ready-to-use docker-compose.yml for local and small production setups
  • How to deploy Ollama on Kubernetes with a GPU-enabled pod
  • Health checks and basic scaling considerations

📋 Prerequisites

  • Docker installed and running
  • (Optional, for GPU support) NVIDIA Container Toolkit installed on the host
  • (For the Kubernetes section) A working cluster — Minikube is fine for testing, a managed cluster (EKS/GKE/AKS) for production

🐳 Step 1: Run Ollama in a Single Docker Container

[Figure: Ollama running in a Docker container with a persistent model volume]

The official Ollama image makes this a one-liner:

docker run -d \
  --name ollama \
  -p 11434:11434 \
  -v ollama_data:/root/.ollama \
  ollama/ollama

What this does:

  • -d runs it in the background
  • -p 11434:11434 exposes the Ollama API on your host
  • -v ollama_data:/root/.ollama persists downloaded models in a named volume, so they survive the container being removed and recreated

Pull and run a model exactly like you would natively:

docker exec -it ollama ollama run llama3.2

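[!TIP] Without the volume mount, models live in the container's writable layer. They survive a plain docker restart, but they are lost as soon as the container is removed and recreated (for example, when you upgrade the image), and re-downloading a multi-gigabyte model every time gets old fast.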
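
Once a model has been pulled, you can exercise the published port from the host. The following is a minimal sketch, assuming curl is installed and using Ollama's /api/tags and /api/generate endpoints. The OLLAMA_URL variable and the list_models and generate helper names are illustrative, not part of Ollama.

```shell
# Base URL of the API published by the container (port from the docker run above).
OLLAMA_URL="${OLLAMA_URL:-http://localhost:11434}"

# List the models currently stored in the volume.
list_models() {
  curl -fsS "$OLLAMA_URL/api/tags"
}

# Send one non-streaming prompt. $1 = model name, $2 = prompt text.
# The prompt is inserted verbatim into the JSON body, so this sketch
# only handles prompts without double quotes or backslashes.
generate() {
  curl -fsS "$OLLAMA_URL/api/generate" \
    -d "{\"model\": \"$1\", \"prompt\": \"$2\", \"stream\": false}"
}
```

For example, generate llama3.2 "Why is the sky blue?" should return a single JSON object rather than a stream of chunks, because stream is set to false.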


🎮 Step 2: Enable GPU Access in Docker

[Figure: NVIDIA GPU passthrough connecting host GPU hardware into an Ollama Docker container]

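By default, Docker containers can't see your GPU. If you have an NVIDIA GPU and want Ollama to use it, start the container with --gpus all. If the container from Step 1 still exists, remove it first with docker rm -f ollama (the named volume keeps your models):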

docker run -d \
  --name ollama \
  --gpus all \
  -p 11434:11434 \
  -v ollama_data:/root/.ollama \
  ollama/ollama

The --gpus all flag requires the NVIDIA Container Toolkit to be installed on the host machine first. Verify it's working:

docker exec -it ollama nvidia-smi

If you see your GPU listed, you're good. If not, Ollama will silently fall back to CPU — which still works, just slower.
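
nvidia-smi only shows that the GPU is visible inside the container; it does not confirm that a loaded model is actually running on it. One way to catch the silent CPU fallback is to look at ollama ps, whose PROCESSOR column reports values such as 100% GPU or 100% CPU for each loaded model. The sketch below assumes the container is named ollama and that a model has been run recently enough to still be loaded. check_gpu_offload is a hypothetical helper name.

```shell
# Report whether any currently loaded model is using the GPU.
# Assumes the container is named "ollama", as in the commands above.
check_gpu_offload() {
  # "ollama ps" lists loaded models; the PROCESSOR column reads e.g. "100% GPU".
  # A partial offload also mentions GPU, so it counts as "gpu" here.
  if docker exec ollama ollama ps | grep -q 'GPU'; then
    echo "gpu"
  else
    # Either the model runs on CPU or no model is loaded right now.
    echo "cpu-or-idle"
  fi
}
```

Run a prompt first, then call check_gpu_offload while the model is still resident in memory.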


📦 Step 3: Docker Compose for a Repeatable Setup

[Figure: Docker Compose stack connecting Ollama service, API port, GPU access, and persistent volume]

For anything beyond a quick test, use docker-compose.yml so your whole team can spin up the same environment:

services:
  ollama:
    image: ollama/ollama
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  ollama_data:

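Start it (if a container named ollama is still around from the earlier steps, remove it first with docker rm -f ollama):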

docker compose up -d

[!NOTE] The deploy.resources.reservations.devices block is honored by Docker Compose V2 (the docker compose plugin) and by docker-compose 1.28 or later. Swarm mode is not required, but the NVIDIA Container Toolkit must be installed on the host. On older Compose releases, runtime: nvidia was the usual workaround. Test with nvidia-smi inside the container to confirm.
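
docker compose up -d returns as soon as the container is created, which can be before the API is accepting requests. If you script the setup, a small wait loop avoids racing the server. This is a sketch, assuming curl is available on the host; wait_for_ollama and the WAIT_INTERVAL variable are hypothetical names.

```shell
# Poll the API root until it answers or the attempt budget runs out.
# $1 = base URL (default http://localhost:11434), $2 = max attempts (default 30).
wait_for_ollama() {
  url="${1:-http://localhost:11434}"
  attempts="${2:-30}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f makes curl fail on HTTP errors, so success means the server is up.
    if curl -fsS "$url/" >/dev/null 2>&1; then
      echo "ready"
      return 0
    fi
    i=$((i + 1))
    sleep "${WAIT_INTERVAL:-2}"
  done
  echo "timeout"
  return 1
}
```

A typical sequence would be docker compose up -d, then wait_for_ollama, then docker compose exec -T ollama ollama pull llama3.2 to pre-pull a model into the volume.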


☸️ Step 4: Deploy Ollama on Kubernetes

[Figure: Ollama Kubernetes deployment with GPU pod, persistent volume, service endpoint, and health checks]

For real production scale, Kubernetes gives you restarts, scaling, and rolling updates for free. Here's a minimal but functional deployment.

Deployment manifest

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama
          ports:
            - containerPort: 11434
          resources:
            limits:
              nvidia.com/gpu: 1
          volumeMounts:
            - name: ollama-data
              mountPath: /root/.ollama
      volumes:
        - name: ollama-data
          persistentVolumeClaim:
            claimName: ollama-pvc

Persistent storage claim

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi

Service to expose it inside the cluster

apiVersion: v1
kind: Service
metadata:
  name: ollama
spec:
  selector:
    app: ollama
  ports:
    - port: 11434
      targetPort: 11434
  type: ClusterIP

Save the three manifests as ollama-pvc.yaml, ollama-deployment.yaml, and ollama-service.yaml, then apply them:

kubectl apply -f ollama-pvc.yaml
kubectl apply -f ollama-deployment.yaml
kubectl apply -f ollama-service.yaml

[!IMPORTANT] The nvidia.com/gpu: 1 limit only works if your cluster has GPU-enabled nodes and the NVIDIA device plugin for Kubernetes installed. Without them, the pod stays Pending because no node can satisfy the request; remove the resources.limits block to run on CPU.
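
After applying the manifests, it helps to confirm that the rollout actually finished and that the claim was bound before sending traffic. The sketch below assumes the resource names used above (deployment ollama, label app=ollama, claim ollama-pvc) and the current namespace. verify_ollama_rollout and pull_model_in_cluster are hypothetical helper names.

```shell
# Wait for the Deployment to become available, then show the pod and the claim.
# $1 = rollout timeout (default 300s).
verify_ollama_rollout() {
  kubectl rollout status deployment/ollama --timeout="${1:-300s}" || return 1
  kubectl get pods -l app=ollama -o wide
  kubectl get pvc ollama-pvc
}

# Pull a model into the persistent volume from inside the running pod.
# $1 = model name (default llama3.2).
pull_model_in_cluster() {
  kubectl exec deployment/ollama -- ollama pull "${1:-llama3.2}"
}
```

Because the Service is of type ClusterIP, it is not reachable from your workstation directly. For a quick test, kubectl port-forward service/ollama 11434:11434 exposes it on localhost.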


🩺 Step 5: Add a Health Check

Don't let Kubernetes (or Docker) think Ollama is healthy when it isn't. Add basic probes to the ollama container in the Deployment manifest, at the same indentation level as ports and resources:

          readinessProbe:
            httpGet:
              path: /
              port: 11434
            initialDelaySeconds: 10
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /
              port: 11434
            initialDelaySeconds: 30
            periodSeconds: 30

This makes sure traffic only routes to the pod once Ollama's API actually responds, and restarts the container automatically if the API stops responding.
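
To see what the readiness probe currently reports, you can read the ready flag from the pod status. This is a sketch under the same naming assumptions as the manifests above (label app=ollama); ollama_pods_ready is a hypothetical helper name.

```shell
# Print "ready" only if every container in pods labelled app=ollama reports ready.
ollama_pods_ready() {
  # jsonpath prints one true/false per container, separated by spaces.
  states="$(kubectl get pods -l app=ollama \
    -o jsonpath='{.items[*].status.containerStatuses[*].ready}')"
  case " $states " in
    "  ")        echo "no-pods"; return 1 ;;
    *" false "*) echo "not-ready"; return 1 ;;
    *)           echo "ready" ;;
  esac
}
```

If this keeps printing not-ready, kubectl describe pod -l app=ollama lists probe failures in the Events section.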


📈 Scaling Considerations

Ollama is not stateless in the way a typical web app is — each replica needs its own loaded model in GPU memory. Keep this in mind:

| Approach | When to use |
| --- | --- |
| Single replica, bigger GPU | Most small-to-medium teams; simplest to operate |
| Multiple replicas, one model each | High request volume, willing to pay for more GPUs |
| Load balancer in front of replicas | Once you have 2+ replicas, to distribute requests evenly |

[!TIP] Horizontal Pod Autoscaler (HPA) based on CPU doesn't work well for GPU-bound workloads. If you need autoscaling, scale based on GPU utilization metrics or request queue depth instead.


✅ Quick Reference Cheatsheet

| Task | Command |
| --- | --- |
| Run Ollama in Docker | docker run -d --name ollama -p 11434:11434 -v ollama_data:/root/.ollama ollama/ollama |
| Run with GPU | add --gpus all |
| Run a model inside container | docker exec -it ollama ollama run llama3.2 |
| Start via Compose | docker compose up -d |
| Apply Kubernetes manifests | kubectl apply -f <file>.yaml |
| Check GPU visibility in container | docker exec -it ollama nvidia-smi |
| Check pod status | kubectl get pods |
| View pod logs | kubectl logs -f deployment/ollama |

🎁 Final Tips

  1. Always persist /root/.ollama — losing this means re-downloading every model from scratch.
  2. Test GPU passthrough before deploying — a silent CPU fallback is easy to miss until performance tanks under load.
  3. Start with one replica — Ollama deployments are GPU-memory bound, not request-count bound. Scale only after you measure real bottlenecks.
  4. Put it behind a reverse proxy with auth if it's reachable beyond your cluster's internal network — see the network setup guide for the same principles applied to containers.