Ollama Deployment and Usage Guide

Prerequisites

System Requirements

  • OS: macOS 11+, Windows 10/11, or Linux (Ubuntu 20.04+, Fedora 36+, Debian 11+)
  • Architecture: x86_64, ARM64 (Apple Silicon), or ARMv7
  • Memory: 8GB+ RAM (16GB+ recommended for larger models)
  • Storage: 10GB+ free space for models
  • Optional: NVIDIA GPU (CUDA 11.8+) or AMD GPU (ROCm 5.5+) for acceleration

Build Requirements (Source)

  • Go: 1.22 or later
  • Git: 2.0+
  • C++ Compiler: GCC 11+ or Clang 14+ (for CGO dependencies)
  • CMake: 3.24+ (for llama.cpp backend compilation)

Optional Tools

  • Docker: 20.10+ (for containerized deployment)
  • Python: 3.8+ (for Python SDK usage)
  • Node.js: 18+ (for JavaScript SDK usage)

Installation

Quick Install (Recommended)

macOS/Linux:

curl -fsSL https://ollama.com/install.sh | sh

Windows:

Download and run the installer from https://ollama.com/download/windows

Manual Download:

Prebuilt binaries for each platform are also published on the GitHub releases page: https://github.com/ollama/ollama/releases

Docker Installation

# Pull official image
docker pull ollama/ollama

# Run with persistent storage
docker run -d \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama

Build from Source

# Clone repository
git clone https://github.com/ollama/ollama.git
cd ollama

# Build (requires Go 1.22+ and C++ compiler)
go build -o ollama .

# Or use the build script
make

Client Libraries

Python:

pip install ollama

JavaScript/TypeScript:

npm install ollama

Configuration

Environment Variables

Create a systemd override or export in your shell:

# Server configuration
export OLLAMA_HOST=0.0.0.0:11434        # Bind address (default: 127.0.0.1:11434)
export OLLAMA_MODELS=/path/to/models    # Model storage location
export OLLAMA_ORIGINS="*"                # CORS origins (comma-separated; quote * to avoid shell globbing)
export OLLAMA_KEEP_ALIVE=5m              # Keep models loaded duration
export OLLAMA_NUM_PARALLEL=4             # Parallel request handling
export OLLAMA_MAX_LOADED_MODELS=2        # Max models in memory simultaneously

# GPU Configuration
export CUDA_VISIBLE_DEVICES=0            # Specific NVIDIA GPU
export HIP_VISIBLE_DEVICES=0             # Specific AMD GPU
export OLLAMA_GPU_OVERHEAD=1073741824    # Reserve VRAM for system (value in bytes; 1 GiB shown)
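OLLAMA_KEEP_ALIVE accepts Go-style duration strings ("5m", "24h", or "-1" to keep a model loaded indefinitely). As a rough illustration of how such values map to seconds, here is a simplified parser (a sketch, not Ollama's actual implementation):

```python
import re

def parse_duration(value: str) -> float:
    """Parse a Go-style duration string ("5m", "24h", "90s") into seconds.

    A bare number is treated as seconds; a negative value (e.g. "-1")
    conventionally means "keep loaded indefinitely".
    """
    units = {"ms": 0.001, "s": 1, "m": 60, "h": 3600}
    match = re.fullmatch(r"(-?\d+(?:\.\d+)?)(ms|s|m|h)?", value.strip())
    if not match:
        raise ValueError(f"unrecognized duration: {value!r}")
    number, unit = match.groups()
    return float(number) * units.get(unit or "s", 1)

print(parse_duration("5m"))   # 300.0
print(parse_duration("24h"))  # 86400.0
```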

Systemd Service (Linux)

Create /etc/systemd/system/ollama.service.d/override.conf:

[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_MODELS=/var/lib/ollama/models"

Reload and restart:

sudo systemctl daemon-reload
sudo systemctl restart ollama

Integration Configuration

Configure AI coding assistants to use Ollama:

Claude Code:

ollama launch claude
# Or manually configure ~/.claude/settings.json

Codex:

ollama launch codex

OpenCode/Droid:

ollama launch opencode
ollama launch droid

Database (Desktop App)

The desktop app stores settings in SQLite (macOS/Windows):

  • macOS: ~/Library/Application Support/Ollama/database.sqlite
  • Windows: %LOCALAPPDATA%\Ollama\database.sqlite
  • Schema version: 13 (auto-migrated on startup)
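The schema version can be inspected without launching the app. The sketch below assumes the version is stored in SQLite's standard user_version pragma, which is a common convention but an assumption here, not a confirmed detail of Ollama's database:

```python
import sqlite3

def schema_version(db_path: str) -> int:
    # Open read-only so we don't take a write lock on the app's database
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        return conn.execute("PRAGMA user_version").fetchone()[0]
    finally:
        conn.close()
```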

Build & Run

Development Mode

# Start server with debug logging
OLLAMA_DEBUG=1 ./ollama serve

# In another terminal, run a model
./ollama run gemma3

# Running with no arguments lists the available commands
./ollama

Production Build

# Optimized build
go build -ldflags="-s -w" -o ollama .

# Run as background service
./ollama serve &

GPU Acceleration

NVIDIA:

docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 ollama/ollama

AMD:

docker run -d --device /dev/kfd --device /dev/dri -v ollama:/root/.ollama -p 11434:11434 ollama/ollama

API Usage Examples

REST API:

curl http://localhost:11434/api/chat -d '{
  "model": "gemma3",
  "messages": [{"role": "user", "content": "Hello"}],
  "stream": false
}'

Python:

from ollama import chat
response = chat(model='gemma3', messages=[
  {'role': 'user', 'content': 'Why is the sky blue?'}
])
print(response.message.content)

JavaScript:

import ollama from 'ollama'
const response = await ollama.chat({
  model: 'gemma3',
  messages: [{ role: 'user', content: 'Why is the sky blue?' }]
})
console.log(response.message.content)

Deployment

Docker Compose

version: '3.8'
services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    environment:
      - OLLAMA_KEEP_ALIVE=24h
      - OLLAMA_NUM_PARALLEL=4

  webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama
    volumes:
      - open-webui:/app/backend/data

volumes:
  ollama:
  open-webui:

Kubernetes

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest
        ports:
        - containerPort: 11434
        resources:
          limits:
            nvidia.com/gpu: 1
        volumeMounts:
        - name: ollama-storage
          mountPath: /root/.ollama
      volumes:
      - name: ollama-storage
        persistentVolumeClaim:
          claimName: ollama-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: ollama-service
spec:
  selector:
    app: ollama
  ports:
  - port: 11434
    targetPort: 11434
  type: ClusterIP

Cloud Deployment

AWS (EC2 with GPU):

# Use Deep Learning AMI
docker run -d --gpus all -p 11434:11434 -v ollama:/root/.ollama ollama/ollama

# Configure security group to allow port 11434

Reverse Proxy (Nginx):

location /ollama/ {
    proxy_pass http://localhost:11434/;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_buffering off;
}
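The trailing slash on proxy_pass matters: it makes Nginx replace the matched /ollama/ prefix with / before forwarding, so /ollama/api/chat reaches the upstream as /api/chat. A small sketch of that rewrite behavior:

```python
def rewrite_path(request_path: str, location: str = "/ollama/",
                 upstream_path: str = "/") -> str:
    """Mimic Nginx's prefix replacement when proxy_pass carries a URI:
    the matched location prefix is swapped for the proxy_pass path."""
    if not request_path.startswith(location):
        raise ValueError(f"{request_path!r} does not match location {location!r}")
    return upstream_path + request_path[len(location):]

print(rewrite_path("/ollama/api/chat"))  # /api/chat
```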

Troubleshooting

Port Already in Use

Error: bind: address already in use

# Find process using port 11434
lsof -i :11434

# Kill existing Ollama process
pkill ollama

# Or change port
OLLAMA_HOST=127.0.0.1:11435 ollama serve
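If lsof is unavailable, the same check can be done from Python's standard library (a minimal sketch: a successful TCP connect means something is already listening):

```python
import socket

def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP connection to host:port succeeds,
    i.e. some process is already listening there."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        return sock.connect_ex((host, port)) == 0

if port_in_use(11434):
    print("Ollama (or something else) is already on 11434")
```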

Model Download Failures

Issue: Interrupted downloads or checksum errors

# Remove downloaded model blobs (note: this deletes all local models)
rm -rf ~/.ollama/models/blobs/sha256-*

# Re-pull model
ollama pull gemma3

GPU Not Detected

Symptoms: slow inference, high CPU usage

# Verify NVIDIA drivers
nvidia-smi

# Check Docker GPU runtime
docker run --rm --gpus all ollama/ollama nvidia-smi

# Force CPU mode if needed (hide NVIDIA GPUs from the runtime)
CUDA_VISIBLE_DEVICES=-1 ollama serve

Out of Memory

Error: runtime error: out of memory

# Reduce parallel requests
export OLLAMA_NUM_PARALLEL=1

# Limit context window (in Modelfile)
PARAMETER num_ctx 2048

# Use a smaller model variant
ollama pull gemma3:1b
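As a back-of-envelope check, a model's weight footprint is roughly parameter count times bytes per weight, which is why a smaller variant or a coarser quantization fits where a larger one won't (actual usage adds KV cache and runtime overhead on top):

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GiB: params x (bits/8) bytes each."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1024**3

# 4B parameters at 4-bit quantization vs. 16-bit:
print(round(weight_gb(4, 4), 1))   # 1.9
print(round(weight_gb(4, 16), 1))  # 7.5
```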

Permission Denied (Linux)

Fix:

# Add user to ollama group
sudo usermod -aG ollama $USER

# Or fix permissions
sudo chown -R $USER:$USER ~/.ollama

Integration Connection Issues

Claude/Codex not connecting:

  • Verify Ollama server is running: curl http://localhost:11434/api/tags
  • Check integration config paths in ~/.claude/ or ~/.codex/
  • Ensure OLLAMA_ORIGINS includes the integration's origin
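OLLAMA_ORIGINS takes a comma-separated allow-list, with "*" matching everything. A simplified sketch of that kind of check (illustrative only, not Ollama's actual matching logic):

```python
def origin_allowed(origin: str, allowed: str) -> bool:
    """Check an Origin header against a comma-separated allow-list,
    where "*" permits any origin."""
    patterns = [p.strip() for p in allowed.split(",") if p.strip()]
    return "*" in patterns or origin in patterns

print(origin_allowed("http://localhost:3000", "*"))  # True
```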

Database Lock (Desktop App)

Error: database is locked

# Desktop app uses WAL mode - avoid network drives for database
# Reset if corrupted:
mv ~/Library/Application\ Support/Ollama/database.sqlite ~/Library/Application\ Support/Ollama/database.sqlite.bak

Build Errors

CGO errors:

# macOS
xcode-select --install

# Ubuntu/Debian
sudo apt-get install build-essential cmake

# Fedora
sudo dnf install gcc gcc-c++ cmake