
PyTorch Deployment and Usage Guide

Prerequisites

System Requirements

  • Python: 3.8–3.12 (64-bit)
  • Operating System: Linux (recommended), macOS, Windows
  • Git: For source builds
  • C++ Compiler: GCC 9.4+ or Clang (source builds only)

GPU Acceleration (Optional)

  • NVIDIA: CUDA 11.8 or 12.1 compatible drivers (450.80.02+ for CUDA 11.8, 525.60.13+ for CUDA 12.1)
  • AMD: ROCm 5.7+ (Linux only)
  • Intel: Intel Extension for PyTorch (IPEX) for Intel GPUs
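Whichever accelerator is present, a runtime check keeps code portable across CPU-only and GPU machines. A minimal device-selection sketch (the tensor shape is illustrative):

```python
import torch

# Fall back to CPU when no CUDA device is visible
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Allocate directly on the selected device
x = torch.rand(2, 3, device=device)
print(x.device)
```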

Build Dependencies (Source Installation)

# Ubuntu/Debian
sudo apt-get install python3-dev python3-pip python3-venv git cmake ninja-build

# macOS
brew install cmake ninja pkg-config

# Windows
# Install Visual Studio 2019 or newer with C++ build tools

Installation

Method 1: Pre-built Binaries (Recommended)

Via pip (with CUDA 12.1):

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

Via pip (CPU only):

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
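To confirm which variant was installed, check `torch.version.cuda`, which is `None` on CPU-only wheels and a version string on CUDA builds:

```python
import torch

# None on CPU-only builds; a string like "12.1" on CUDA builds
print(torch.version.cuda)
```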

Via conda:

conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia

NVIDIA Jetson (ARM64):

# Prefer the pre-built wheels shipped with the NVIDIA JetPack SDK; the
# generic nightly index below may not provide Jetson-compatible ARM64 builds
pip install --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/cu121

Method 2: Build from Source

  1. Clone repository (recursive for submodules):
git clone --recursive https://github.com/pytorch/pytorch.git
cd pytorch
# If the clone was not recursive, or to refresh submodules:
git submodule update --init --recursive
  2. Install build dependencies:
pip install -r requirements.txt
  3. Configure build (optional):
export USE_CUDA=1
export USE_CUDNN=1
export USE_MKLDNN=1
export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
  4. Build and install:
python setup.py install
# Or for development (editable install):
python setup.py develop

Method 3: Docker

Pre-built images:

# CUDA 12.1 runtime
docker run --gpus all -it --rm pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime

# Development image
docker run --gpus all -it --rm -v $(pwd):/workspace pytorch/pytorch:2.1.0-cuda12.1-cudnn8-devel

Build custom image:

docker build -t pytorch-custom -f Dockerfile .

Configuration

Environment Variables

GPU Configuration:

# Limit visible GPUs
export CUDA_VISIBLE_DEVICES=0,1

# Memory management
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512

# CUDA deterministic operations (reproducibility)
export CUBLAS_WORKSPACE_CONFIG=:4096:8
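These variables can also be set from Python, as long as it happens before the CUDA caching allocator initializes, i.e. before the first CUDA tensor is created; doing it before `import torch` in the entry script is safest. A stdlib-only sketch (values are illustrative):

```python
import os

# Must be set before the CUDA caching allocator is initialized,
# so place this before `import torch` in your entry script
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
```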

Distributed Training (from torch.distributed.elastic):

# Rendezvous backend configuration
export RDZV_BACKEND=etcd
export RDZV_ENDPOINT=etcd.example.com:2379
export RDZV_ID=job123
export TORCHELASTIC_MAX_RESTARTS=3

Performance Tuning:

# Intel MKL threads
export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=4

# Disable MKL-DNN if needed
export USE_MKLDNN=0
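Thread counts can also be adjusted in-process instead of via the environment; a small sketch (the value 4 is illustrative):

```python
import torch

# In-process equivalent of OMP_NUM_THREADS for intra-op parallelism
torch.set_num_threads(4)
print(torch.get_num_threads())  # 4
```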

Configuration Files

Distributed Rendezvous (dynamic_rendezvous.py): Create rdzv_config.yaml:

backend: etcd
endpoint: localhost:2379
timeout: 300
protocol: http

Build & Run

Local Development Build

Fast development build (no optimizations):

DEBUG=1 USE_DISTRIBUTED=0 USE_MKLDNN=0 USE_CUDA=0 python setup.py develop

Full optimized build:

USE_CUDA=1 USE_CUDNN=1 USE_MKLDNN=1 python setup.py install

Verification

Test installation:

import torch

# Check version and CUDA availability
print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda}")
print(f"Device count: {torch.cuda.device_count()}")

# Test tensor operations
x = torch.rand(5, 3)
if torch.cuda.is_available():
    x = x.cuda()
    print(f"Tensor on {x.device}")

Running Tests

# Install test dependencies
pip install pytest pytest-xdist hypothesis

# Run core tests
python -m pytest test/test_torch.py -v -x

# Run distributed tests (requires multiple GPUs)
python -m pytest test/distributed/test_c10d_common.py -v

# Run specific module tests (e.g., profiler)
python -m pytest test/profiler/test_memory_profiler.py -v

Distributed Training Launch

Single-node multi-GPU (torch.distributed.launch is deprecated in favor of torchrun, but still works):

python -m torch.distributed.launch \
    --nproc_per_node=4 \
    --nnodes=1 \
    --master_port=29500 \
    train.py

Elastic launch (torchrun with an etcd rendezvous backend):

torchrun \
    --nnodes=2:4 \
    --nproc_per_node=8 \
    --rdzv_id=job123 \
    --rdzv_backend=etcd \
    --rdzv_endpoint=etcd.example.com:2379 \
    train.py
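Both launchers expect `train.py` to initialize the process group from the environment variables they set (RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT). A minimal skeleton, as one sketch of what such a script can look like (the gloo fallback is for CPU-only nodes; `main` is an illustrative name):

```python
import os

import torch
import torch.distributed as dist


def main():
    # NCCL for GPU collectives, gloo as a CPU fallback
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend)  # reads the env:// variables set by the launcher
    if torch.cuda.is_available():
        # Bind this process to its assigned GPU
        torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    print(f"rank {dist.get_rank()}/{dist.get_world_size()} initialized")
    # ... build model, wrap in torch.nn.parallel.DistributedDataParallel, train ...
    dist.destroy_process_group()


if __name__ == "__main__" and "RANK" in os.environ:
    # Only run under a launcher; executing this file without one is a no-op
    main()
```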

Deployment

Cloud Platforms

AWS Deep Learning AMI:

# Use pre-configured PyTorch environments
source activate pytorch

Google Cloud AI Platform:

gcloud ai-platform jobs submit training $JOB_NAME \
  --region $REGION \
  --master-image-uri gcr.io/cloud-ml-images/pytorch:latest \
  --scale-tier CUSTOM \
  --master-machine-type n1-standard-8 \
  --master-accelerator count=1,type=nvidia-tesla-t4

Azure Machine Learning:

from azureml.core import Environment
env = Environment.from_pip_requirements(name='pytorch-env', file_path='requirements.txt')
env.docker.enabled = True
env.docker.base_image = 'mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.8-cudnn8-ubuntu22.04'

Container Orchestration

Kubernetes:

apiVersion: v1
kind: Pod
spec:
  containers:
  - name: pytorch
    image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
    resources:
      limits:
        nvidia.com/gpu: 1
    command: ["python", "train.py"]

Docker Compose (multi-node simulation):

version: '3.8'
services:
  pytorch-master:
    image: pytorch/pytorch:latest
    runtime: nvidia
    environment:
      - MASTER_ADDR=pytorch-master
      - MASTER_PORT=29500
      - WORLD_SIZE=2
      - RANK=0
    command: python train.py
  
  pytorch-worker:
    image: pytorch/pytorch:latest
    runtime: nvidia
    environment:
      - MASTER_ADDR=pytorch-master
      - MASTER_PORT=29500
      - WORLD_SIZE=2
      - RANK=1
    command: python train.py

Edge Deployment (NVIDIA Jetson)

# On Jetson device with JetPack 5.1+
sudo apt-get install python3-pip libopenblas-base libopenmpi-dev libomp-dev
# Use NVIDIA's Jetson-specific wheel index if the generic index below lacks ARM64 CUDA builds
pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu121

Model Serving (TorchServe)

# Install TorchServe
pip install torchserve torch-model-archiver

# Archive model
torch-model-archiver --model-name my_model --version 1.0 --model-file model.py --serialized-file model.pth --handler custom_handler.py

# Start server
torchserve --start --model-store model_store --models my_model=my_model.mar

Troubleshooting

Build Issues

Error: CMake Error: Could not find CMAKE_ROOT:

pip install cmake --upgrade
export CMAKE_ROOT=$(python -c "import cmake; print(cmake.__path__[0])")

Error: nvcc fatal : Unsupported gpu architecture 'compute_86':

# Set specific CUDA architectures for your GPU
export TORCH_CUDA_ARCH_LIST="7.0;7.5;8.0"
python setup.py install

Error: undefined symbol: _ZN3c106detail...:

# Clean build required
python setup.py clean
rm -rf build
python setup.py install

Runtime Issues

CUDA Out of Memory:

# Enable memory profiling to diagnose
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             profile_memory=True) as prof:
    ...  # your training or inference code here
print(prof.key_averages().table(sort_by="cuda_memory_usage"))

Solutions:

# In Python: release cached blocks between iterations
torch.cuda.empty_cache()

# Or in the shell, before launching the process:
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
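If the model still does not fit, gradient accumulation trades extra steps for memory by splitting each batch into micro-batches. A CPU-runnable sketch with toy sizes (model, sizes, and the 4-step count are all illustrative):

```python
import torch

model = torch.nn.Linear(16, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
accum_steps = 4  # one optimizer step per 4 micro-batches

opt.zero_grad()
for step in range(accum_steps):
    x, y = torch.randn(8, 16), torch.randn(8, 1)  # one micro-batch
    loss = torch.nn.functional.mse_loss(model(x), y)
    # Scale so accumulated gradients match a single full-batch pass
    (loss / accum_steps).backward()
opt.step()
opt.zero_grad()
```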

Distributed Rendezvous Timeout:

# Increase timeouts on the rendezvous handler
from datetime import timedelta

from torch.distributed.elastic.rendezvous import RendezvousTimeout

timeout = RendezvousTimeout(
    join=timedelta(minutes=10),
    last_call=timedelta(minutes=5),
    close=timedelta(minutes=5),
    keep_alive=timedelta(seconds=30),
)

Intel GPU Not Detected:

# Verify IPEX installation
pip install intel-extension-for-pytorch
python -c "import torch; import intel_extension_for_pytorch as ipex; print(torch.xpu.is_available())"

Performance Issues

Slow DataLoader:

# Set number of workers and pin memory
from torch.utils.data import DataLoader

loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=4,            # parallel loading processes
    pin_memory=True,          # faster host-to-GPU transfers
    persistent_workers=True,  # keep workers alive between epochs
)

MKL-DNN Warnings:

export KMP_AFFINITY=granularity=fine,compact,1,0
export KMP_BLOCKTIME=1

Debugging

Enable detailed logging:

import logging
logging.basicConfig(level=logging.DEBUG)
torch._logging.set_logs(all=logging.DEBUG)

Guard debugging (from _guards.py):

# Check compiled guards
torch._dynamo.config.verbose = True
torch._logging.set_logs(guards=True)