PyTorch Deployment and Usage Guide
Prerequisites
System Requirements
- Python: 3.8–3.12 (64-bit)
- Operating System: Linux (recommended), macOS, Windows
- Git: For source builds
- C++ Compiler: GCC 9.4+ or Clang (source builds only)
GPU Acceleration (Optional)
- NVIDIA: CUDA 11.8 or 12.1 compatible drivers (450.80.02+ for CUDA 11.8, 525.60.13+ for CUDA 12.1)
- AMD: ROCm 5.7+ (Linux only)
- Intel: Intel Extension for PyTorch (IPEX) for Intel GPUs
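Whichever accelerator is present, device selection can stay portable at runtime. A minimal sketch (the helper name `pick_device` is illustrative; `cuda` covers both NVIDIA CUDA and AMD ROCm builds of PyTorch):

```python
import torch

def pick_device() -> torch.device:
    """Return the best available device: CUDA/ROCm, Apple MPS, else CPU."""
    if torch.cuda.is_available():  # true for both CUDA and ROCm builds
        return torch.device("cuda")
    if getattr(torch.backends, "mps", None) and torch.backends.mps.is_available():
        return torch.device("mps")  # Apple Silicon GPU
    return torch.device("cpu")

device = pick_device()
x = torch.ones(2, 2, device=device)
print(x.device)
```

Code written against this helper runs unchanged on CPU-only machines and on any of the GPU backends above.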
Build Dependencies (Source Installation)
# Ubuntu/Debian
sudo apt-get install python3-dev python3-pip python3-venv git cmake ninja-build
# macOS
brew install cmake ninja pkg-config
# Windows
# Install Visual Studio 2019 or newer with C++ build tools
Installation
Method 1: Pre-built Binaries (Recommended)
Via pip (with CUDA 12.1):
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
Via pip (CPU only):
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
Via conda:
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
NVIDIA Jetson (ARM64):
# Prefer the pre-built wheels shipped with the NVIDIA JetPack SDK for your JetPack release
pip install --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/cu121
Method 2: Build from Source
- Clone repository (recursive for submodules):
git clone --recursive https://github.com/pytorch/pytorch.git
cd pytorch
git submodule update --init --recursive
- Install build dependencies:
pip install -r requirements.txt
# Or, for a one-step editable (development) install that also builds PyTorch:
pip install -e .
- Configure build (optional):
export USE_CUDA=1
export USE_CUDNN=1
export USE_MKLDNN=1
export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
- Build and install:
python setup.py install
# Or for development (editable install):
python setup.py develop
Method 3: Docker
Pre-built images:
# CUDA 12.1 runtime
docker run --gpus all -it --rm pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
# Development image
docker run --gpus all -it --rm -v $(pwd):/workspace pytorch/pytorch:2.1.0-cuda12.1-cudnn8-devel
Build custom image:
docker build -t pytorch-custom -f Dockerfile .
Configuration
Environment Variables
GPU Configuration:
# Limit visible GPUs
export CUDA_VISIBLE_DEVICES=0,1
# Memory management
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
# CUDA deterministic operations (reproducibility)
export CUBLAS_WORKSPACE_CONFIG=:4096:8
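The CUBLAS_WORKSPACE_CONFIG setting pairs with PyTorch's deterministic mode. A sketch of seeding a reproducible run from Python (the helper name `seed_everything` is illustrative):

```python
import os
import random
import torch

# Must be set before any CUDA kernel runs for deterministic cuBLAS behavior
os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")

def seed_everything(seed: int = 42) -> None:
    random.seed(seed)
    torch.manual_seed(seed)                   # seeds CPU and all CUDA generators
    torch.use_deterministic_algorithms(True)  # raise on nondeterministic ops

seed_everything()
print(torch.rand(3))  # same values on every run with the same seed
```

With `use_deterministic_algorithms(True)`, operations that have no deterministic implementation raise an error instead of silently producing run-to-run differences.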
Distributed Training (from torch.distributed.elastic):
# Rendezvous backend configuration
export RDZV_BACKEND=etcd
export RDZV_ENDPOINT=etcd.example.com:2379
export RDZV_ID=job123
export TORCHELASTIC_MAX_RESTARTS=3
Performance Tuning:
# Intel MKL threads
export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=4
# Disable MKL-DNN if needed
export USE_MKLDNN=0
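The thread counts above can also be set and inspected from inside Python; a quick sketch:

```python
import torch

torch.set_num_threads(4)  # intra-op (OpenMP/MKL) thread pool size
print(torch.get_num_threads())           # current intra-op thread count
print(torch.__config__.parallel_info())  # OMP/MKL settings in effect
```

Setting `OMP_NUM_THREADS` before the process starts and calling `torch.set_num_threads` are alternative routes to the same knob; the environment variable is preferable when launching many workers.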
Configuration Files
Distributed Rendezvous (dynamic_rendezvous.py):
Create rdzv_config.yaml:
backend: etcd
endpoint: localhost:2379
timeout: 300
protocol: http
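This flat key/value file can be turned into torchrun arguments without any extra dependencies. A hypothetical helper (the file layout and keys mirror the example above; only `backend` and `endpoint` map directly onto torchrun flags):

```python
def load_rdzv_config(path):
    """Parse a flat 'key: value' file like rdzv_config.yaml above."""
    config = {}
    with open(path) as f:
        for line in f:
            line = line.split("#", 1)[0].strip()  # drop comments and blanks
            if not line:
                continue
            key, _, value = line.partition(":")   # split on the FIRST colon only,
            config[key.strip()] = value.strip()   # so host:port values survive
    return config

def to_torchrun_args(config):
    """Map parsed keys onto the corresponding torchrun flags."""
    return [
        f"--rdzv_backend={config['backend']}",
        f"--rdzv_endpoint={config['endpoint']}",
    ]
```

Splitting on the first colon only is what keeps `endpoint: localhost:2379` intact.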
Build & Run
Local Development Build
Fast development build (no optimizations):
DEBUG=1 USE_DISTRIBUTED=0 USE_MKLDNN=0 USE_CUDA=0 python setup.py develop
Full optimized build:
USE_CUDA=1 USE_CUDNN=1 USE_MKLDNN=1 python setup.py install
Verification
Test installation:
import torch
# Check version and CUDA availability
print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda}")
print(f"Device count: {torch.cuda.device_count()}")
# Test tensor operations
x = torch.rand(5, 3)
if torch.cuda.is_available():
    x = x.cuda()
print(f"Tensor on {x.device}")
Running Tests
# Install test dependencies
pip install pytest pytest-xdist hypothesis
# Run core tests
python -m pytest test/test_torch.py -v -x
# Run distributed tests (requires multiple GPUs)
python -m pytest test/distributed/test_c10d_common.py -v
# Run specific module tests (e.g., profiler)
python -m pytest test/profiler/test_memory_profiler.py -v
Distributed Training Launch
Single-node multi-GPU (torch.distributed.launch is deprecated; torchrun is the modern entry point):
torchrun \
    --nproc_per_node=4 \
    --nnodes=1 \
    --master_port=29500 \
    train.py
Elastic launch (with DynamicRendezvousHandler):
torchrun \
--nnodes=2:4 \
--nproc_per_node=8 \
--rdzv_id=job123 \
--rdzv_backend=etcd \
--rdzv_endpoint=etcd.example.com:2379 \
train.py
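torchrun supplies RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT to every worker. A minimal train.py skeleton that consumes them; gloo is shown so the sketch also runs on CPU (use nccl for GPU training), and the env-var defaults are only there so the file can be smoke-tested standalone:

```python
import os
import torch
import torch.distributed as dist

def main():
    # Defaults let the script run standalone; torchrun overrides all of these
    os.environ.setdefault("RANK", "0")
    os.environ.setdefault("WORLD_SIZE", "1")
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")

    dist.init_process_group(backend="gloo")  # "nccl" for multi-GPU training
    rank, world = dist.get_rank(), dist.get_world_size()

    t = torch.ones(1) * (rank + 1)
    dist.all_reduce(t)  # in-place sum across all workers
    print(f"rank {rank}/{world}: all_reduce -> {t.item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Under the elastic launch above, the same script runs on every node; only the env vars torchrun injects differ per worker.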
Deployment
Cloud Platforms
AWS Deep Learning AMI:
# Use pre-configured PyTorch environments
source activate pytorch
Google Cloud AI Platform:
gcloud ai-platform jobs submit training $JOB_NAME \
--region $REGION \
--master-image-uri gcr.io/cloud-ml-images/pytorch:latest \
--scale-tier CUSTOM \
--master-machine-type n1-standard-8 \
--master-accelerator count=1,type=nvidia-tesla-t4
Azure Machine Learning:
from azureml.core import Environment
env = Environment.from_pip_requirements(name='pytorch-env', file_path='requirements.txt')
env.docker.enabled = True
env.docker.base_image = 'mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.8-cudnn8-ubuntu22.04'
Container Orchestration
Kubernetes:
apiVersion: v1
kind: Pod
metadata:
  name: pytorch-train
spec:
  containers:
  - name: pytorch
    image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
    command: ["python", "train.py"]
    resources:
      limits:
        nvidia.com/gpu: 1
Docker Compose (multi-node simulation):
version: '3.8'
services:
  pytorch-master:
    image: pytorch/pytorch:latest
    runtime: nvidia
    environment:
      - MASTER_ADDR=pytorch-master
      - MASTER_PORT=29500
      - WORLD_SIZE=2
      - RANK=0
    command: python train.py
  pytorch-worker:
    image: pytorch/pytorch:latest
    runtime: nvidia
    environment:
      - MASTER_ADDR=pytorch-master
      - MASTER_PORT=29500
      - WORLD_SIZE=2
      - RANK=1
    command: python train.py
Edge Deployment (NVIDIA Jetson)
# On Jetson device with JetPack 5.1+
sudo apt-get install python3-pip libopenblas-base libopenmpi-dev libomp-dev
pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu121
Model Serving (TorchServe)
# Install TorchServe
pip install torchserve torch-model-archiver
# Archive model
torch-model-archiver --model-name my_model --version 1.0 --model-file model.py --serialized-file model.pth --handler custom_handler.py
# Start server
torchserve --start --model-store model_store --models my_model=my_model.mar
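The custom_handler.py passed to torch-model-archiver must follow TorchServe's handler contract. A simplified sketch using the module-level `handle(data, context)` entry point; the inline linear model is a stand-in for the real model.pth, which a production handler would load from the archive's model directory via `context.system_properties`:

```python
import torch

_model = None

def _load_model():
    # Stand-in: a real handler would torch.jit.load() or load_state_dict()
    # the serialized model from the model archive's model_dir.
    model = torch.nn.Linear(4, 2)
    model.eval()
    return model

def handle(data, context):
    """TorchServe entry point: list of request dicts in, list of results out."""
    global _model
    if _model is None:
        _model = _load_model()
    if data is None:  # TorchServe calls handle(None, context) at load time
        return None
    results = []
    for row in data:
        payload = row.get("body") or row.get("data")  # request payload
        x = torch.tensor(payload, dtype=torch.float32)
        with torch.no_grad():
            y = _model(x)
        results.append(y.tolist())
    return results
```

The returned list must contain one entry per request in the batch; TorchServe serializes each entry back to the corresponding client.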
Troubleshooting
Build Issues
Error: CMake Error: Could not find CMAKE_ROOT:
pip install cmake --upgrade
export CMAKE_ROOT=$(python -c "import cmake; print(cmake.__path__[0])")
Error: nvcc fatal : Unsupported gpu architecture 'compute_86':
# Set specific CUDA architectures for your GPU
export TORCH_CUDA_ARCH_LIST="7.0;7.5;8.0"
python setup.py install
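If an existing PyTorch install can already see the GPU, the right architecture string can be read off rather than guessed. A small hypothetical helper (`detected_arch_list` is not a torch API):

```python
import torch

def detected_arch_list():
    """Return a TORCH_CUDA_ARCH_LIST value for visible GPUs, or None without CUDA."""
    if not torch.cuda.is_available():
        return None
    caps = {torch.cuda.get_device_capability(i)
            for i in range(torch.cuda.device_count())}
    return ";".join(sorted(f"{major}.{minor}" for major, minor in caps))

print(detected_arch_list())  # e.g. "8.6" on an RTX 30-series; export before building
```

Limiting TORCH_CUDA_ARCH_LIST to the architectures you actually have also shortens compile time considerably.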
Error: undefined symbol: _ZN3c106detail...:
# Clean build required
python setup.py clean
rm -rf build
python setup.py install
Runtime Issues
CUDA Out of Memory:
# Enable memory profiling to diagnose
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             profile_memory=True) as prof:
    ...  # run the workload under investigation here
print(prof.key_averages().table(sort_by="cuda_memory_usage"))
Solution:
# Empty cache between iterations
torch.cuda.empty_cache()
# Or set environment variable
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
Distributed Rendezvous Timeout:
# Increase timeouts when constructing the rendezvous handler
from datetime import timedelta
from torch.distributed.elastic.rendezvous.dynamic_rendezvous import RendezvousTimeout

timeout = RendezvousTimeout(
    join=timedelta(minutes=10),
    last_call=timedelta(minutes=5),
    close=timedelta(minutes=5),
    heartbeat=timedelta(seconds=30),
)
Intel GPU Not Detected:
# Verify IPEX installation
pip install intel-extension-for-pytorch
python -c "import torch, intel_extension_for_pytorch as ipex; print(torch.xpu.is_available())"
Performance Issues
Slow DataLoader:
# Use worker processes and pinned memory
from torch.utils.data import DataLoader

loader = DataLoader(
    dataset,                  # your torch.utils.data.Dataset
    batch_size=32,
    num_workers=4,            # background processes that prefetch batches
    pin_memory=True,          # page-locked memory speeds host-to-GPU copies
    persistent_workers=True,  # keep workers alive across epochs
)
MKL-DNN Warnings:
export KMP_AFFINITY=granularity=fine,compact,1,0
export KMP_BLOCKTIME=1
Debugging
Enable detailed logging:
import logging
logging.basicConfig(level=logging.DEBUG)
torch._logging.set_logs(all=logging.DEBUG)
Guard debugging (from _guards.py):
# Check compiled guards
torch._dynamo.config.verbose = True
torch._logging.set_logs(guards=True)