
How to Deploy & Use pandas-dev/pandas

1. Prerequisites

Runtime Requirements

  • Python: Version 3.9 or higher (64-bit recommended)
  • Operating System: Linux, macOS, or Windows
  • Memory: Minimum 4GB RAM (8GB+ recommended for large datasets)

Build Requirements (For Source Installation)

  • C Compiler:
    • Linux: GCC 9+ or Clang 10+
    • macOS: Xcode Command Line Tools
    • Windows: Microsoft Visual C++ 14.0 or greater
  • Git: For cloning the repository
  • Python Development Headers: Usually included with Python installation

Optional Dependencies

  • Numba: For JIT compilation in window operations (engine='numba')
  • PyArrow: For Parquet/Feather I/O and Arrow-backed dtypes
  • fsspec: For cloud storage access (S3, GCS, Azure)
  • SQLAlchemy: For database connectivity
  • openpyxl/xlrd: For Excel file support
  • matplotlib: For plotting integration
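
A quick way to see which of these optional dependencies are present in an environment is to probe for them with the standard library. The helper below is illustrative, not part of pandas:

```python
import importlib.util

# Illustrative list matching the optional dependencies above
OPTIONAL_DEPS = ["numba", "pyarrow", "fsspec", "sqlalchemy", "openpyxl", "matplotlib"]

def is_available(name):
    """True if the module can be located without importing it."""
    return importlib.util.find_spec(name) is not None

def report(deps=OPTIONAL_DEPS):
    """Map each dependency name to an installed/missing flag."""
    return {name: is_available(name) for name in deps}

if __name__ == "__main__":
    for name, ok in sorted(report().items()):
        print(f"{name:12s} {'installed' if ok else 'missing'}")
```

Using find_spec avoids importing heavy packages just to check for their presence.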

2. Installation

Standard Installation (Users)

Using pip:

pip install pandas

Using conda (Recommended for data science):

conda install -c conda-forge pandas

With optional dependencies:

# Full-featured installation (extras names follow pandas' pyproject.toml)
pip install "pandas[performance,computation,fss,aws,gcp,excel,parquet]"

# Specific use cases
pip install pandas pyarrow numba fsspec s3fs openpyxl

Development Installation (From Source)

  1. Clone the repository:
git clone https://github.com/pandas-dev/pandas.git
cd pandas
  2. Create an isolated environment:
python -m venv venv
source venv/bin/activate  # Linux/macOS
# or
venv\Scripts\activate  # Windows
  3. Install build dependencies:
pip install meson-python numpy Cython versioneer[toml]
  4. Build and install in editable mode:
pip install -e . --no-build-isolation -v
  5. Verify the installation:
python -c "import pandas; print(pandas.__version__)"
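
Beyond the version check in the final step, pandas ships a diagnostic that prints the versions of every detected dependency, which is useful when filing build issues:

```python
import pandas as pd

# Full environment report: pandas version, Python, OS, and all
# optional-dependency versions (or None for missing dependencies)
print(pd.__version__)
pd.show_versions()
```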

3. Configuration

Environment Variables

# Note: pandas has no environment variable for display options;
# set those at runtime with pd.set_option (see Runtime Configuration below)

# Numba cache directory (for window operations)
export NUMBA_CACHE_DIR="/tmp/numba_cache"

# Threading control
export OMP_NUM_THREADS=4
export OPENBLAS_NUM_THREADS=4

Runtime Configuration

Configure behavior programmatically:

import pandas as pd

# Display options
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 50)
pd.set_option('mode.copy_on_write', True)  # Recommended for pandas 2.0+

# Performance options
pd.set_option('compute.use_numba', True)  # Use numba for eligible operations
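
For settings that should apply only within a limited scope, pd.option_context restores the previous values on exit. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({"x": range(1000)})

# Options set inside the context revert automatically on exit
with pd.option_context("display.max_rows", 6):
    print(df)  # repr truncated to 6 rows
print(pd.get_option("display.max_rows"))  # previous value restored
```

This is preferable to set_option in library code, where leaking a global option change could surprise callers.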

IO Configuration

For cloud storage and advanced IO, configure fsspec:

import pandas as pd

# S3 access
df = pd.read_csv('s3://bucket/file.csv', storage_options={
    'key': 'YOUR_KEY',
    'secret': 'YOUR_SECRET'
})

# Or use environment variables for credentials
# AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY

4. Build & Run

Building Extensions

If installing from source with custom optimizations:

# Set compiler flags for optimization
export CFLAGS="-O3 -march=native"
pip install -e . --no-build-isolation -v

Running Tests

# Install test dependencies
pip install pytest pytest-xdist hypothesis

# Run full test suite
pytest pandas/tests -x -v --no-header

# Run specific module tests (e.g. the CSV parser tests)
pytest pandas/tests/io/parser -v

# Parallel testing (recommended)
pytest pandas/tests -n auto --dist=loadfile

Basic Usage Verification

Create a test script verify_pandas.py:

import pandas as pd
import numpy as np

# Test basic functionality
df = pd.DataFrame({
    'A': range(5),
    'B': np.random.randn(5),
    'C': pd.date_range('2024-01-01', periods=5)
})

print("DataFrame creation: OK")
print(df)

# Test IO (Parquet requires pyarrow or fastparquet)
df.to_parquet('/tmp/test.parquet')
df_read = pd.read_parquet('/tmp/test.parquet')
assert df.equals(df_read)
print("Parquet IO: OK")

# Test window operations (uses C extensions)
df['expanding_mean'] = df['B'].expanding().mean()
print("Window operations: OK")

# Test groupby (uses Cython extensions); restrict to numeric columns,
# since summing the datetime column would raise
result = df.groupby('A')[['B']].sum()
print("Groupby operations: OK")

print("\nAll systems operational!")

Run with:

python verify_pandas.py

5. Deployment

Container Deployment (Docker)

Dockerfile for data processing application:

FROM python:3.11-slim

# Install system dependencies for pandas and numpy
RUN apt-get update && apt-get install -y \
    gcc \
    g++ \
    libopenblas-dev \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Install pandas with performance optimizations
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Application code
COPY . .

CMD ["python", "app.py"]

requirements.txt:

pandas>=2.0.0
pyarrow>=12.0.0
numba>=0.57.0
fsspec>=2023.0.0

Cloud Function Deployment

AWS Lambda Layer:

# Create layer package
mkdir -p python/lib/python3.11/site-packages
pip install pandas -t python/lib/python3.11/site-packages
zip -r pandas_layer.zip python

Google Cloud Functions:

# gcloud reads dependencies from requirements.txt in the deployed source directory
gcloud functions deploy process_data \
    --runtime python311 \
    --trigger-http \
    --memory 2048MB \
    --source .

Production Environment Setup

Systemd service for data pipeline:

[Unit]
Description=Pandas Data Pipeline
After=network.target

[Service]
Type=simple
User=datauser
WorkingDirectory=/opt/pipeline
Environment="PYTHONUNBUFFERED=1"
Environment="OMP_NUM_THREADS=4"
ExecStart=/opt/pipeline/venv/bin/python pipeline.py
Restart=always

[Install]
WantedBy=multi-user.target

Freezing Applications

PyInstaller hook for a standalone executable:

# hook-pandas.py
from PyInstaller.utils.hooks import collect_data_files, collect_submodules

hiddenimports = collect_submodules('pandas')
datas = collect_data_files('pandas')

Build command:

pyinstaller --hidden-import pandas._libs.tslibs.np_datetime \
            --hidden-import pandas._libs.tslibs.nattype \
            --collect-data pandas \
            app.py

6. Troubleshooting

Installation Issues

Problem: ImportError: DLL load failed (Windows)

  • Solution: Install Visual C++ Redistributable 2015-2022
  • Alternative: Use conda instead of pip

Problem: error: Microsoft Visual C++ 14.0 is required (Windows)

  • Solution: Install Build Tools for Visual Studio 2019 or 2022 with "Desktop development with C++" workload

Problem: No module named 'pandas._libs' after installation

  • Solution: Installation was interrupted. Clean and reinstall:
pip uninstall pandas -y
pip cache purge
pip install pandas --force-reinstall --no-cache-dir

Build From Source Failures

Problem: meson-python build errors

  • Solution: Ensure build dependencies are installed:
pip install meson-python numpy Cython versioneer[toml]
pip install -e . --no-build-isolation

Problem: Compilation errors with C extensions

  • Solution: Check compiler version compatibility:
gcc --version  # Should be 9+
# Update pip/setuptools
pip install --upgrade pip setuptools wheel

Runtime Performance Issues

Problem: Slow I/O operations

  • Solution: Install PyArrow backend:
# Use PyArrow engine for CSV
pd.read_csv('large_file.csv', engine='pyarrow')

# Use PyArrow dtypes
pd.read_csv('file.csv', dtype_backend='pyarrow')

Problem: Memory errors with large DataFrames

  • Solution: Enable copy-on-write and use chunked processing:
pd.set_option('mode.copy_on_write', True)

# Chunked reading
chunks = pd.read_csv('huge_file.csv', chunksize=100000)
for chunk in chunks:
    process(chunk)
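
The chunked pattern above can be sketched end to end. The inline CSV stands in for a large on-disk file, and the column names are illustrative:

```python
import io
import pandas as pd

# Stand-in for a large file; in practice pass a path to read_csv
csv_data = "key,value\n" + "\n".join(f"{i % 3},{i}" for i in range(10))

total = 0
rows = 0
# Each chunk is an ordinary DataFrame of at most `chunksize` rows
for chunk in pd.read_csv(io.StringIO(csv_data), chunksize=4):
    total += int(chunk["value"].sum())
    rows += len(chunk)

print(rows, total)  # 10 45
```

Only one chunk is resident in memory at a time, so peak usage is bounded by chunksize rather than file size.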

Problem: Numba engine not available

  • Solution:
pip install numba

Then verify:

df.rolling(window=10).sum(engine='numba')

Version Conflicts

Problem: AttributeError: module 'pandas' has no attribute '...'

  • Solution: Check for namespace conflicts:
pip list | grep pandas
# Remove any packages named 'pandas' in local directory
# Ensure no file named pandas.py exists in working directory
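
A direct way to confirm which module is actually being imported; a stray pandas.py in the working directory will show up here instead of the site-packages install:

```python
import pandas

# Should point into site-packages, not the current working directory
print(pandas.__file__)
```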

Problem: Numpy compatibility warnings

  • Solution: Ensure compatible versions:
pip install --upgrade pandas numpy

Debugging Import Errors

Enable verbose import to trace C extension loading:

python -v -c "import pandas" 2>&1 | grep -i error

Check compiled extensions exist:

import pandas._libs.lib as lib
print(lib.__file__)  # Should point to .so or .pyd file