Apache Cloudberry Deployment and Usage Guide
1. Prerequisites
System Requirements
- Operating System: Linux (RHEL/Rocky Linux 8+, Ubuntu 20.04+) or macOS
- Build Tools: C compiler (GCC), make, autoconf, automake, libtool
- Dependencies:
- PostgreSQL development libraries
- Readline development libraries
- OpenSSL development libraries
- Python 3.x (for some build scripts)
- Memory: Minimum 4GB RAM (8GB+ recommended for development)
- Disk Space: At least 10GB free space for building from source
Required Software
- Git for cloning the repository
- Docker (optional, for sandbox environment)
- CMake (for some build configurations)
- Development headers for system libraries
2. Installation
Clone the Repository
git clone https://github.com/apache/cloudberry.git
cd cloudberry
Install Dependencies
For RHEL/Rocky Linux 8+:
sudo yum groupinstall "Development Tools"
sudo yum install readline-devel openssl-devel python3-devel
sudo yum install postgresql-devel
For Ubuntu 20.04+:
sudo apt-get update
sudo apt-get install build-essential
sudo apt-get install libreadline-dev libssl-dev python3-dev
sudo apt-get install libpq-dev
For macOS:
brew install readline openssl python3
brew install postgresql@15
3. Configuration
Environment Variables
Set these environment variables before building:
# For macOS with Homebrew PostgreSQL (Intel path shown; on Apple Silicon,
# Homebrew installs under /opt/homebrew instead of /usr/local)
export PG_CONFIG=/usr/local/opt/postgresql@15/bin/pg_config
# For Linux systems
export PG_CONFIG=/usr/bin/pg_config
# Optional: Set installation prefix
export PREFIX=/usr/local/cloudberry
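The platform-specific exports above can be folded into one snippet. This is a sketch, assuming the typical default paths shown earlier; the helper name `detect_pg_config` is my own, and you should verify the path on your machine:

```shell
# Pick a likely pg_config location for the current platform.
# These are common defaults, not guaranteed install paths.
detect_pg_config() {
  case "$(uname -s)" in
    Darwin) echo "/usr/local/opt/postgresql@15/bin/pg_config" ;;  # Homebrew (Intel)
    *)      echo "/usr/bin/pg_config" ;;                          # typical Linux package path
  esac
}

export PG_CONFIG="$(detect_pg_config)"
export PREFIX=/usr/local/cloudberry
echo "Using PG_CONFIG=$PG_CONFIG"
```

If the chosen path does not exist, locate the real binary with `which pg_config` and export that instead.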
Configuration Files
Cloudberry uses several configuration files:
- configure script options: customize build parameters
- postgresql.conf: database configuration (generated during installation)
- pg_hba.conf: client authentication configuration
- Environment-specific configurations in devops/sandbox/ for Docker deployments
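To make the role of postgresql.conf concrete, here is a minimal override sketch for a local development instance. The values are illustrative assumptions, not Cloudberry defaults:

```ini
# postgresql.conf (development sketch; tune for your hardware)
listen_addresses = 'localhost'   # accept only local connections
port = 5432                      # default PostgreSQL port
shared_buffers = 128MB           # conservative starting point for dev
log_min_messages = warning       # keep the log quiet during development
```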
4. Build & Run
Build from Source
Step 1: Generate configure script
./configure --prefix=$PREFIX
Step 2: Compile the database
make -j$(nproc)
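Note that `nproc` exists only on Linux; macOS uses `sysctl`. A portable sketch for picking the parallel job count (the helper name `ncpu` is my own):

```shell
# Portable CPU-count helper: prefer nproc (Linux), fall back to
# sysctl (macOS/BSD), and default to 2 if neither is available.
ncpu() {
  if command -v nproc >/dev/null 2>&1; then
    nproc
  else
    sysctl -n hw.ncpu 2>/dev/null || echo 2
  fi
}

echo "Building with $(ncpu) parallel jobs"
# make -j"$(ncpu)"
```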
Step 3: Install Cloudberry
make install
Step 4: Initialize the database cluster
# Create data directory
mkdir -p $PREFIX/data
# Initialize database
initdb -D $PREFIX/data --locale=en_US.UTF-8
Quick Start with Docker Sandbox
For rapid testing and evaluation:
cd devops/sandbox
# Build and start the sandbox environment
./sandbox.sh start
The sandbox provides a pre-configured Cloudberry instance with sample data and tools.
Run Locally (Development)
Start the database server:
# Start Cloudberry in the foreground
pg_ctl -D $PREFIX/data -l logfile start
# Or run in background
pg_ctl -D $PREFIX/data start
Connect to Cloudberry:
# Connect using psql
psql -h localhost -p 5432 -U gpadmin postgres
# Default credentials in sandbox:
# Username: gpadmin
# Password: changeme
Production Build Considerations
Note: --enable-debug and --enable-cassert turn on debugging symbols and internal assertion checks, which noticeably slow the server. They are appropriate for development builds but should be omitted in production. A production configure invocation typically looks like:
./configure \
--prefix=/opt/cloudberry \
--enable-depend \
--with-perl \
--with-python \
--with-openssl \
--with-libxml \
--with-uuid=e2fs
5. Deployment
Deployment Platforms
1. Bare Metal/Virtual Machines
- Suitable for maximum performance and control
- Requires manual cluster setup for MPP architecture
- Use configuration management tools (Ansible, Puppet) for multi-node deployment
2. Docker Containers
- Use the provided Dockerfiles in devops/sandbox/
- Orchestrate with Docker Compose for single-node deployments
- For production multi-node: Consider Kubernetes with StatefulSets
3. Kubernetes (Production Recommended)
- Use Helm charts (check ecosystem repositories)
- Deploy with persistent volumes for data storage
- Configure pod anti-affinity for high availability
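The anti-affinity point above can be sketched in manifest form. This is an illustrative fragment, not a shipped chart; the `app: cloudberry-segment` label is a hypothetical name you would replace with your own:

```yaml
# Sketch: spread segment pods across nodes so one node failure
# cannot take down multiple segments. Embed under the pod template
# spec of a StatefulSet.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: cloudberry-segment   # hypothetical pod label
        topologyKey: kubernetes.io/hostname
```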
4. Cloud Providers
- AWS: Deploy on EC2 instances with EBS volumes
- Azure: Use Azure VMs with managed disks
- GCP: Compute Engine with persistent disks
- Consider managed Kubernetes services (EKS, AKS, GKE) for containerized deployments
Multi-Node MPP Deployment
For true MPP capabilities, deploy across multiple nodes:
# On each segment host (example for 2 segments):
gpseginstall -f hostfile_gpseg
# Initialize cluster
gpinitsystem -c gpconfigs/gpinitsystem_config
# Start cluster
gpstart -a
Create a hostfile (hostfile_gpseg) with all segment hosts:
segment1-host
segment2-host
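For the gpinitsystem step, a minimal configuration file for the two-host layout above might look like the sketch below. The key names follow Greenplum-style conventions that Cloudberry inherits, and every value here is an illustrative assumption; check the template shipped with your build before using it:

```shell
# gpinitsystem_config sketch (illustrative values only)
ARRAY_NAME="Cloudberry Cluster"
SEG_PREFIX=gpseg
PORT_BASE=6000
# Two primary segment directories per host
declare -a DATA_DIRECTORY=(/data/primary /data/primary)
COORDINATOR_HOSTNAME=coordinator-host
COORDINATOR_DIRECTORY=/data/coordinator
COORDINATOR_PORT=5432
TRUSTED_SHELL=ssh
ENCODING=UNICODE
MACHINE_LIST_FILE=hostfile_gpseg
```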
Backup and Recovery
Use the Cloudberry Backup utility from the ecosystem repository:
# Clone backup utility
git clone https://github.com/apache/cloudberry-backup.git
# Configure and run backups
cloudberry-backup --config backup_config.yaml
6. Troubleshooting
Common Build Issues
Issue: configure: error: readline library not found
Solution:
# On Ubuntu/Debian
sudo apt-get install libreadline-dev
# On RHEL/CentOS
sudo yum install readline-devel
Issue: PostgreSQL development headers missing
Solution:
# Ensure PostgreSQL is installed with development packages
# Verify pg_config is in PATH
which pg_config
Issue: Memory exhaustion during compilation
Solution:
# Reduce parallel jobs
make -j2 # Instead of make -j$(nproc)
# Or increase swap space
sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
Runtime Issues
Issue: Cannot start database - port already in use
Solution:
# Check for existing PostgreSQL instances
netstat -tlnp | grep 5432
# Change port in postgresql.conf or stop existing instance
Issue: Authentication failures
Solution: Modify pg_hba.conf in data directory:
# Add line for local connections
host all all 127.0.0.1/32 md5
Issue: Performance problems in MPP setup
Solution:
- Check segment synchronization: gpstate -s
- Verify network connectivity between nodes
- Review postgresql.conf settings for memory allocation
- Check disk I/O performance on segment hosts
Getting Help
- Check Logs: Examine log files in $PREFIX/data/pg_log/
- Community Support:
- Join Slack: https://inviter.co/apache-cloudberry
- GitHub Discussions: https://github.com/apache/cloudberry/discussions
- Q&A Forum: https://github.com/apache/cloudberry/discussions/categories/q-a
- Documentation: https://cloudberry.apache.org/docs/
- Report Bugs: https://github.com/apache/cloudberry/issues
Performance Tuning Tips
- Adjust shared_buffers based on available RAM
- Configure work_mem for complex queries
- Set appropriate maintenance_work_mem for vacuum operations
- Enable query planner statistics with track_counts = on
- Consider partitioning large tables for better parallel processing
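As a concrete starting point, the knobs above might be set like this in postgresql.conf. The values are illustrative assumptions for a host with roughly 32GB of RAM, not recommendations from the Cloudberry project; benchmark with your own workload before adopting them:

```ini
# Illustrative tuning sketch for a ~32GB host
shared_buffers = 8GB         # often sized around 25% of RAM
work_mem = 64MB              # per sort/hash operation, per query
maintenance_work_mem = 1GB   # speeds up VACUUM and index builds
track_counts = on            # feeds statistics to autovacuum
```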