Apache Cloudberry Deployment and Usage Guide
1. Prerequisites
System Requirements
- Operating System: Linux (RHEL/Rocky Linux 8+, Ubuntu 20.04+) or macOS
- Build Tools: C compiler (GCC), make, autoconf, automake, libtool
- Dependencies:
- PostgreSQL development libraries
- Readline development libraries
- OpenSSL development libraries
- Python 3.x (for some build scripts)
- Memory: Minimum 4GB RAM (8GB+ recommended for development)
- Disk Space: At least 10GB free space for building from source
Required Software
- Git for cloning the repository
- Docker (optional, for sandbox environment)
- CMake (for some build configurations)
- Development headers for system libraries
2. Installation
Clone the Repository
git clone https://github.com/apache/cloudberry.git
cd cloudberry
Install Dependencies
For RHEL/Rocky Linux 8+:
sudo yum groupinstall "Development Tools"
sudo yum install readline-devel openssl-devel python3-devel
sudo yum install postgresql-devel
For Ubuntu 20.04+:
sudo apt-get update
sudo apt-get install build-essential
sudo apt-get install libreadline-dev libssl-dev python3-dev
sudo apt-get install libpq-dev
For macOS:
brew install readline openssl python3
brew install postgresql@15
3. Configuration
Environment Variables
Set these environment variables before building:
# For macOS with Homebrew PostgreSQL (Intel path shown; on Apple Silicon,
# Homebrew installs under /opt/homebrew instead of /usr/local)
export PG_CONFIG=/usr/local/opt/postgresql@15/bin/pg_config
# For Linux systems
export PG_CONFIG=/usr/bin/pg_config
# Optional: Set installation prefix
export PREFIX=/usr/local/cloudberry
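The platform-specific exports above can be folded into one snippet. This is a sketch, assuming the typical default paths shown earlier; the helper name `detect_pg_config` is my own, and you should verify the path on your machine:

```shell
# Pick a likely pg_config location for the current platform.
# These are common defaults, not guaranteed install paths.
detect_pg_config() {
  case "$(uname -s)" in
    Darwin) echo "/usr/local/opt/postgresql@15/bin/pg_config" ;;  # Homebrew (Intel)
    *)      echo "/usr/bin/pg_config" ;;                          # typical Linux package path
  esac
}

export PG_CONFIG="$(detect_pg_config)"
export PREFIX=/usr/local/cloudberry
echo "Using PG_CONFIG=$PG_CONFIG"
```

If the chosen path does not exist, locate the real binary with `which pg_config` and export that instead.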
Configuration Files
Cloudberry uses several configuration files:
- configure script options: customize build parameters
- postgresql.conf: database configuration (generated during installation)
- pg_hba.conf: client authentication configuration
- Environment-specific configurations in devops/sandbox/ for Docker deployments
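To make the role of postgresql.conf concrete, here is a minimal override sketch for a local development instance. The values are illustrative assumptions, not Cloudberry defaults:

```ini
# postgresql.conf (development sketch; tune for your hardware)
listen_addresses = 'localhost'   # accept only local connections
port = 5432                      # default PostgreSQL port
shared_buffers = 128MB           # conservative starting point for dev
log_min_messages = warning       # keep the log quiet during development
```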
4. Build & Run
Build from Source
Step 1: Generate configure script
./configure --prefix=$PREFIX
Step 2: Compile the database
make -j$(nproc)
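Note that `nproc` exists only on Linux; macOS uses `sysctl`. A portable sketch for picking the parallel job count (the helper name `ncpu` is my own):

```shell
# Portable CPU-count helper: prefer nproc (Linux), fall back to
# sysctl (macOS/BSD), and default to 2 if neither is available.
ncpu() {
  if command -v nproc >/dev/null 2>&1; then
    nproc
  else
    sysctl -n hw.ncpu 2>/dev/null || echo 2
  fi
}

echo "Building with $(ncpu) parallel jobs"
# make -j"$(ncpu)"
```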
Step 3: Install Cloudberry
make install
Step 4: Initialize the database cluster
# Create data directory
mkdir -p $PREFIX/data
# Initialize database
initdb -D $PREFIX/data --locale=en_US.UTF-8
Quick Start with Docker Sandbox
For rapid testing and evaluation:
cd devops/sandbox
# Build and start the sandbox environment
./sandbox.sh start
The sandbox provides a pre-configured Cloudberry instance with sample data and tools.
Run Locally (Development)
Start the database server:
# Start Cloudberry in the foreground
pg_ctl -D $PREFIX/data -l logfile start
# Or run in background
pg_ctl -D $PREFIX/data start
Connect to Cloudberry:
# Connect using psql
psql -h localhost -p 5432 -U gpadmin postgres
# Default credentials in sandbox:
# Username: gpadmin
# Password: changeme
Production Build Considerations
Note: --enable-debug and --enable-cassert turn on debugging symbols and internal assertion checks, which noticeably slow the server. They are appropriate for development builds but should be omitted in production. A production configure invocation typically looks like:
./configure \
--prefix=/opt/cloudberry \
--enable-depend \
--with-perl \
--with-python \
--with-openssl \
--with-libxml \
--with-uuid=e2fs
5. Deployment
Deployment Platforms
1. Bare Metal/Virtual Machines
- Suitable for maximum performance and control
- Requires manual cluster setup for MPP architecture
- Use configuration management tools (Ansible, Puppet) for multi-node deployment
2. Docker Containers
- Use the provided Dockerfiles in devops/sandbox/
- Orchestrate with Docker Compose for single-node deployments
- For production multi-node: Consider Kubernetes with StatefulSets
3. Kubernetes (Production Recommended)
- Use Helm charts (check ecosystem repositories)
- Deploy with persistent volumes for data storage
- Configure pod anti-affinity for high availability
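The anti-affinity point above can be sketched in manifest form. This is an illustrative fragment, not a shipped chart; the `app: cloudberry-segment` label is a hypothetical name you would replace with your own:

```yaml
# Sketch: spread segment pods across nodes so one node failure
# cannot take down multiple segments. Embed under the pod template
# spec of a StatefulSet.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: cloudberry-segment   # hypothetical pod label
        topologyKey: kubernetes.io/hostname
```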
4. Cloud Providers
- AWS: Deploy on EC2 instances with EBS volumes
- Azure: Use Azure VMs with managed disks
- GCP: Compute Engine with persistent disks
- Consider managed Kubernetes services (EKS, AKS, GKE) for containerized deployments
Multi-Node MPP Deployment
For true MPP capabilities, deploy across multiple nodes:
# On each segment host (example for 2 segments):
gpseginstall -f hostfile_gpseg
# Initialize cluster
gpinitsystem -c gpconfigs/gpinitsystem_config
# Start cluster
gpstart -a
Create a hostfile (hostfile_gpseg) with all segment hosts:
segment1-host
segment2-host
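For the gpinitsystem step, a minimal configuration file for the two-host layout above might look like the sketch below. The key names follow Greenplum-style conventions that Cloudberry inherits, and every value here is an illustrative assumption; check the template shipped with your build before using it:

```shell
# gpinitsystem_config sketch (illustrative values only)
ARRAY_NAME="Cloudberry Cluster"
SEG_PREFIX=gpseg
PORT_BASE=6000
# Two primary segment directories per host
declare -a DATA_DIRECTORY=(/data/primary /data/primary)
COORDINATOR_HOSTNAME=coordinator-host
COORDINATOR_DIRECTORY=/data/coordinator
COORDINATOR_PORT=5432
TRUSTED_SHELL=ssh
ENCODING=UNICODE
MACHINE_LIST_FILE=hostfile_gpseg
```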
Backup and Recovery
Use the Cloudberry Backup utility from the ecosystem repository:
# Clone backup utility
git clone https://github.com/apache/cloudberry-backup.git
# Configure and run backups
cloudberry-backup --config backup_config.yaml
6. Troubleshooting
Common Build Issues
Issue: configure: error: readline library not found
Solution:
# On Ubuntu/Debian
sudo apt-get install libreadline-dev
# On RHEL/CentOS
sudo yum install readline-devel
Issue: PostgreSQL development headers missing
Solution:
# Ensure PostgreSQL is installed with development packages
# Verify pg_config is in PATH
which pg_config
Issue: Memory exhaustion during compilation
Solution:
# Reduce parallel jobs
make -j2 # Instead of make -j$(nproc)
# Or increase swap space
sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
Runtime Issues
Issue: Cannot start database - port already in use
Solution:
# Check for existing PostgreSQL instances
netstat -tlnp | grep 5432
# Change port in postgresql.conf or stop existing instance
Issue: Authentication failures
Solution: Modify pg_hba.conf in data directory:
# Add line for local connections
host all all 127.0.0.1/32 md5
Issue: Performance problems in MPP setup
Solution:
- Check segment synchronization: gpstate -s
- Verify network connectivity between nodes
- Review postgresql.conf settings for memory allocation
- Check disk I/O performance on segment hosts
Getting Help
- Check Logs: Examine log files in $PREFIX/data/pg_log/
- Community Support:
- Join Slack: https://inviter.co/apache-cloudberry
- GitHub Discussions: https://github.com/apache/cloudberry/discussions
- Q&A Forum: https://github.com/apache/cloudberry/discussions/categories/q-a
- Documentation: https://cloudberry.apache.org/docs/
- Report Bugs: https://github.com/apache/cloudberry/issues
Performance Tuning Tips
- Adjust shared_buffers based on available RAM
- Configure work_mem for complex queries
- Set appropriate maintenance_work_mem for vacuum operations
- Enable query planner statistics with track_counts = on
- Consider partitioning large tables for better parallel processing
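As a concrete starting point, the knobs above might be set like this in postgresql.conf. The values are illustrative assumptions for a host with roughly 32GB of RAM, not recommendations from the Cloudberry project; benchmark with your own workload before adopting them:

```ini
# Illustrative tuning sketch for a ~32GB host
shared_buffers = 8GB         # often sized around 25% of RAM
work_mem = 64MB              # per sort/hash operation, per query
maintenance_work_mem = 1GB   # speeds up VACUUM and index builds
track_counts = on            # feeds statistics to autovacuum
```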