How to Deploy & Use cloudberry

Apache Cloudberry Deployment and Usage Guide

1. Prerequisites

System Requirements

  • Operating System: Linux (RHEL/Rocky Linux 8+, Ubuntu 20.04+) or macOS
  • Build Tools: C compiler (GCC), make, autoconf, automake, libtool
  • Dependencies:
    • PostgreSQL development libraries
    • Readline development libraries
    • OpenSSL development libraries
    • Python 3.x (for some build scripts)
  • Memory: Minimum 4GB RAM (8GB+ recommended for development)
  • Disk Space: At least 10GB free space for building from source

Required Software

  • Git for cloning the repository
  • Docker (optional, for sandbox environment)
  • CMake (for some build configurations)
  • Development headers for system libraries
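
Before building, it can help to confirm that the tools listed above are actually on PATH. The following is a minimal sketch, not something shipped with Cloudberry: the function name missing_tools is hypothetical, and the tool list is only a subset of the lists above (libtool is left out because its command name differs between platforms).

```shell
#!/bin/sh
# Print the name of every command in the argument list that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Tool names taken from the prerequisite lists above (illustrative subset).
missing=$(missing_tools git gcc make autoconf automake python3)
if [ -n "$missing" ]; then
    echo "Missing build tools:" $missing
else
    echo "All listed build tools were found."
fi
```

The check only proves that a command exists; it does not verify versions or the presence of development headers.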

2. Installation

Clone the Repository

git clone https://github.com/apache/cloudberry.git
cd cloudberry

Install Dependencies

For RHEL/Rocky Linux 8+:

sudo yum groupinstall "Development Tools"
sudo yum install readline-devel openssl-devel python3-devel
sudo yum install postgresql-devel

For Ubuntu 20.04+:

sudo apt-get update
sudo apt-get install build-essential
sudo apt-get install libreadline-dev libssl-dev python3-dev
sudo apt-get install libpq-dev

For macOS:

brew install readline openssl python3
brew install postgresql@15

3. Configuration

Environment Variables

Set these environment variables before building:

# For macOS with Homebrew PostgreSQL
# (brew --prefix resolves the install location on both Intel and Apple Silicon)
export PG_CONFIG="$(brew --prefix postgresql@15)/bin/pg_config"

# For Linux systems
export PG_CONFIG=/usr/bin/pg_config

# Optional: Set installation prefix
export PREFIX=/usr/local/cloudberry
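
Later steps call initdb, pg_ctl, and psql by bare name, which only works if the installed binaries are on PATH. The sketch below assumes the usual autoconf layout, where executables are installed under $PREFIX/bin; that layout is an assumption to verify against your install, and the function name use_cloudberry_prefix is hypothetical.

```shell
# Export PREFIX (defaulting to the value used in this guide) and put its
# bin directory at the front of PATH, without adding it twice.
use_cloudberry_prefix() {
    PREFIX=${1:-${PREFIX:-/usr/local/cloudberry}}
    case ":$PATH:" in
        *":$PREFIX/bin:"*) ;;              # already on PATH, nothing to do
        *) PATH="$PREFIX/bin:$PATH" ;;
    esac
    export PREFIX PATH
}

use_cloudberry_prefix
echo "PREFIX=$PREFIX"
```

Placing the function in a file that is sourced from your shell profile keeps the build shell and the runtime shell consistent.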

Configuration Files

Cloudberry uses several configuration files:

  1. configure script options: Customize build parameters
  2. postgresql.conf: Database configuration (generated during installation)
  3. pg_hba.conf: Client authentication configuration
  4. Environment-specific configurations in devops/sandbox/ for Docker deployments

4. Build & Run

Build from Source

Step 1: Run the configure script

./configure --prefix="$PREFIX"

Step 2: Compile the database

make -j$(nproc)

Step 3: Install Cloudberry

make install

Step 4: Initialize the database cluster

# Create data directory
mkdir -p $PREFIX/data

# Initialize database
initdb -D $PREFIX/data --locale=en_US.UTF-8
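
Step 2 relies on nproc, which belongs to GNU coreutils and is usually not present on macOS. A portable sketch, using a hypothetical helper named cpu_count, falls back to getconf and then sysctl, and finally to a single job:

```shell
# Print the number of online CPUs, trying the tools most likely to exist
# on Linux first and on macOS/BSD second. Falls back to 1.
cpu_count() {
    n=$(nproc 2>/dev/null) ||
    n=$(getconf _NPROCESSORS_ONLN 2>/dev/null) ||
    n=$(sysctl -n hw.ncpu 2>/dev/null) ||
    n=1
    # Guard against empty or non-numeric output from any of the tools.
    case "$n" in
        ''|*[!0-9]*) n=1 ;;
    esac
    echo "$n"
}

echo "make -j$(cpu_count)"
```

The last line only prints the command; replace the echo with the make invocation itself once the output looks right.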

Quick Start with Docker Sandbox

For rapid testing and evaluation:

cd devops/sandbox
# Build and start the sandbox environment
./sandbox.sh start

The sandbox provides a pre-configured Cloudberry instance with sample data and tools.

Run Locally (Development)

Start the database server:

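# Start the server in the background, appending its log output to ./logfile
pg_ctl -D $PREFIX/data -l logfile start

# Or start without -l; the server still runs in the background,
# but its log output goes to the terminal
pg_ctl -D $PREFIX/data start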

Connect to Cloudberry:

# Connect using psql
psql -h localhost -p 5432 -U gpadmin postgres

# Default credentials in sandbox:
# Username: gpadmin
# Password: changeme
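
Because pg_ctl returns before the server necessarily accepts connections, a script that connects immediately afterwards can fail intermittently. The sketch below polls pg_isready, the standard PostgreSQL client utility, until the server responds. It assumes your Cloudberry install provides pg_isready on PATH; the function name wait_for_server and its defaults are hypothetical.

```shell
# Poll pg_isready once per second until the server accepts connections
# or the timeout (in seconds) expires. Returns 0 when ready, 1 on timeout.
wait_for_server() {
    host=${1:-localhost}
    port=${2:-5432}
    timeout=${3:-30}
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        if pg_isready -q -h "$host" -p "$port"; then
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 1
}
```

A typical use would be: wait_for_server localhost 5432 60 && psql -h localhost -p 5432 -U gpadmin postgres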

Production Build Considerations

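For production deployments, enable the optional features you need and leave out the development-only switches. --enable-cassert turns on server assertion checks, which slow the server noticeably, and --enable-depend only helps when rebuilding repeatedly during development. --enable-debug keeps debugging symbols and can be retained if you want usable stack traces:

./configure \
  --prefix=/opt/cloudberry \
  --enable-debug \
  --with-perl \
  --with-python \
  --with-openssl \
  --with-libxml \
  --with-uuid=e2fs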

5. Deployment

Deployment Platforms

1. Bare Metal/Virtual Machines

  • Suitable for maximum performance and control
  • Requires manual cluster setup for MPP architecture
  • Use configuration management tools (Ansible, Puppet) for multi-node deployment

2. Docker Containers

  • Use the provided Dockerfiles in devops/sandbox/
  • Orchestrate with Docker Compose for single-node deployments
  • For production multi-node: Consider Kubernetes with StatefulSets

3. Kubernetes (Production Recommended)

  • Use Helm charts (check ecosystem repositories)
  • Deploy with persistent volumes for data storage
  • Configure pod anti-affinity for high availability

4. Cloud Providers

  • AWS: Deploy on EC2 instances with EBS volumes
  • Azure: Use Azure VMs with managed disks
  • GCP: Compute Engine with persistent disks
  • Consider managed Kubernetes services (EKS, AKS, GKE) for containerized deployments

Multi-Node MPP Deployment

For true MPP capabilities, deploy across multiple nodes:

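# From the coordinator host, install the software on every host listed in hostfile_gpseg.
# gpseginstall was removed in Greenplum 6 and may be absent from your Cloudberry release;
# if it is, install the same build on each segment host by other means.
gpseginstall -f hostfile_gpseg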

# Initialize cluster
gpinitsystem -c gpconfigs/gpinitsystem_config

# Start cluster
gpstart -a

Before running these commands, create a hostfile (hostfile_gpseg) that lists every segment host, one hostname per line:

segment1-host
segment2-host
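
Cluster initialization depends on the coordinator reaching every host in the hostfile over SSH without a password prompt. The sketch below reads the hostfile and tests each host. The function names list_hosts and check_hosts are hypothetical, and treating '#' as a comment marker is a convenience of this sketch rather than a documented hostfile feature.

```shell
# Print one host per line, dropping blank lines, surrounding whitespace,
# and anything after a '#'.
list_hosts() {
    awk '{ sub(/#.*/, ""); gsub(/^[ \t]+|[ \t]+$/, ""); if ($0 != "") print }' "$1"
}

# Report whether each host accepts a non-interactive SSH login.
# ssh -n keeps ssh from consuming the host list on stdin.
check_hosts() {
    list_hosts "$1" | while read -r host; do
        if ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$host" true 2>/dev/null; then
            echo "ok    $host"
        else
            echo "FAIL  $host"
        fi
    done
}
```

Running check_hosts hostfile_gpseg before gpinitsystem surfaces missing keys or unreachable hosts early, when they are cheaper to fix.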

Backup and Recovery

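Use the Cloudberry Backup utility from the ecosystem repository. The repository derives from the Greenplum gpbackup project, so the example below assumes it builds the gpbackup and gprestore binaries; confirm the command names and options in its README:

# Clone backup utility
git clone https://github.com/apache/cloudberry-backup.git

# After building and installing it, back up one database
# (mydb and the backup directory are placeholders)
gpbackup --dbname mydb --backup-dir /path/to/backups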

6. Troubleshooting

Common Build Issues

Issue: configure: error: readline library not found
Solution:

# On Ubuntu/Debian
sudo apt-get install libreadline-dev

# On RHEL/CentOS
sudo yum install readline-devel

Issue: PostgreSQL development headers missing
Solution:

# Ensure PostgreSQL is installed with development packages
# Verify pg_config is in PATH
which pg_config

Issue: Memory exhaustion during compilation
Solution:

# Reduce parallel jobs
make -j2  # Instead of make -j$(nproc)
# Or increase swap space
sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
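
Rather than guessing a job count, you can derive one from the memory and CPUs available. The sketch below uses a rule of thumb of one compile job per 2 GB of RAM, capped at the CPU count. That figure is an assumption for illustration, not a measured Cloudberry requirement, and the function name suggest_jobs is hypothetical.

```shell
# Suggest a make -j value from total memory (in kB) and CPU count:
# one job per 2 GB of RAM, capped at the CPU count, never below 1.
suggest_jobs() {
    mem_kb=$1
    cpus=$2
    n=$((mem_kb / 2097152))          # 2 GB expressed in kB
    [ "$n" -gt "$cpus" ] && n=$cpus
    [ "$n" -lt 1 ] && n=1
    echo "$n"
}

suggest_jobs 4194304 8               # 4 GB RAM, 8 CPUs: prints 2
```

On Linux, total memory in kB can be read with awk '/MemTotal/ {print $2}' /proc/meminfo and passed as the first argument.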

Runtime Issues

Issue: Cannot start database - port already in use
Solution:

# Check for existing PostgreSQL instances
netstat -tlnp | grep 5432
# Change port in postgresql.conf or stop existing instance
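
To see which port a data directory is configured to use before changing it, you can read the setting out of postgresql.conf. The following is a rough sketch with a hypothetical function name, configured_port. It handles only plain unquoted numeric values and ignores included files and command-line overrides.

```shell
# Print the port set in a postgresql.conf file, or 5432 if none is set.
# Commented-out lines are skipped. When the parameter appears more than
# once, the last occurrence wins, so only the final match is kept.
configured_port() {
    port=$(sed -n 's/^[[:space:]]*port[[:space:]]*=\{0,1\}[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$1" | tail -n 1)
    echo "${port:-5432}"
}
```

For example, configured_port "$PREFIX/data/postgresql.conf" prints the port to look for in the netstat output.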

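Issue: Authentication failures
Solution: Modify pg_hba.conf in the data directory, then reload the server configuration (for example with pg_ctl reload) so the change takes effect:

# Allow password-authenticated TCP connections from the local machine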
host    all             all             127.0.0.1/32            md5
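
When this change is scripted, it is worth making it idempotent so that repeated runs do not pile up duplicate rules. A minimal sketch, using a hypothetical helper named ensure_hba_line, appends the rule only if an identical line is not already present:

```shell
# Append a rule to a pg_hba.conf-style file unless the exact line exists.
# grep -x matches whole lines only; -F treats the rule as a literal string.
ensure_hba_line() {
    file=$1
    line=$2
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}
```

Usage would look like: ensure_hba_line "$PREFIX/data/pg_hba.conf" "host    all             all             127.0.0.1/32            md5". Keep in mind that pg_hba.conf is evaluated top to bottom and the first matching rule wins, so an appended line has no effect if an earlier rule already matches the same connection.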

Issue: Performance problems in MPP setup
Solution:

  1. Check segment synchronization: gpstate -s
  2. Verify network connectivity between nodes
  3. Review postgresql.conf settings for memory allocation
  4. Check disk I/O performance on segment hosts

Getting Help

  1. Check Logs: Examine log files in $PREFIX/data/pg_log/
  2. Community Support:
  3. Documentation: https://cloudberry.apache.org/docs/
  4. Report Bugs: https://github.com/apache/cloudberry/issues

Performance Tuning Tips

  • Adjust shared_buffers based on available RAM
  • Configure work_mem for complex queries
  • Set appropriate maintenance_work_mem for vacuum operations
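  • Keep track_counts = on (the default) so that table activity statistics are collected for autovacuum, and run ANALYZE to refresh the statistics the query planner uses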
  • Consider partitioning large tables for better parallel processing