Apache Spark Deployment & Usage Guide
1. Prerequisites
System Requirements
- Java: OpenJDK 8, 11, or 17 for Spark 3.5 (Temurin/Adoptium 17 recommended)
- Memory: Minimum 8GB RAM (16GB+ recommended for local development)
- Disk: 20GB+ free space for builds and dependencies
Build Tools (Choose One)
- Maven: 3.9.6 or higher
- SBT: 1.9.0 or higher (for Scala development)
Optional Dependencies
- Python: 3.9, 3.10, 3.11, or 3.12 (for PySpark)
- R: 4.0+ (for SparkR, deprecated)
- Hadoop Client Libraries: 3.3.6+ (if connecting to HDFS/YARN)
Platform Support
- Linux (primary)
- macOS (Intel/Apple Silicon)
- Windows (via WSL2 recommended)
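A quick sanity check of these prerequisites before installing (Linux-oriented commands; adjust for macOS):
# Verify prerequisites
java -version        # expect a supported JDK (8, 11, or 17 for Spark 3.5)
python3 --version    # 3.9-3.12 if you plan to use PySpark
free -h              # confirm available RAM (Linux)
df -h .              # confirm 20GB+ free disk space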
2. Installation
Option A: Pre-built Binaries (Recommended for Users)
# Download a pre-built release (3.5.0 shown; check spark.apache.org/downloads for the latest)
wget https://dlcdn.apache.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
tar -xzf spark-3.5.0-bin-hadoop3.tgz
cd spark-3.5.0-bin-hadoop3
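Before any further configuration, the pre-built binaries can be smoke-tested with a bundled example:
# Verify the download by running a packaged example locally
./bin/run-example SparkPi 10
./bin/spark-submit --version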
Option B: Build from Source (Recommended for Developers)
# Clone repository
git clone https://github.com/apache/spark.git
cd spark
git checkout v3.5.0 # or master for development
# Build with Maven (Scala 2.12 is the default for Spark 3.5)
./build/mvn -DskipTests clean package
# Or build with SBT
./build/sbt -Phadoop-3 -Dhadoop.version=3.3.6 clean package
# Build against Scala 2.13 instead of the default 2.12
./dev/change-scala-version.sh 2.13
./build/mvn -Pscala-2.13 -DskipTests clean package
Python Environment (PySpark)
# Install from PyPI
pip install pyspark==3.5.0
# Or install from source
cd python
pip install -e .
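Either install can be verified with a one-liner that starts and stops a local session (assumes python3 is the interpreter you installed into):
# Sanity-check the PySpark install
python3 -c "from pyspark.sql import SparkSession; s = SparkSession.builder.master('local[1]').getOrCreate(); print(s.version); s.stop()"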
3. Configuration
Environment Variables
Add to ~/.bashrc or ~/.zshrc:
export SPARK_HOME=/path/to/spark
export PATH=$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH
# Python configuration (for PySpark)
export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=python3
# Standalone worker resources
export SPARK_WORKER_CORES=4
export SPARK_WORKER_MEMORY=8g
export SPARK_WORKER_INSTANCES=1
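After reloading the shell profile, confirm the binaries resolve from PATH:
# Verify the environment
source ~/.bashrc          # or ~/.zshrc
echo $SPARK_HOME
which spark-submit
spark-submit --version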
Core Configuration Files
$SPARK_HOME/conf/spark-defaults.conf:
spark.master spark://localhost:7077
spark.driver.memory 4g
spark.executor.memory 4g
spark.sql.adaptive.enabled true
spark.sql.adaptive.coalescePartitions.enabled true
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.sql.warehouse.dir /user/hive/warehouse
$SPARK_HOME/conf/spark-env.sh (copy from template):
export JAVA_HOME=/path/to/java
export SPARK_MASTER_PORT=7077
export SPARK_MASTER_WEBUI_PORT=8080
export SPARK_WORKER_PORT=8888
export SPARK_WORKER_WEBUI_PORT=8081
export SPARK_HISTORY_OPTS="-Dspark.history.ui.port=18080 -Dspark.history.fs.logDirectory=/tmp/spark-events"
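SPARK_HISTORY_OPTS above points the History Server at /tmp/spark-events; applications must also write event logs there for it to show anything. A minimal sketch using the same directory (adjust paths for production):
# Enable event logging and start the History Server (UI at http://localhost:18080)
mkdir -p /tmp/spark-events
echo "spark.eventLog.enabled true" >> $SPARK_HOME/conf/spark-defaults.conf
echo "spark.eventLog.dir file:///tmp/spark-events" >> $SPARK_HOME/conf/spark-defaults.conf
./sbin/start-history-server.sh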
$SPARK_HOME/conf/log4j2.properties:
rootLogger.level = WARN
logger.spark.name = org.apache.spark
logger.spark.level = INFO
Hadoop Integration (Optional)
If using HDFS/YARN, ensure core-site.xml and hdfs-site.xml are in $SPARK_HOME/conf/ or $HADOOP_CONF_DIR.
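A quick check that Spark actually picks up that configuration (the config path is an assumption; adjust to your cluster):
# Point Spark at the Hadoop client configuration
export HADOOP_CONF_DIR=/etc/hadoop/conf
# The default filesystem should now be HDFS rather than the local file system
echo 'println(org.apache.hadoop.fs.FileSystem.get(spark.sparkContext.hadoopConfiguration).getUri)' | ./bin/spark-shell --master local[2]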
4. Build & Run
Local Development Mode
Start Spark Shell (Scala):
./bin/spark-shell --master local[4] --driver-memory 4g
Start PySpark:
./bin/pyspark --master local[4] --driver-memory 4g
Submit Application:
# Scala/Java application
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master local[*] \
--executor-memory 2g \
examples/jars/spark-examples_2.12-3.5.0.jar \
100
# Python application
./bin/spark-submit \
--master local[*] \
--py-files dependencies.zip \
script.py
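The dependencies.zip archive passed via --py-files has to be built beforehand; one common sketch (the requirements.txt is hypothetical) packages pure-Python pip dependencies so executors can import them:
# Package Python dependencies for --py-files
pip install -r requirements.txt --target ./deps
(cd deps && zip -r ../dependencies.zip .)
Packages with native extensions generally need a full environment archive (conda-pack or venv) instead of a plain zip.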
Running Tests
# Run all tests (takes several hours)
./build/mvn test
# Run specific module tests
./build/mvn test -pl core -Dsuites=org.apache.spark.SparkContextSuite
# Run Python tests
python/run-tests.py --modules pyspark-sql
Production Build
# Create distribution package
./dev/make-distribution.sh \
--name custom-spark \
--pip \
--tgz \
-Phadoop-3 \
-Dhadoop.version=3.3.6 \
-Phive -Phive-thriftserver \
-Pyarn \
-Pkubernetes
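The script writes the result as spark-<version>-bin-<name>.tgz next to the source tree; a quick way to unpack and smoke-test it:
# Unpack and verify the custom distribution
tar -xzf spark-3.5.0-bin-custom-spark.tgz
cd spark-3.5.0-bin-custom-spark
./bin/run-example SparkPi 10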
5. Deployment
Standalone Cluster Mode
Start Master:
./sbin/start-master.sh
# Access Web UI: http://localhost:8080
Start Workers:
./sbin/start-worker.sh spark://localhost:7077 -c 4 -m 8g
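Before submitting anything, confirm the worker registered with the master (the /json endpoint is served by the master Web UI on the port configured earlier):
# Check that Master and Worker daemons are up and registered
jps | grep -E 'Master|Worker'
curl -s http://localhost:8080/json/ | grep -A3 '"workers"'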
Submit to Cluster:
./bin/spark-submit \
--master spark://localhost:7077 \
--deploy-mode cluster \
--executor-cores 4 \
--executor-memory 8g \
application.jar
Kubernetes Deployment
# Build Docker image
./bin/docker-image-tool.sh -t my-spark -p kubernetes/dockerfiles/spark/bindings/python/Dockerfile build
# Submit to K8s
./bin/spark-submit \
--master k8s://https://<k8s-api-server>:443 \
--deploy-mode cluster \
--name spark-pi \
--class org.apache.spark.examples.SparkPi \
--conf spark.executor.instances=5 \
--conf spark.kubernetes.container.image=spark:my-spark \
local:///opt/spark/examples/jars/spark-examples_2.12-3.5.0.jar
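In cluster mode the driver pod needs permission to create executor pods; the usual approach is a dedicated service account (namespace and account name below are illustrative):
# Grant the driver a service account that can launch executor pods
kubectl create serviceaccount spark
kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=default:spark --namespace=default
# Then add to spark-submit:
#   --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark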
YARN Deployment (Hadoop)
./bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--executor-memory 4g \
--executor-cores 4 \
--num-executors 10 \
application.jar
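spark-submit finds the YARN cluster through the Hadoop client configuration, so export HADOOP_CONF_DIR (or YARN_CONF_DIR) first; logs are then retrieved through YARN (paths and application ID are illustrative):
# Required before --master yarn will resolve the ResourceManager
export HADOOP_CONF_DIR=/etc/hadoop/conf
# Fetch aggregated container logs after the application finishes
yarn logs -applicationId application_1700000000000_0001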
Cloud Platforms
AWS EMR:
aws emr create-cluster \
--release-label emr-6.15.0 \
--applications Name=Spark \
--instance-type m5.xlarge \
--instance-count 3
Azure HDInsight: Use the Azure Portal or CLI to create a Spark 3.5 cluster on HDInsight 5.1.
Google Cloud Dataproc:
gcloud dataproc clusters create spark-cluster \
--image-version 2.1 \
--region us-central1 \
--master-machine-type n1-standard-4 \
--worker-machine-type n1-standard-4 \
--num-workers 2
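Jobs are then submitted through gcloud rather than spark-submit; a sketch using the SparkPi example bundled on Dataproc images (the jar path is the conventional Dataproc location, treat it as an assumption):
# Submit the bundled SparkPi example to the cluster created above
gcloud dataproc jobs submit spark \
  --cluster spark-cluster \
  --region us-central1 \
  --class org.apache.spark.examples.SparkPi \
  --jars file:///usr/lib/spark/examples/jars/spark-examples.jar \
  -- 100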
6. Troubleshooting
Build Issues
Error: java.lang.UnsupportedClassVersionError
- Solution: Ensure JAVA_HOME points to a supported JDK (8, 11, or 17 for Spark 3.5). Check with java -version.
Error: Maven Out of Memory
- Solution: Increase heap:
export MAVEN_OPTS="-Xmx4g -XX:ReservedCodeCacheSize=2g"
Error: SBT download fails
- Solution: Set proxy settings in build/sbtconfig.txt or use build/sbt -Dsbt.repository.config=...
Runtime Issues
Error: Address already in use: 4040
- Solution: Spark UI port conflict. Change port:
--conf spark.ui.port=4041
Error: No space left on device during shuffle
- Solution: Set local directories:
spark.local.dir=/mnt/spark-tmp (comma-separated for multiple disks)
Error: Python worker failed to connect back
- Solution: Check PYSPARK_PYTHON path. Ensure Python versions match on driver and workers.
Error: Kryo serialization failed
- Solution: Register custom classes:
--conf spark.kryo.classesToRegister=org.example.MyClass
Performance Issues
Symptom: Slow shuffle operations
- Fix: Enable adaptive query execution:
spark.sql.adaptive.enabled=true
spark.sql.adaptive.coalescePartitions.enabled=true
spark.sql.adaptive.skewJoin.enabled=true
Symptom: Out of Memory on Executors
- Fix: Increase spark.executor.memory or reduce spark.executor.cores so fewer concurrent tasks share each executor's heap. Off-heap memory can also help:
spark.memory.offHeap.enabled=true
spark.memory.offHeap.size=4g
Connection Issues
Error: Connection refused: spark-master/7077
- Check: Ensure the spark.master URL matches SPARK_MASTER_PORT in spark-env.sh
- Verify: Master Web UI is accessible and worker logs show successful registration
Error: HDFS DataNode not responding
- Fix: Copy Hadoop configuration files to $SPARK_HOME/conf/ and set HADOOP_CONF_DIR.