Luigi Deployment and Usage Guide
Prerequisites
- Python: Version 2.7 or 3.4+ (Luigi supports both Python 2 and 3)
- Hadoop: For Hadoop-related tasks (optional, but recommended for big data workflows)
- boto3: Required for S3 functionality
- google-api-python-client: Required for BigQuery functionality
- ujson: Optional, for faster JSON parsing (falls back to standard json if not available)
- dateutil: Required for date/time operations in range tools
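The optional packages above can be checked before running any task. The following is a minimal sketch using only the standard library; the helper name missing_optional_modules is hypothetical, and the import names (for example googleapiclient for google-api-python-client) are assumed to match the packages listed above.

```python
import importlib.util

# Import names for the optional packages listed above.
# The import name can differ from the pip package name.
OPTIONAL_MODULES = {
    "boto3": "S3 support",
    "googleapiclient": "BigQuery support",
    "ujson": "faster JSON parsing",
    "dateutil": "date/time range tools",
}


def missing_optional_modules(modules=OPTIONAL_MODULES):
    """Return the import names that cannot be found in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    for name in missing_optional_modules():
        print(f"missing: {name} ({OPTIONAL_MODULES[name]})")
```

Running the script prints one line per package that is not installed and prints nothing when all of them are available.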
Installation
- Clone the repository:
  git clone https://github.com/spotify/luigi.git
  cd luigi
- Install dependencies:
  pip install -r requirements.txt
- Install Luigi:
  pip install .
- Optional dependencies for specific features:
  - For S3 support:
    pip install boto3
  - For BigQuery support:
    pip install google-api-python-client
  - For faster JSON parsing:
    pip install ujson
Configuration
Environment Variables
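Luigi reads its settings from a configuration file. Create a luigi.cfg file in the directory you run Luigi from, or set the LUIGI_CONFIG_PATH environment variable to point at a file stored elsewhere: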
[core]
default-scheduler-host: localhost
default-scheduler-port: 8082
[hadoop]
pool: default
[s3]
aws_access_key_id: YOUR_AWS_ACCESS_KEY
aws_secret_access_key: YOUR_AWS_SECRET_KEY
Configuration Parameters
- Hadoop Pool: Set the Hadoop pool using the pool parameter in the hadoop section
- S3 Credentials: Configure AWS credentials in the s3 section
- Scheduler: Configure scheduler host and port in the core section
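To sanity-check a configuration file before starting the scheduler, the sections described above can be parsed with Python's standard configparser module. This is only a sketch: it does not use Luigi's own configuration loader, the helper name load_scheduler_address is hypothetical, and the sample text simply mirrors the example file.

```python
import configparser

# Same keys as the luigi.cfg example, inlined so the sketch is self-contained.
SAMPLE_CFG = """
[core]
default-scheduler-host: localhost
default-scheduler-port: 8082

[hadoop]
pool: default
"""


def load_scheduler_address(cfg_text):
    """Parse config text and return (host, port) from the [core] section."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    host = parser.get("core", "default-scheduler-host")
    # getint raises ValueError if the port is not a valid integer.
    port = parser.getint("core", "default-scheduler-port")
    return host, port


if __name__ == "__main__":
    print(load_scheduler_address(SAMPLE_CFG))
```

A missing section or key raises a configparser error, which makes typos in section names visible early.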
Build & Run
Running Locally (Development)
- Start the Luigi scheduler:
  luigid
- Run a Luigi task:
  luigi --module my_module MyTask --local-scheduler
- Run with Hadoop:
  luigi --module my_module MyHadoopTask --hadoop-home /path/to/hadoop
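When these commands are launched from a script rather than typed by hand, it can help to assemble the argument list in one place. The sketch below uses only the --module, --local-scheduler, and --workers flags mentioned in this guide; build_luigi_command is a hypothetical helper, not part of Luigi.

```python
import shlex


def build_luigi_command(module, task, local_scheduler=False, workers=None):
    """Build the argument list for a luigi CLI call (hypothetical helper)."""
    cmd = ["luigi", "--module", module, task]
    if local_scheduler:
        cmd.append("--local-scheduler")
    if workers is not None:
        cmd += ["--workers", str(workers)]
    return cmd


if __name__ == "__main__":
    cmd = build_luigi_command("my_module", "MyTask", local_scheduler=True)
    # Print the command in shell-quoted form for inspection.
    print(shlex.join(cmd))
    # To actually run it (requires Luigi on PATH):
    #   import subprocess; subprocess.run(cmd, check=True)
```

Passing the list to subprocess.run avoids shell quoting problems with module or task names.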
Running in Production
- Start the Luigi scheduler as a service:
  luigid --background --logdir /var/log/luigi --state-path /var/lib/luigi/state.pickle
- Run tasks with a central scheduler:
  luigi --module my_module MyTask --scheduler-host localhost --scheduler-port 8082
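Before submitting tasks to a central scheduler, it can be useful to confirm that the scheduler host and port accept connections. The following sketch performs a plain TCP connection test with the standard library; scheduler_reachable is a hypothetical helper, and a successful connection only shows that something is listening on the port, not that it is a healthy Luigi scheduler.

```python
import socket


def scheduler_reachable(host="localhost", port=8082, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    print("scheduler reachable:", scheduler_reachable())
```

The defaults match the host and port used in the examples in this guide; adjust them to the values in your configuration.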
Deployment
Platform Recommendations
- Local Development: Run directly on your development machine
- Production: Deploy on a server or cluster with sufficient resources
- Cloud: Deploy on cloud platforms like AWS, GCP, or Azure
- Containerized: Use Docker for consistent deployment across environments
Deployment Steps
- Package your Luigi tasks:
  python setup.py sdist
- Deploy to your target environment:
  - For cloud platforms: Use your preferred deployment method (e.g., AWS Elastic Beanstalk, Google App Engine)
  - For containerized environments: Build a Docker image and deploy to your container orchestration platform
- Configure the Luigi scheduler:
  - Set up the scheduler to run as a service
  - Configure logging and state persistence
Troubleshooting
Common Issues and Solutions
- ImportError: No module named 'boto3'
  - Solution: Install boto3 using pip install boto3
- ImportError: No module named 'googleapiclient'
  - Solution: Install google-api-python-client using pip install google-api-python-client
- Hadoop tasks failing
  - Solution: Ensure Hadoop is properly installed and configured, and that the hadoop section in your config file is set up correctly
- S3 operations failing
  - Solution: Check your AWS credentials and ensure the S3 bucket exists and is accessible
- Scheduler not starting
  - Solution: Check the scheduler logs for errors, ensure the required ports are open, and verify the configuration
- Tasks not completing
  - Solution: Check task dependencies, ensure all required resources are available, and verify the task logic
Debugging Tips
- Use the --local-scheduler flag for easier debugging during development
- Check the Luigi scheduler web interface at http://localhost:8082 for task status and logs
- Use the --workers flag to control the number of worker processes
- Enable debug logging by setting the LUIGI_DEBUG environment variable to 1
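When tasks are run from a Python script, verbose output can also be requested through the standard logging module. The sketch below assumes that Luigi emits its messages on a logger named luigi-interface; that name is an assumption to verify against your installed version, and enable_luigi_debug_logging is a hypothetical helper.

```python
import logging


def enable_luigi_debug_logging(logger_name="luigi-interface"):
    """Set the named logger to DEBUG and attach one stderr handler."""
    logger = logging.getLogger(logger_name)
    logger.setLevel(logging.DEBUG)
    # Avoid stacking duplicate handlers if the function is called twice.
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = enable_luigi_debug_logging()
    log.debug("debug logging enabled")
```

Call the function once at the start of the script, before any tasks are scheduled.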