How to Deploy & Use spotify/luigi

Luigi Deployment and Usage Guide

Prerequisites

  • Python: Python 3 for current Luigi releases (3.x); Python 2.7 works only with the older Luigi 2.x releases
  • Hadoop: For Hadoop-related tasks (optional, but recommended for big data workflows)
  • boto3: Required for S3 functionality
  • google-api-python-client: Required for BigQuery functionality
  • ujson: Optional, for faster JSON parsing (falls back to standard json if not available)
  • python-dateutil (imported as dateutil): Required for date/time operations in range tools

Installation

  1. Clone the repository:

    git clone https://github.com/spotify/luigi.git
    cd luigi
    
  2. Install dependencies:

    pip install -r requirements.txt
    
  3. Install Luigi:

    pip install .
    
  4. Optional dependencies for specific features:

    • For S3 support: pip install boto3
    • For BigQuery support: pip install google-api-python-client
    • For faster JSON parsing: pip install ujson
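
Before running a pipeline, it can help to confirm which of the optional packages are importable in the active environment. The following is a minimal sketch using only the standard library. The import names (boto3, googleapiclient, ujson, dateutil) follow the packages listed above; the helper name check_optional_dependencies is hypothetical and not part of Luigi.

```python
from importlib.util import find_spec

# Feature -> top-level import name, taken from the packages listed in this guide.
OPTIONAL_MODULES = {
    "S3 support": "boto3",
    "BigQuery support": "googleapiclient",
    "Faster JSON parsing": "ujson",
    "Date ranges": "dateutil",
}


def check_optional_dependencies(modules=OPTIONAL_MODULES):
    """Return {feature: bool}, True when the module can be located (hypothetical helper)."""
    return {feature: find_spec(name) is not None for feature, name in modules.items()}


if __name__ == "__main__":
    for feature, available in check_optional_dependencies().items():
        print(f"{feature}: {'installed' if available else 'missing'}")
```

Any feature reported as missing can be installed with the matching pip command from step 4.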

Configuration

Configuration File

Luigi reads its settings from a configuration file. Create a luigi.cfg file in the directory you run Luigi from, or point the LUIGI_CONFIG_PATH environment variable at a file elsewhere:

[core]
default_scheduler_host: localhost
default_scheduler_port: 8082

[hadoop]
pool: default

[s3]
aws_access_key_id: YOUR_AWS_ACCESS_KEY
aws_secret_access_key: YOUR_AWS_SECRET_KEY
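
To sanity-check a configuration file before starting any tasks, it can be parsed with Python's standard configparser module. This is a sketch under stated assumptions, not the way Luigi itself loads its configuration: the function name load_scheduler_address is hypothetical, and option names are normalised so that hyphenated and underscored spellings both resolve.

```python
import configparser

# Sample text mirroring the [core] and [hadoop] sections of a luigi.cfg file.
SAMPLE_CFG = """
[core]
default-scheduler-host: localhost
default-scheduler-port: 8082

[hadoop]
pool: default
"""


def load_scheduler_address(cfg_text):
    """Parse luigi.cfg-style text and return (host, port) from [core] (hypothetical helper)."""
    parser = configparser.ConfigParser()
    # Treat hyphens and underscores in option names as equivalent.
    parser.optionxform = lambda key: key.lower().replace("-", "_")
    parser.read_string(cfg_text)
    host = parser.get("core", "default_scheduler_host")
    port = parser.getint("core", "default_scheduler_port")  # raises ValueError if not numeric
    return host, port


if __name__ == "__main__":
    print(load_scheduler_address(SAMPLE_CFG))
```

A missing section or option raises configparser.NoSectionError or configparser.NoOptionError, which points at typos early.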

Configuration Parameters

  • Hadoop Pool: Set the Hadoop pool using the pool parameter in the hadoop section
  • S3 Credentials: Configure AWS credentials in the s3 section
  • Scheduler: Configure scheduler host and port in the core section

Build & Run

Running Locally (Development)

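  1. Optionally start the Luigi scheduler daemon (needed for the web interface and for any task run without --local-scheduler):

    luigid
    
  2. Run a Luigi task with the in-process local scheduler, which does not contact luigid:

    luigi --module my_module MyTask --local-scheduler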
    
  3. Run a Hadoop task (Hadoop settings such as the pool are read from the hadoop section of luigi.cfg):

    luigi --module my_module MyHadoopTask
    

Running in Production

  1. Start the Luigi scheduler as a service:

    luigid --background --logdir /var/log/luigi --state-path /var/lib/luigi/state.pickle
    
  2. Run tasks with a central scheduler:

    luigi --module my_module MyTask --scheduler-host localhost --scheduler-port 8082
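
When the same task is launched from cron jobs or wrapper scripts in both modes, the command line can be assembled programmatically. The sketch below only builds the argument list from flags that appear in this guide (--module, --local-scheduler, --scheduler-host, --scheduler-port, --workers). The function name build_luigi_command is hypothetical, and my_module and MyTask are the placeholder names used above.

```python
import shlex


def build_luigi_command(module, task, scheduler_host=None, scheduler_port=None, workers=None):
    """Return the argv list for a `luigi` invocation (hypothetical helper).

    Without scheduler_host the task runs with --local-scheduler;
    otherwise it targets the central scheduler at the given address.
    """
    argv = ["luigi", "--module", module, task]
    if scheduler_host is None:
        argv.append("--local-scheduler")
    else:
        argv += ["--scheduler-host", scheduler_host]
        if scheduler_port is not None:
            argv += ["--scheduler-port", str(scheduler_port)]
    if workers is not None:
        argv += ["--workers", str(workers)]
    return argv


if __name__ == "__main__":
    # Development: in-process scheduler.
    print(shlex.join(build_luigi_command("my_module", "MyTask")))
    # Production: central scheduler with four workers.
    print(shlex.join(build_luigi_command("my_module", "MyTask", "localhost", 8082, workers=4)))
```

The returned list can be passed directly to subprocess.run(argv, check=True), which avoids shell quoting problems.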
    

Deployment

Platform Recommendations

  • Local Development: Run directly on your development machine
  • Production: Deploy on a server or cluster with sufficient resources
  • Cloud: Deploy on cloud platforms like AWS, GCP, or Azure
  • Containerized: Use Docker for consistent deployment across environments

Deployment Steps

  1. Package your Luigi tasks:

    python setup.py sdist
    
  2. Deploy to your target environment:

    • For cloud platforms: Use your preferred deployment method (e.g., AWS Elastic Beanstalk, Google App Engine)
    • For containerized environments: Build a Docker image and deploy to your container orchestration platform
  3. Configure the Luigi scheduler:

    • Set up the scheduler to run as a service
    • Configure logging and state persistence

Troubleshooting

Common Issues and Solutions

  1. ImportError: No module named 'boto3'

    • Solution: Install boto3 using pip install boto3
  2. ImportError: No module named 'googleapiclient'

    • Solution: Install google-api-python-client using pip install google-api-python-client
  3. Hadoop tasks failing

    • Solution: Ensure Hadoop is properly installed and configured, and that the hadoop section in your config file is set up correctly
  4. S3 operations failing

    • Solution: Check your AWS credentials and ensure the S3 bucket exists and is accessible
  5. Scheduler not starting

    • Solution: Check the scheduler logs for errors, ensure the required ports are open, and verify the configuration
  6. Tasks not completing

    • Solution: Check task dependencies, ensure all required resources are available, and verify the task logic
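
For the scheduler issue above, a quick first check is whether anything is listening on the scheduler port at all. The following sketch uses only the standard library and assumes the default address localhost:8082 from this guide; the function name scheduler_reachable is hypothetical. A successful TCP connection only shows that the port is open, not that the process behind it is a healthy Luigi scheduler.

```python
import socket


def scheduler_reachable(host="localhost", port=8082, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


if __name__ == "__main__":
    status = "reachable" if scheduler_reachable() else "not reachable"
    print(f"Scheduler on localhost:8082 is {status}")
```

If the port is not reachable, check that luigid is running and that no firewall blocks the port before looking at task-level problems.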

Debugging Tips

  • Use the --local-scheduler flag for easier debugging during development
  • Check the Luigi scheduler web interface at http://localhost:8082 for task status and logs
  • Use the --workers flag to control the number of worker processes
  • Enable debug logging by passing --log-level DEBUG on the luigi command line