Tesseract OCR Deployment and Usage Guide
Prerequisites
Runtime Dependencies
- Operating System: Linux, macOS, or Windows
- Compiler: GCC 4.8+, Clang 3.4+, or MSVC 2015+ (see supported compilers)
- Build System: CMake 3.5+
- Image Libraries: Leptonica (required for image processing)
- Additional Libraries (for optional features):
  - libarchive (for reading compressed traineddata archives)
  - libcurl (for reading input images from URLs)
  - libtiff, libpng, libjpeg (for image format support, used through Leptonica)
Development Dependencies
- Git (for cloning the repository)
- Development headers for all runtime dependencies
- Python 3.x (for training tools and scripts)
Data Dependencies
- Trained Data Files: Language-specific traineddata files from tessdata repository
Installation
Option 1: Install via Pre-built Binary Package
Ubuntu/Debian:
sudo apt-get update
sudo apt-get install tesseract-ocr
sudo apt-get install tesseract-ocr-<langcode> # For specific languages
macOS (using Homebrew):
brew install tesseract
brew install tesseract-lang # For language data
Windows (using Chocolatey):
choco install tesseract
Option 2: Build from Source
- Clone the repository:
git clone https://github.com/tesseract-ocr/tesseract.git
cd tesseract
- Create build directory and configure:
mkdir build
cd build
cmake ..
- Build and install:
cmake --build .
sudo cmake --build . --target install
- Install language data:
# Download and install language data
sudo apt-get install tesseract-ocr-all # Ubuntu/Debian
# Or manually download from tessdata repository
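For the manual route, each language is a single file named <langcode>.traineddata. The following is a minimal sketch, assuming the files are served from the main branch of the tesseract-ocr repositories on GitHub (tessdata, tessdata_fast, tessdata_best). The helper name tessdata_url is hypothetical and not part of Tesseract.

```sh
# Print the download URL for one language file.
# Assumption: files live on the "main" branch of github.com/tesseract-ocr/<repo>.
tessdata_url() {
    # $1 = language code (e.g. eng), $2 = repository name (default: tessdata)
    printf 'https://github.com/tesseract-ocr/%s/raw/main/%s.traineddata\n' \
        "${2:-tessdata}" "$1"
}

# Usage (needs curl and write access to the tessdata directory):
#   curl -fL -o "$TESSDATA_PREFIX/eng.traineddata" "$(tessdata_url eng)"
```

Keeping the URL construction in one place makes it easy to switch between the repositories without editing every download command.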
Configuration
Environment Variables
- TESSDATA_PREFIX: Path to traineddata files (default: /usr/local/share/tessdata or system-specific)
- OMP_THREAD_LIMIT: Number of threads for parallel processing (default: number of CPU cores)
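A quick way to confirm that TESSDATA_PREFIX points at usable data is to look for the language file directly. This sketch assumes the Tesseract 4+ convention that TESSDATA_PREFIX names the directory containing the .traineddata files, and falls back to the default path given above. The function name tessdata_has_lang is hypothetical.

```sh
# Return 0 if <lang>.traineddata exists in the tessdata directory, 1 otherwise.
# Assumption: TESSDATA_PREFIX is the directory that holds the files themselves.
tessdata_has_lang() {
    # $1 = language code, e.g. eng
    [ -f "${TESSDATA_PREFIX:-/usr/local/share/tessdata}/$1.traineddata" ]
}

# Usage:
#   tessdata_has_lang eng || echo "eng.traineddata not found" >&2
```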
Configuration Files
- tesseract.cfg: Global configuration file (location varies by system)
- Custom configs: Create .cfg files in the same directory as traineddata files
API Keys
No API keys required for local usage. For cloud deployment, configure authentication based on your cloud provider.
Build & Run
Command Line Usage
Basic OCR:
tesseract input.png output -l eng
With specific OCR engine mode:
# Legacy engine (Tesseract 3)
tesseract input.png output --oem 0
# Neural net LSTM engine (Tesseract 4+)
tesseract input.png output --oem 1
# Default (best available)
tesseract input.png output --oem 3
With page segmentation mode:
# Automatic page segmentation (default)
tesseract input.png output --psm 3
# Assume single uniform block of text
tesseract input.png output --psm 6
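Page segmentation modes are plain integers, which are easy to mix up in scripts. The sketch below maps a few layout names to their mode numbers. Modes 3 and 6 are the ones shown above; 7 (single text line) and 8 (single word) are additional modes of the tesseract CLI. The function name psm_for and the layout keywords are hypothetical.

```sh
# Map a layout keyword to a --psm value; fail on unknown keywords.
psm_for() {
    case "$1" in
        auto)  echo 3 ;;   # automatic page segmentation (default)
        block) echo 6 ;;   # single uniform block of text
        line)  echo 7 ;;   # single text line
        word)  echo 8 ;;   # single word
        *)     echo "unknown layout: $1" >&2; return 1 ;;
    esac
}

# Usage:
#   tesseract label.png output --psm "$(psm_for line)"
```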
Output formats:
# Plain text (default)
tesseract input.png output
# hOCR (HTML)
tesseract input.png output hocr
# PDF
tesseract input.png output pdf
# TSV (tab-separated values)
tesseract input.png output tsv
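The TSV output is convenient for post-processing because it carries a confidence value per word. The sketch below assumes the usual column layout (level, page_num, block_num, par_num, line_num, word_num, left, top, width, height, conf, text), where word rows have level 5. The function name low_conf_words and the default threshold of 60 are illustrative choices, not Tesseract settings.

```sh
# Print "text<TAB>conf" for every word whose confidence is below a threshold.
# $1 = TSV file written by "tesseract input.png output tsv"
# $2 = confidence threshold (default 60, an arbitrary example value)
low_conf_words() {
    awk -F '\t' -v limit="${2:-60}" '
        # Skip the header row; keep word-level rows (level 5) under the limit.
        NR > 1 && $1 == 5 && $11 + 0 < limit { print $12 "\t" $11 }
    ' "$1"
}

# Usage:
#   tesseract input.png output tsv
#   low_conf_words output.tsv 50
```

Words flagged this way are good candidates for manual review or for a second pass with a different page segmentation mode.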
Library Usage (C++)
#include <cstdio>
#include <tesseract/baseapi.h>
#include <leptonica/allheaders.h>

int main() {
    tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
    // Initialize tesseract-ocr with English; Init() returns 0 on success
    if (api->Init(NULL, "eng")) {
        fprintf(stderr, "Could not initialize tesseract.\n");
        delete api;
        return 1;
    }
    // Open input image; pixRead() returns NULL on failure
    Pix *image = pixRead("input.png");
    if (image == NULL) {
        fprintf(stderr, "Could not read input.png.\n");
        api->End();
        delete api;
        return 1;
    }
    api->SetImage(image);
    // Get OCR result
    char *outText = api->GetUTF8Text();
    printf("%s", outText);
    // Destroy used objects and release memory
    api->End();
    delete api;
    delete[] outText;
    pixDestroy(&image);
    return 0;
}
Library Usage (C API)
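#include <stdio.h>
#include <tesseract/capi.h>
#include <leptonica/allheaders.h>

int main(void) {
    TessBaseAPI *handle = TessBaseAPICreate();
    if (TessBaseAPIInit3(handle, NULL, "eng")) {
        fprintf(stderr, "Could not initialize tesseract.\n");
        TessBaseAPIDelete(handle);
        return 1;
    }
    // Set image; pixRead() returns NULL on failure
    struct Pix *image = pixRead("input.png");
    if (image == NULL) {
        fprintf(stderr, "Could not read input.png.\n");
        TessBaseAPIDelete(handle);
        return 1;
    }
    TessBaseAPISetImage2(handle, image);
    // Get text; the caller owns the returned buffer
    char *text = TessBaseAPIGetUTF8Text(handle);
    printf("%s", text);
    // Release the text, the engine, and the image
    TessDeleteText(text);
    TessBaseAPIEnd(handle);
    TessBaseAPIDelete(handle);
    pixDestroy(&image);
    return 0;
}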
Deployment
Local Deployment
Linux/macOS:
# After building from source
sudo make install
# or
sudo cmake --build . --target install
Windows:
- Use the installer from GitHub releases
- Or build from source using Visual Studio
Docker Deployment
Dockerfile:
FROM ubuntu:22.04
RUN apt-get update && apt-get install -y \
tesseract-ocr \
tesseract-ocr-eng \
libtesseract-dev \
libleptonica-dev
# Copy your application code
COPY . /app
WORKDIR /app
CMD ["./your-ocr-application"]
Build and run:
docker build -t tesseract-ocr-app .
docker run -it tesseract-ocr-app
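Because the image installs the tesseract-ocr package, the tesseract CLI can also be run inside the container against files on the host. The sketch below assumes the image tag tesseract-ocr-app built above and mounts the current directory at /data, a path chosen only for this example. DOCKER_BIN is a hypothetical override that lets the command be previewed without Docker.

```sh
# Run the tesseract CLI inside the container on a file in the current directory.
# $1 = image file name in the current directory, $2 = output base name
docker_ocr() {
    # The command after the image name replaces the CMD from the Dockerfile.
    "${DOCKER_BIN:-docker}" run --rm -v "$PWD:/data" -w /data \
        tesseract-ocr-app tesseract "$1" "$2" -l eng
}

# Usage:
#   docker_ocr input.png output                  # writes output.txt next to input.png
#   DOCKER_BIN=echo docker_ocr input.png output  # print the arguments instead of running
```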
Cloud Deployment
AWS Lambda:
- Package Tesseract with your function (Lambda layers)
- Use Amazon Linux 2 base image
- Include traineddata files in deployment package
Google Cloud Functions:
- Use custom runtime with Tesseract installed
- Include traineddata files in env/ directory
Azure Functions:
- Use custom Docker container
- Install Tesseract in container image
Troubleshooting
Common Issues
1. "Error opening data file"
# Solution: Set TESSDATA_PREFIX or install language data
export TESSDATA_PREFIX=/path/to/tessdata
tesseract input.png output -l eng
2. "Leptonica not found"
# Solution: Install Leptonica development package
sudo apt-get install libleptonica-dev
3. "Cannot open shared object file"
# Solution: Add Tesseract library path to LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
4. Poor OCR accuracy
# Solutions:
# - Improve image quality (deskew, binarize, denoise)
# - Try different page segmentation modes
# - Use appropriate language data
# - Preprocess image with ImageMagick
convert input.png -density 300 -depth 8 -background white -flatten preprocessed.png
5. Memory issues with large images
# Solutions:
# - Resize large images before processing
# - Use --psm 6 for single block of text
# - Increase available memory
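For the resize step, a scale factor can be derived from the image dimensions so that only oversized images are shrunk. This is a sketch using integer arithmetic. The function name resize_percent is hypothetical, and the 4000-pixel limit in the usage lines is an arbitrary example, not a Tesseract recommendation.

```sh
# Print the integer percentage that shrinks the longer side to at most max_side.
# Prints 100 when the image already fits.
# $1 = width, $2 = height, $3 = max_side (all in pixels)
resize_percent() (
    longest=$1
    if [ "$2" -gt "$longest" ]; then
        longest=$2
    fi
    if [ "$longest" -le "$3" ]; then
        echo 100
    else
        echo $(( $3 * 100 / longest ))
    fi
)

# Usage (ImageMagick's identify reports width and height):
#   set -- $(identify -format '%w %h' input.png)
#   convert input.png -resize "$(resize_percent "$1" "$2" 4000)%" resized.png
```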
Performance Optimization
Threading:
# Limit threads to avoid oversubscription
OMP_THREAD_LIMIT=4 tesseract input.png output
Batch processing:
# Process multiple images efficiently
for img in *.png; do
tesseract "$img" "${img%.png}" -l eng pdf
done
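For larger batches it helps to make the loop restartable and to report failures instead of stopping at the first one. The following sketch wraps the same command in a function that skips images whose PDF already exists. The function name batch_ocr and the TESSERACT_BIN override (used to substitute another binary or a stub) are hypothetical.

```sh
# OCR every PNG in a directory to PDF, skipping files that are already done.
# $1 = directory, $2 = language (default: eng)
batch_ocr() (
    dir=$1
    lang=${2:-eng}
    status=0
    for img in "$dir"/*.png; do
        [ -e "$img" ] || continue          # the glob matched nothing
        base=${img%.png}
        [ -e "$base.pdf" ] && continue     # output exists from an earlier run
        "${TESSERACT_BIN:-tesseract}" "$img" "$base" -l "$lang" pdf || {
            echo "failed: $img" >&2
            status=1
        }
    done
    exit "$status"                         # non-zero if any image failed
)

# Usage:
#   batch_ocr ./scans eng
```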
Image preprocessing:
# Use ImageMagick for preprocessing
convert input.png -deskew 40% -normalize -depth 8 preprocessed.png
tesseract preprocessed.png output -l eng
Debug Mode
Enable debug output:
# Raise the log level (Tesseract 5+)
tesseract input.png output --loglevel DEBUG
# Or set a debug parameter with -c (parameter names are listed by --print-parameters)
tesseract input.png output -c tessedit_write_images=true
Check version and configuration:
tesseract --version
tesseract --print-parameters
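Scripts that depend on features of a particular release can read the major version from the version banner. The sketch below assumes the first line of the banner looks like "tesseract 5.3.0" (optionally with a "v" before the number). The function name tess_major is hypothetical.

```sh
# Read "tesseract --version" output on stdin and print the major version number.
# Prints nothing if the first line does not match the assumed banner format.
tess_major() {
    sed -n '1s/^tesseract v\{0,1\}\([0-9][0-9]*\)\..*/\1/p'
}

# Usage:
#   major=$(tesseract --version 2>&1 | tess_major)
#   [ "${major:-0}" -ge 4 ] || echo "the LSTM engine (--oem 1) needs Tesseract 4 or later" >&2
```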
This guide provides a comprehensive overview of deploying and using Tesseract OCR. For more detailed information, refer to the official Tesseract documentation.