PDF Processing Pro Production-ready PDF processing toolkit with pre-built scripts, comprehensive error handling, and support for complex workflows. Quick start Extract text from PDF import pdfplumber with pdfplumber . open ( "document.pdf" ) as pdf : text = pdf . pages [ 0 ] . extract_text ( ) print ( text ) Analyze PDF form (using included script) python scripts/analyze_form.py input.pdf --output fields.json
Returns: JSON with all form fields, types, and positions
Fill PDF form with validation python scripts/fill_form.py input.pdf data.json output.pdf
Validates all fields before filling, includes error reporting
Extract tables from PDF python scripts/extract_tables.py report.pdf --output tables.csv
Extracts all tables with automatic column detection
- Features
- ✅ Production-ready scripts
- All scripts include:
- Error handling
-
- Graceful failures with detailed error messages
- Validation
-
- Input validation and type checking
- Logging
-
- Configurable logging with timestamps
- Type hints
-
- Full type annotations for IDE support
- CLI interface
- :
- --help
- flag for all scripts
- Exit codes
-
- Proper exit codes for automation
- ✅ Comprehensive workflows
- PDF Forms
-
- Complete form processing pipeline
- Table Extraction
-
- Advanced table detection and extraction
- OCR Processing
-
- Scanned PDF text extraction
- Batch Operations
-
- Process multiple PDFs efficiently
- Validation
- Pre and post-processing validation Advanced topics PDF Form Processing For complete form workflows including: Field analysis and detection Dynamic form filling Validation rules Multi-page forms Checkbox and radio button handling See FORMS.md Table Extraction For complex table extraction: Multi-page tables Merged cells Nested tables Custom table detection Export to CSV/Excel See TABLES.md OCR Processing For scanned PDFs and image-based documents: Tesseract integration Language support Image preprocessing Confidence scoring Batch OCR See OCR.md Included scripts Form processing analyze_form.py - Extract form field information python scripts/analyze_form.py input.pdf [ --output fields.json ] [ --verbose ] fill_form.py - Fill PDF forms with data python scripts/fill_form.py input.pdf data.json output.pdf [ --validate ] validate_form.py - Validate form data before filling python scripts/validate_form.py data.json schema.json Table extraction extract_tables.py - Extract tables to CSV/Excel python scripts/extract_tables.py input.pdf [ --output tables.csv ] [ --format csv | excel ] Text extraction extract_text.py - Extract text with formatting preservation python scripts/extract_text.py input.pdf [ --output text.txt ] [ --preserve-formatting ] Utilities merge_pdfs.py - Merge multiple PDFs python scripts/merge_pdfs.py file1.pdf file2.pdf file3.pdf --output merged.pdf split_pdf.py - Split PDF into individual pages python scripts/split_pdf.py input.pdf --output-dir pages/ validate_pdf.py - Validate PDF integrity python scripts/validate_pdf.py input.pdf Common workflows Workflow 1: Process form submissions
1. Analyze form structure
python scripts/analyze_form.py template.pdf --output schema.json
2. Validate submission data
python scripts/validate_form.py submission.json schema.json
3. Fill form
python scripts/fill_form.py template.pdf submission.json completed.pdf
4. Validate output
python scripts/validate_pdf.py completed.pdf Workflow 2: Extract data from reports
1. Extract tables
python scripts/extract_tables.py monthly_report.pdf --output data.csv
2. Extract text for analysis
python scripts/extract_text.py monthly_report.pdf --output report.txt Workflow 3: Batch processing import glob from pathlib import Path import subprocess
Process all PDFs in directory
for pdf_file in glob . glob ( "invoices/*.pdf" ) : output_file = Path ( "processed" ) / Path ( pdf_file ) . name result = subprocess . run ( [ "python" , "scripts/extract_text.py" , pdf_file , "--output" , str ( output_file ) ] , capture_output = True ) if result . returncode == 0 : print ( f"✓ Processed: { pdf_file } " ) else : print ( f"✗ Failed: { pdf_file } - { result . stderr } " ) Error handling All scripts follow consistent error patterns:
Exit codes
0 - Success
1 - File not found
2 - Invalid input
3 - Processing error
4 - Validation error
Example usage in automation
result
subprocess . run ( [ "python" , "scripts/fill_form.py" , . . . ] ) if result . returncode == 0 : print ( "Success" ) elif result . returncode == 4 : print ( "Validation failed - check input data" ) else : print ( f"Error occurred: { result . returncode } " ) Dependencies All scripts require: pip install pdfplumber pypdf pillow pytesseract pandas Optional for OCR:
Install tesseract-ocr system package
macOS: brew install tesseract
Ubuntu: apt-get install tesseract-ocr
Windows: Download from GitHub releases
Performance tips Use batch processing for multiple PDFs Enable multiprocessing with --parallel flag (where supported) Cache extracted data to avoid re-processing Validate inputs early to fail fast Use streaming for large PDFs (>50MB) Best practices Always validate inputs before processing Use try-except in custom scripts Log all operations for debugging Test with sample PDFs before production Set timeouts for long-running operations Check exit codes in automation Backup originals before modification Troubleshooting Common issues "Module not found" errors : pip install -r requirements.txt Tesseract not found :
Install tesseract system package (see Dependencies)
Memory errors with large PDFs :
Process page by page instead of loading entire PDF
with pdfplumber . open ( "large.pdf" ) as pdf : for page in pdf . pages : text = page . extract_text ( )
Process page immediately
Permission errors : chmod +x scripts/*.py Getting help All scripts support --help : python scripts/analyze_form.py --help python scripts/extract_tables.py --help For detailed documentation on specific topics, see: FORMS.md - Complete form processing guide TABLES.md - Advanced table extraction OCR.md - Scanned PDF processing