ocr-document-processor


Installation

npx skills add https://github.com/dkyazzentwatwa/chatgpt-skills --skill ocr-document-processor

OCR Document Processor

Extract text from images, scanned PDFs, and photographs using Optical Character Recognition (OCR). Supports multiple languages, structured output formats, and intelligent document parsing.

Core Capabilities

- Image OCR: Extract text from PNG, JPEG, TIFF, BMP images
- PDF OCR: Process scanned PDFs page by page
- Multi-language: Support for 100+ languages
- Structured Output: Plain text, Markdown, JSON, or HTML
- Table Detection: Extract tabular data to CSV/JSON
- Batch Processing: Process multiple documents at once
- Quality Assessment: Confidence scoring for OCR results

Quick Start

from scripts.ocr_processor import OCRProcessor

Simple text extraction

processor = OCRProcessor("document.png")
text = processor.extract_text()
print(text)

Extract to structured format

result = processor.extract_structured()
print(result['text'])
print(result['confidence'])
print(result['blocks'])  # Text blocks with positions

Core Workflow

1. Basic Text Extraction

from scripts.ocr_processor import OCRProcessor

From image

processor = OCRProcessor("scan.png")
text = processor.extract_text()

From PDF

processor = OCRProcessor("scanned.pdf")
text = processor.extract_text()  # All pages

Specific pages

text = processor.extract_text(pages=[1, 2, 3])

2. Structured Extraction

Get detailed results

result = processor.extract_structured()

Result contains:

- text: Full extracted text

- blocks: Text blocks with bounding boxes

- lines: Individual lines

- words: Individual words with confidence

- confidence: Overall confidence score

- language: Detected language
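
To illustrate how the fields above might be consumed, here is a minimal sketch that puts `blocks` into reading order using their bounding boxes. It assumes each block is a dict with `text` and a `bbox` of `[x, y, width, height]`, as in the JSON export structure shown in this document; whether `extract_structured()` returns exactly that shape is not confirmed here. `blocks_in_reading_order` and `row_tolerance` are hypothetical names, not part of the skill.

```python
from typing import Dict, List


def blocks_in_reading_order(blocks: List[Dict], row_tolerance: int = 10) -> List[Dict]:
    """Sort blocks top-to-bottom, then left-to-right within a row.

    A block joins the current row when its top edge is within
    `row_tolerance` pixels of the first block in that row.
    """
    rows: List[List[Dict]] = []
    for block in sorted(blocks, key=lambda b: b["bbox"][1]):
        if rows and abs(block["bbox"][1] - rows[-1][0]["bbox"][1]) < row_tolerance:
            rows[-1].append(block)
        else:
            rows.append([block])

    ordered: List[Dict] = []
    for row in rows:
        ordered.extend(sorted(row, key=lambda b: b["bbox"][0]))
    return ordered


# Hypothetical blocks: two columns on one row, then a footer.
sample_blocks = [
    {"text": "right column", "bbox": [300, 102, 200, 40]},
    {"text": "footer", "bbox": [10, 500, 400, 20]},
    {"text": "left column", "bbox": [10, 100, 200, 40]},
]
ordered_texts = [b["text"] for b in blocks_in_reading_order(sample_blocks)]
print(ordered_texts)
```

Grouping by a tolerance rather than exact `y` values is a design choice: OCR bounding boxes on the same visual line rarely share an identical top coordinate.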

3. Export Formats

Export to Markdown

processor.export_markdown("output.md")

Export to JSON

processor.export_json("output.json")

Export to searchable PDF

processor.export_searchable_pdf("searchable.pdf")

Export to HTML

processor.export_html("output.html")

Language Support

Specify language for better accuracy

processor = OCRProcessor("german_doc.png", lang='deu')

Multiple languages

processor = OCRProcessor("mixed_doc.png", lang='eng+fra+deu')

Auto-detect language

processor = OCRProcessor("document.png", lang='auto')
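
The multi-language form above joins codes with `+`. A small sketch of building that string from a list is shown below; `build_lang` and `KNOWN_CODES` are hypothetical helpers for illustration, and the code set is only the common-language subset listed in this document, not the full set the OCR engine supports.

```python
from typing import Iterable

# Common codes listed in this document; the real engine supports many more.
KNOWN_CODES = {
    "eng", "fra", "deu", "spa", "ita", "por", "rus",
    "chi_sim", "chi_tra", "jpn", "kor", "ara", "hin", "nld",
}


def build_lang(codes: Iterable[str]) -> str:
    """Join language codes with '+', dropping duplicates but keeping order."""
    unique = list(dict.fromkeys(codes))
    if not unique:
        raise ValueError("at least one language code is required")
    unknown = [code for code in unique if code not in KNOWN_CODES]
    if unknown:
        raise ValueError(f"unknown language code(s): {', '.join(unknown)}")
    return "+".join(unique)


lang = build_lang(["eng", "fra", "deu", "eng"])
print(lang)
```

The result could then be passed as the `lang` argument, for example `OCRProcessor("mixed_doc.png", lang=lang)`.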

Supported Languages (Common)

Code     Language               Code     Language
eng      English                fra      French
deu      German                 spa      Spanish
ita      Italian                por      Portuguese
rus      Russian                chi_sim  Chinese (Simplified)
chi_tra  Chinese (Traditional)  jpn      Japanese
kor      Korean                 ara      Arabic
hin      Hindi                  nld      Dutch

Image Preprocessing

Preprocessing improves OCR accuracy on low-quality images.

Enable preprocessing

processor = OCRProcessor("noisy_scan.png")
processor.preprocess(
    deskew=True,     # Fix rotation
    denoise=True,    # Remove noise
    threshold=True,  # Binarize image
    contrast=1.5     # Enhance contrast
)
text = processor.extract_text()
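
To give a sense of what the `threshold=True` step does, the sketch below implements textbook Otsu thresholding in pure Python on a flat list of grayscale values. It is an illustration of the technique only, not the skill's actual implementation; `otsu_threshold` and `binarize` are hypothetical names. With OpenCV the equivalent is usually `cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)`.

```python
from typing import List, Sequence


def otsu_threshold(pixels: Sequence[int]) -> int:
    """Return the 0-255 level that maximizes between-class variance."""
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_bg, sum_bg = 0, 0
    for level in range(256):
        weight_bg += histogram[level]
        if weight_bg == 0:
            continue  # no pixels at or below this level yet
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break  # every pixel is already in the background class
        sum_bg += level * histogram[level]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        variance = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, level
    return best_threshold


def binarize(pixels: Sequence[int]) -> List[int]:
    """Map pixels above the Otsu threshold to 255 and the rest to 0."""
    threshold = otsu_threshold(pixels)
    return [255 if value > threshold else 0 for value in pixels]


# Made-up sample: dark "ink" near 30, light "paper" near 220.
sample = [28, 30, 32, 31, 218, 220, 222, 225, 219, 221]
binary = binarize(sample)
print(binary)
```

Otsu picks the split automatically from the histogram, which is why it tends to suit scans with a clear ink/paper separation; unevenly lit photos are the usual case for the 'adaptive' method instead.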

Available Preprocessing Options

Option            Description                        Default
deskew            Correct skewed/rotated images      False
denoise           Remove noise and artifacts         False
threshold         Convert to black/white             False
threshold_method  'otsu', 'adaptive', 'simple'       'otsu'
contrast          Contrast factor (1.0 = no change)  1.0
sharpen           Sharpen factor (0 = none)          0
scale             Upscale factor for small text      1.0
remove_shadows    Remove shadow artifacts            False

Table Extraction

Extract tables from document

tables = processor.extract_tables()

Each table is a list of rows

for table in tables:
    for row in table:
        print(row)

Export tables to CSV

processor.export_tables_csv("tables/")

Export to JSON

processor.export_tables_json("tables.json")
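
Since each table is described as a list of rows, writing one out by hand is straightforward with the standard library. The following is a minimal sketch assuming every row is a list of strings; `table_to_csv` is a hypothetical helper and says nothing about how `export_tables_csv` is implemented.

```python
import csv
import io
from typing import List


def table_to_csv(table: List[List[str]]) -> str:
    """Serialize one table (a list of rows) to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(table)
    return buffer.getvalue()


# Made-up table; note the comma inside the first cell of row two.
sample_table = [
    ["Item", "Qty", "Price"],
    ["Widget, large", "2", "9.99"],
]
csv_text = table_to_csv(sample_table)
print(csv_text)
```

Using `csv.writer` rather than joining cells with commas keeps cells that contain commas or quotes intact, which is common in OCR output.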

PDF Processing

Multi-Page PDFs

Process all pages

processor = OCRProcessor("document.pdf")
full_text = processor.extract_text()

Process specific pages

page_3 = processor.extract_text(pages=[3])

Get per-page results

results = processor.extract_by_page()
for page_num, text in results.items():
    print(f"Page {page_num}: {len(text)} characters")

Create Searchable PDF

Convert scanned PDF to searchable PDF

processor = OCRProcessor("scanned.pdf")
processor.export_searchable_pdf("searchable.pdf")

Batch Processing

from scripts.ocr_processor import batch_ocr

Process directory of images

results = batch_ocr(
    input_dir="scans/",
    output_dir="extracted/",
    output_format="markdown",
    lang="eng",
    recursive=True
)

print(f"Processed: {results['success']} files")
print(f"Failed: {results['failed']} files")
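
The success/failed counts above suggest a loop that keeps going when a single file fails. A rough sketch of such a loop is given below, assuming a caller-supplied `process` callable that stands in for the OCR step; `run_batch`, `SUPPORTED_SUFFIXES`, and the exact suffix list are assumptions for illustration, not the behavior of `batch_ocr`.

```python
from pathlib import Path
from typing import Callable, Dict

# Assumed set of input types, based on the formats named in this document.
SUPPORTED_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp", ".pdf"}


def run_batch(
    input_dir: str,
    process: Callable[[Path], str],
    recursive: bool = True,
) -> Dict[str, int]:
    """Apply `process` to every supported file and count the outcomes."""
    pattern = "**/*" if recursive else "*"
    counts = {"success": 0, "failed": 0}
    for path in sorted(Path(input_dir).glob(pattern)):
        if not path.is_file() or path.suffix.lower() not in SUPPORTED_SUFFIXES:
            continue
        try:
            process(path)
            counts["success"] += 1
        except Exception:
            # One unreadable file should not abort the whole batch.
            counts["failed"] += 1
    return counts
```

In real use, `process` might wrap something like `OCRProcessor(str(path)).extract_text()`; logging the exception alongside the count would make failures easier to diagnose.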

Receipt/Document Parsing

Receipt Extraction

Parse receipt structure

processor = OCRProcessor("receipt.jpg")
receipt_data = processor.parse_receipt()

Returns structured data:

- vendor: Store name

- date: Transaction date

- items: List of items with prices

- subtotal: Subtotal amount

- tax: Tax amount

- total: Total amount
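
One plausible way to derive the amount fields above from raw OCR text is a labeled-line regular expression. The sketch below is an assumption about the general approach, not the logic inside `parse_receipt()`; `find_amount`, `parse_totals`, and the sample receipt text are all made up for illustration.

```python
import re
from typing import Dict, Optional

# An amount such as 13.50, 1,234.00 or $ 13.50.
AMOUNT = r"\$?\s*(\d+(?:,\d{3})*\.\d{2})"


def find_amount(label: str, text: str) -> Optional[float]:
    """Return the first amount on a line that starts with `label`, if any."""
    pattern = rf"^\s*{re.escape(label)}\b[^\d$\n]*{AMOUNT}"
    match = re.search(pattern, text, re.IGNORECASE | re.MULTILINE)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))


def parse_totals(text: str) -> Dict[str, Optional[float]]:
    """Collect subtotal, tax and total from receipt text."""
    return {key: find_amount(key, text) for key in ("subtotal", "tax", "total")}


ocr_text = """ACME MARKET
Coffee beans   12.50
Subtotal       12.50
Tax             1.00
TOTAL         $13.50
"""
totals = parse_totals(ocr_text)
print(totals)
```

Anchoring the label at the start of a line keeps "total" from matching inside "Subtotal". Real receipts vary widely, so a sanity check such as `subtotal + tax == total` is a sensible follow-up.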

Business Card Parsing

Extract business card info

processor = OCRProcessor("card.jpg")
contact = processor.parse_business_card()

Returns:

- name: Person's name

- title: Job title

- company: Company name

- email: Email addresses

- phone: Phone numbers

- address: Physical address

- website: Website URLs
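
Of the fields above, email, phone, and website have recognizable patterns and can be approximated with regular expressions. The following sketch assumes plain OCR text as input; the patterns are deliberately simple, `extract_contact_fields` is a hypothetical helper, and none of this is confirmed to match `parse_business_card()`.

```python
import re
from typing import Dict, List

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Digits with optional spaces, parentheses, dots or dashes, on a single line.
PHONE_RE = re.compile(r"\+?\d[\d ().-]{7,}\d")
URL_RE = re.compile(r"(?:https?://|www\.)[^\s,]+", re.IGNORECASE)


def extract_contact_fields(text: str) -> Dict[str, List[str]]:
    """Pull pattern-based contact fields out of business card text."""
    return {
        "email": EMAIL_RE.findall(text),
        "phone": PHONE_RE.findall(text),
        "website": URL_RE.findall(text),
    }


# Made-up card text for illustration.
card_text = """Jane Doe
Senior Engineer, Example Corp
jane.doe@example.com
+1 (555) 010-2030
www.example.com
"""
fields = extract_contact_fields(card_text)
print(fields)
```

Fields such as name, title, and company have no fixed pattern and generally need layout cues (position, font size) or a named-entity model rather than regular expressions.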

Configuration

processor = OCRProcessor("document.png")

Configure OCR settings

processor.config.update({
    'psm': 3,              # Page segmentation mode
    'oem': 3,              # OCR engine mode
    'dpi': 300,            # DPI for processing
    'timeout': 30,         # Timeout in seconds
    'min_confidence': 60,  # Minimum word confidence
})
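
Settings like these typically end up as Tesseract command-line flags. The sketch below shows one way a config dict of this shape might be validated and turned into a flag string; `to_tesseract_args` is a hypothetical helper, and the mapping is an assumption about how the skill could work internally. The flag names `--psm`, `--oem`, and `--dpi` and their valid ranges are those of the Tesseract CLI.

```python
from typing import Dict, Tuple

VALID_PSM = range(0, 14)  # Tesseract defines page segmentation modes 0-13
VALID_OEM = range(0, 4)   # Tesseract defines OCR engine modes 0-3


def to_tesseract_args(config: Dict[str, int]) -> Tuple[str, int]:
    """Return (flag string, timeout in seconds) for a config dict."""
    psm = config.get("psm", 3)
    oem = config.get("oem", 3)
    dpi = config.get("dpi", 300)
    if psm not in VALID_PSM:
        raise ValueError(f"psm must be 0-13, got {psm}")
    if oem not in VALID_OEM:
        raise ValueError(f"oem must be 0-3, got {oem}")
    if dpi <= 0:
        raise ValueError(f"dpi must be positive, got {dpi}")
    return f"--psm {psm} --oem {oem} --dpi {dpi}", config.get("timeout", 30)


flags, timeout = to_tesseract_args({"psm": 6, "oem": 3, "dpi": 300, "timeout": 30})
print(flags, timeout)
```

With pytesseract, a string like this would usually be passed as the `config` argument of `pytesseract.image_to_string`, which also accepts `timeout`. `min_confidence` has no CLI flag and would have to be applied when filtering results.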

Page Segmentation Modes (PSM)

Mode  Description
0     Orientation and script detection only
1     Automatic page segmentation with OSD
3     Fully automatic page segmentation (default)
4     Assume single column of text
6     Assume single uniform block of text
7     Treat image as single text line
8     Treat image as single word
11    Sparse text. Find as much text as possible
12    Sparse text with OSD

Quality Assessment

Get confidence scores

result = processor.extract_structured()

Overall confidence (0-100)

print(f"Confidence: {result['confidence']}%")

Per-word confidence

for word in result['words']:
    print(f"{word['text']}: {word['confidence']}%")

Filter low-confidence words

high_conf_words = [w for w in result['words'] if w['confidence'] > 80]
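
Beyond filtering individual words, the per-word scores can be summarized to decide whether a page needs another pass (for example, with preprocessing enabled). This is a minimal sketch assuming each word is a dict with a numeric `confidence` on a 0-100 scale, as in the examples above; `confidence_report` and the sample words are hypothetical.

```python
from statistics import mean
from typing import Dict, List


def confidence_report(words: List[Dict], min_confidence: float = 60.0) -> Dict[str, float]:
    """Summarize word confidences as a mean and a low-confidence ratio."""
    scores = [word["confidence"] for word in words]
    if not scores:
        return {"mean": 0.0, "low_ratio": 0.0}
    low_count = sum(1 for score in scores if score < min_confidence)
    return {"mean": mean(scores), "low_ratio": low_count / len(scores)}


# Made-up words; "N0." stands in for a misread "No.".
sample_words = [
    {"text": "Invoice", "confidence": 96},
    {"text": "N0.", "confidence": 41},
    {"text": "1042", "confidence": 88},
    {"text": "Total", "confidence": 95},
]
report = confidence_report(sample_words)
print(report)

# Example policy (an assumption): retry when over 20% of words are doubtful.
needs_retry = report["low_ratio"] > 0.2
```

The ratio is often more informative than the mean alone, since a few badly misread words can hide behind an otherwise high average.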

Output Formats

Markdown Export

processor.export_markdown("output.md")

Output includes:

- Document title (if detected)
- Structured headings
- Paragraphs
- Tables (as Markdown tables)
- Page breaks for multi-page docs

JSON Export

processor.export_json("output.json")

Output structure:

{
  "source": "document.pdf",
  "pages": 5,
  "language": "eng",
  "confidence": 92.5,
  "text": "Full extracted text...",
  "blocks": [
    {
      "type": "paragraph",
      "text": "Block text...",
      "bbox": [x, y, width, height],
      "confidence": 95.2
    }
  ],
  "tables": [...]
}
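
A consumer of this export might load it into typed objects before further processing. The sketch below assumes the field names shown in the structure above and uses made-up numeric `bbox` values in place of the `x, y, width, height` placeholders; `Block` and `load_blocks` are hypothetical names.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    type: str
    text: str
    bbox: List[float]  # [x, y, width, height]
    confidence: float


def load_blocks(raw: str) -> List[Block]:
    """Parse exported JSON text and return its blocks as dataclasses."""
    document = json.loads(raw)
    return [
        Block(
            type=item["type"],
            text=item["text"],
            bbox=list(item["bbox"]),
            confidence=float(item["confidence"]),
        )
        for item in document.get("blocks", [])
    ]


# Made-up export content following the structure above.
raw_export = json.dumps({
    "source": "document.pdf",
    "pages": 5,
    "language": "eng",
    "confidence": 92.5,
    "text": "Full extracted text...",
    "blocks": [
        {"type": "paragraph", "text": "Block text...",
         "bbox": [40, 60, 500, 80], "confidence": 95.2},
    ],
    "tables": [],
})
blocks = load_blocks(raw_export)
print(blocks)
```

Picking fields explicitly, rather than unpacking each dict into the dataclass, means extra keys in a future export version would not raise an error.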

HTML Export

processor.export_html("output.html")

Creates styled HTML with:

- Preserved layout approximation
- Highlighted low-confidence regions
- Embedded images (optional)
- Print-friendly styling

CLI Usage

Basic extraction

python ocr_processor.py image.png -o output.txt

Extract to markdown

python ocr_processor.py document.pdf -o output.md --format markdown

Specify language

python ocr_processor.py german.png --lang deu

Batch processing

python ocr_processor.py scans/ -o extracted/ --batch

With preprocessing

python ocr_processor.py noisy.png --preprocess --deskew --denoise
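
For orientation, the flags used in the commands above could be declared with `argparse` roughly as follows. This is a sketch reconstructed from those example commands only: the long form `--output`, the default values, and the list of `--format` choices are assumptions, and the real script may define more options or different defaults.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the options seen in the CLI examples (defaults are assumed)."""
    parser = argparse.ArgumentParser(prog="ocr_processor.py")
    parser.add_argument("input", help="image, PDF, or directory (with --batch)")
    parser.add_argument("-o", "--output", help="output file or directory")
    parser.add_argument(
        "--format",
        default="text",
        choices=["text", "markdown", "json", "html"],
        help="output format (assumed choices)",
    )
    parser.add_argument("--lang", default="eng", help="language code(s), e.g. deu")
    parser.add_argument("--batch", action="store_true", help="process a directory")
    parser.add_argument("--preprocess", action="store_true", help="enable preprocessing")
    parser.add_argument("--deskew", action="store_true", help="correct rotation")
    parser.add_argument("--denoise", action="store_true", help="remove noise")
    return parser


args = build_parser().parse_args(["noisy.png", "--preprocess", "--deskew", "--denoise"])
print(args.input, args.preprocess, args.deskew, args.denoise, args.lang)
```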

Error Handling

from scripts.ocr_processor import OCRProcessor, OCRError

try:
    processor = OCRProcessor("document.png")
    text = processor.extract_text()
except OCRError as e:
    print(f"OCR failed: {e}")
except FileNotFoundError:
    print("File not found")

Performance Tips

- Image Quality: Higher resolution (300+ DPI) improves accuracy
- Preprocessing: Use for low-quality scans
- Language: Specifying language improves speed and accuracy
- PSM Mode: Choose appropriate mode for document type
- Large Files: Process PDFs page by page for memory efficiency

Limitations

- Handwritten text: Limited accuracy
- Complex layouts: May lose structure
- Very low quality: Preprocessing helps but has limits
- Non-Latin scripts: Require specific language packs

Dependencies

pytesseract>=0.3.10
Pillow>=10.0.0
PyMuPDF>=1.23.0
opencv-python>=4.8.0
numpy>=1.24.0

System Requirements

- Tesseract OCR engine must be installed
- Language data files for non-English languages
