liteparse

Installs: 1.3K
Rank: #3514

Install

npx skills add https://github.com/run-llama/llamaparse-agent-skills --skill liteparse

LiteParse Skill

Parse unstructured documents (PDF, DOCX, PPTX, XLSX, images, and more) locally with LiteParse: fast, lightweight, no cloud dependencies or LLM required.

Initial Setup

When this skill is invoked, respond with:

I'm ready to use LiteParse to parse files locally. Before we begin, please confirm that:

- @llamaindex/liteparse is installed globally (npm i -g @llamaindex/liteparse)
- The lit CLI command is available in your terminal

If both are set, please provide:

1. One or more files to parse (PDF, DOCX, PPTX, XLSX, images, etc.)
2. Any specific options: output format (json/text), page ranges, OCR preferences, DPI, etc.
3. What you'd like to do with the parsed content.

I will produce the appropriate lit CLI command or TypeScript script, and once approved, report the results.

Then wait for the user's input.

Step 0: Install LiteParse (if needed)

If liteparse is not yet installed, install it globally:

npm i -g @llamaindex/liteparse

Verify installation:

lit --version

For Office document support (DOCX, PPTX, XLSX), LibreOffice is required:

macOS

brew install --cask libreoffice

Ubuntu/Debian

apt-get install libreoffice

For image parsing, ImageMagick is required:

macOS

brew install imagemagick

Ubuntu/Debian

apt-get install imagemagick

Step 1: Produce the CLI Command or Script

Parse a Single File

Basic text extraction

lit parse document.pdf

JSON output saved to a file

lit parse document.pdf --format json -o output.json

Specific page range

lit parse document.pdf --target-pages "1-5,10,15-20"
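A page specification such as the one above mixes single pages and ranges. As a rough illustration of how a script might check such a string before handing it to the CLI, here is a minimal TypeScript sketch. The helper `expandPageSpec` is hypothetical and not part of LiteParse, and it assumes pages are 1-based and ranges are inclusive, which the text here does not confirm.

```typescript
// Hypothetical helper (not a LiteParse API): expands a page specification
// like "1-5,10,15-20" into a sorted list of page numbers without duplicates.
// Assumption: pages are 1-based and ranges include both endpoints.
function expandPageSpec(spec: string): number[] {
  const pages: number[] = [];
  for (const part of spec.split(",")) {
    const token = part.trim();
    if (token === "") continue; // tolerate stray commas

    // Accept either a single page ("10") or a range ("15-20").
    const match = /^(\d+)(?:-(\d+))?$/.exec(token);
    if (!match) throw new Error('Invalid page token: "' + token + '"');

    const start = Number(match[1]);
    const end = match[2] !== undefined ? Number(match[2]) : start;
    if (start < 1 || end < start) {
      throw new Error('Invalid page range: "' + token + '"');
    }

    for (let page = start; page <= end; page++) {
      if (pages.indexOf(page) === -1) pages.push(page);
    }
  }
  return pages.sort((a, b) => a - b);
}
```

A script could call `expandPageSpec` on user input first and pass the original string to `--target-pages` only if no error is thrown.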

Disable OCR (faster, text-only PDFs)

lit parse document.pdf --no-ocr

Use an external HTTP OCR server for higher accuracy

lit parse document.pdf --ocr-server-url http://localhost:8828/ocr

Higher DPI for better quality

lit parse document.pdf --dpi 300

Batch Parse a Directory

lit batch-parse ./input-directory ./output-directory

Only process PDFs, recursively

lit batch-parse ./input ./output --extension .pdf --recursive

Generate Page Screenshots

Screenshots are useful for LLM agents that need to see visual layout.

All pages

lit screenshot document.pdf -o ./screenshots

Specific pages

lit screenshot document.pdf --pages "1,3,5" -o ./screenshots

High-DPI PNG

lit screenshot document.pdf --dpi 300 --format png -o ./screenshots

Page range

lit screenshot document.pdf --pages "1-10" -o ./screenshots

Step 3: Key Options Reference

OCR Options

| Option | Description |
| --- | --- |
| (default) | Tesseract.js: zero setup, built-in |
| --ocr-language fra | Set OCR language (ISO code) |
| --ocr-server-url | Use external HTTP OCR server (EasyOCR, PaddleOCR, custom) |
| --no-ocr | Disable OCR entirely |

Output Options

| Option | Description |
| --- | --- |
| --format json | Structured JSON with bounding boxes |
| --format text | Plain text (default) |
| -o | Save output to file |

Performance / Quality Options

| Option | Description |
| --- | --- |
| --dpi | Rendering DPI (default: 150; use 300 for high quality) |
| --max-pages | Limit pages parsed |
| --target-pages | Parse specific pages (e.g. "1-5,10") |
| --no-precise-bbox | Disable precise bounding boxes (faster) |
| --skip-diagonal-text | Ignore rotated/diagonal text |
| --preserve-small-text | Keep very small text that would otherwise be dropped |

Step 4: Using a Config File

For repeated use with consistent options, generate a liteparse.config.json:

{
  "ocrLanguage": "en",
  "ocrEnabled": true,
  "maxPages": 1000,
  "dpi": 150,
  "outputFormat": "json",
  "preciseBoundingBox": true,
  "skipDiagonalText": false,
  "preserveVerySmallText": false
}

For an HTTP OCR server:

{
  "ocrServerUrl": "http://localhost:8828/ocr",
  "ocrLanguage": "en",
  "outputFormat": "json"
}

Use with:

lit parse document.pdf --config liteparse.config.json

Step 5: HTTP OCR Server API (Advanced)

If the user wants to plug in a custom OCR backend, the server must implement:

- Endpoint: POST /ocr
- Accepts: file (multipart) and language (string) parameters
- Returns: { "results": [ { "text": "Hello", "bbox": [x1, y1, x2, y2], "confidence": 0.98 } ] }

Ready-to-use wrappers exist for EasyOCR and PaddleOCR in the LiteParse repo.
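When building or testing a custom OCR backend, it can help to check that its responses match the shape described for POST /ocr. The following is a minimal TypeScript sketch of such a check. The type and function names (`OcrResult`, `OcrResponse`, `isOcrResponse`) are illustrative assumptions rather than LiteParse exports, and the check covers only the three fields shown in the response format; it does not assume any particular range for `confidence` or ordering of the `bbox` coordinates.

```typescript
// Illustrative types only (not LiteParse exports), modeled on the
// response format: { results: [{ text, bbox: [x1, y1, x2, y2], confidence }] }.
interface OcrResult {
  text: string;
  bbox: [number, number, number, number];
  confidence: number;
}

interface OcrResponse {
  results: OcrResult[];
}

function isFiniteNumber(value: unknown): value is number {
  return typeof value === "number" && isFinite(value);
}

// Checks a single entry of the "results" array.
function isOcrResult(value: unknown): value is OcrResult {
  if (typeof value !== "object" || value === null) return false;
  const item = value as { text?: unknown; bbox?: unknown; confidence?: unknown };
  if (typeof item.text !== "string") return false;
  if (!isFiniteNumber(item.confidence)) return false;
  if (!Array.isArray(item.bbox) || item.bbox.length !== 4) return false;
  for (const coord of item.bbox) {
    if (!isFiniteNumber(coord)) return false;
  }
  return true;
}

// Checks the whole response body after JSON parsing.
function isOcrResponse(value: unknown): value is OcrResponse {
  if (typeof value !== "object" || value === null) return false;
  const results = (value as { results?: unknown }).results;
  if (!Array.isArray(results)) return false;
  for (const entry of results) {
    if (!isOcrResult(entry)) return false;
  }
  return true;
}
```

A test for a custom server could parse the response body with `JSON.parse` and fail early when `isOcrResponse` returns false, before pointing `--ocr-server-url` at it.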
Supported Input Formats

| Category | Formats |
| --- | --- |
| PDF | .pdf |
| Word | .doc, .docx, .docm, .odt, .rtf |
| PowerPoint | .ppt, .pptx, .pptm, .odp |
| Spreadsheets | .xls, .xlsx, .xlsm, .ods, .csv, .tsv |
| Images | .jpg, .jpeg, .png, .gif, .bmp, .tiff, .webp, .svg |

Office documents require LibreOffice; images require ImageMagick. LiteParse auto-converts these formats to PDF before parsing.
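Because the external tool needed depends on the file type, a script that prepares a parsing run may want to check dependencies up front. The sketch below maps a file name to the converter it would need, using only the extensions listed above. The function `requiredConverter` and the `Converter` type are hypothetical names, not LiteParse APIs, and treating .csv and .tsv as LibreOffice-converted is an assumption based on their listing under Spreadsheets.

```typescript
// Hypothetical lookup (not a LiteParse API) built from the format list above.
type Converter = "none" | "libreoffice" | "imagemagick";

// Word, PowerPoint, and spreadsheet extensions.
// Assumption: .csv and .tsv are converted via LibreOffice like other spreadsheets.
const OFFICE_EXTENSIONS: string[] = [
  ".doc", ".docx", ".docm", ".odt", ".rtf",
  ".ppt", ".pptx", ".pptm", ".odp",
  ".xls", ".xlsx", ".xlsm", ".ods", ".csv", ".tsv",
];

const IMAGE_EXTENSIONS: string[] = [
  ".jpg", ".jpeg", ".png", ".gif", ".bmp", ".tiff", ".webp", ".svg",
];

// Returns the external tool a file would need, "none" for PDFs,
// or null when the extension is not in the supported list.
function requiredConverter(fileName: string): Converter | null {
  const dot = fileName.lastIndexOf(".");
  if (dot === -1) return null;
  const ext = fileName.slice(dot).toLowerCase();
  if (ext === ".pdf") return "none";
  if (OFFICE_EXTENSIONS.indexOf(ext) !== -1) return "libreoffice";
  if (IMAGE_EXTENSIONS.indexOf(ext) !== -1) return "imagemagick";
  return null;
}
```

For example, a script could call `requiredConverter` on each input file and prompt the user to install LibreOffice or ImageMagick only when a file actually needs it.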
