MinerU Extract (official API) Use MinerU as an upstream “content normalizer”: submit a URL to MinerU, poll for completion, download the result zip, and extract the main Markdown. Quick start (MCP-aligned) We align to the MinerU MCP mental model, but we do not run an MCP server. Primary script (MCP-style): scripts/mineru_parse_documents.py Input: --file-sources (comma/newline-separated) Output: JSON contract on stdout: { ok, items, errors } Low-level script (single URL): scripts/mineru_extract.py Auth: Set MINERU_TOKEN (Bearer token from mineru.net) Default model heuristic: URLs ending with .pdf/.doc/.ppt/.png/.jpg → pipeline Otherwise → MinerU-HTML (best for HTML pages like WeChat articles) 1) Configure token (skill-local) Put secrets in skill root .env (do not paste into chat outputs):
In the mineru-extract skill directory: .env
MINERU_TOKEN
- your_token_here
- MINERU_API_BASE
- =
- https://mineru.net
- 2) Parse URL(s) → Markdown (recommended)
- MCP-style wrapper (returns JSON, optionally includes markdown text):
- python3 mineru-extract/scripts/mineru_parse_documents.py
- \
- --file-sources
- "
- \n
" - \
- --language
- ch
- \
- --enable-ocr
- \
- --model-version MinerU-HTML
- If you want the markdown content inline in the JSON (can be large):
- python3 mineru-extract/scripts/mineru_parse_documents.py
- \
- --file-sources
- "
" - \
- --model-version MinerU-HTML
- \
- --emit-markdown --max-chars
- 20000
- Low-level (single URL, print markdown to stdout):
- python3 mineru-extract/scripts/mineru_extract.py
- "
" - --model
- MinerU-HTML
- >
- /tmp/out.md
- Output
- The script always downloads + extracts the MinerU result zip to:
- ~/.openclaw/workspace/mineru/
/ - It writes:
- result.zip
- extracted files (Markdown + JSON + assets)
- It prints a JSON summary to
- stderr
- with paths:
- task_id
- ,
- full_zip_url
- ,
- out_dir
- ,
- markdown_path
- Parameters (common)
- --model
- :
- pipeline | vlm | MinerU-HTML
- (HTML requires
- MinerU-HTML
- )
- --ocr/--no-ocr
-
- enable OCR (effective for
- pipeline
- /
- vlm
- )
- --table/--no-table
-
- table recognition
- --formula/--no-formula
- formula recognition --language ch|en|... --page-ranges "2,4-6" (non-HTML) --timeout 600 / --poll-interval 2 Failure modes & fallbacks MinerU may fail to fetch some URLs (anti-bot / geo / login). Fallback: provide an HTML file or a PDF/long screenshot; then implement “upload + parse” flow with MinerU batch upload endpoints. Always report the failing URL + MinerU err_msg and keep an original-source link in outputs. References MinerU API docs: https://mineru.net/apiManage/docs MinerU output files: https://opendatalab.github.io/MinerU/reference/output_files/