Work with office documents: PDF, Excel, Word, and PowerPoint.

Format Overview

| Format | Extension | Structure | Typical use |
|--------|-----------|-----------|-------------|
| PDF | .pdf | Binary/text | Reports, forms, archives |
| Excel | .xlsx | XML in ZIP | Data, calculations, models |
| Word | .docx | XML in ZIP | Text documents, contracts |
| PowerPoint | .pptx | XML in ZIP | Presentations, slides |

Key concept: XLSX, DOCX, and PPTX are all ZIP archives containing XML files. You can unzip them to access raw content.
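As a minimal sketch of that idea, the standard-library `zipfile` module can list and read the parts of any of these packages without extracting them to disk. The helper names `list_ooxml_parts` and `read_part` are illustrative, not part of any library.

```python
import zipfile


def list_ooxml_parts(path):
    """Return the names of all parts (files) inside an OOXML package."""
    with zipfile.ZipFile(path) as package:
        return package.namelist()


def read_part(path, part_name):
    """Return the raw bytes of one part, e.g. 'word/document.xml'."""
    with zipfile.ZipFile(path) as package:
        return package.read(part_name)
```

Both helpers accept a file path or a binary file-like object, since `zipfile.ZipFile` accepts either.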

PDF Processing

PDF Tools

| Task | Tool |
|------|------|
| Basic read/write | pypdf |
| Text extraction | pdfplumber |
| Table extraction | pdfplumber |
| Create PDFs | reportlab |
| OCR scanned PDFs | pytesseract + pdf2image |
| Command line | qpdf, pdftotext |

Common Operations

| Operation | Approach |
|-----------|----------|
| Merge | Loop through files, add pages to writer |
| Split | Create new writer per page |
| Extract tables | Use pdfplumber, convert to DataFrame |
| Rotate | Call .rotate(degrees) on page |
| Encrypt | Use writer's .encrypt() method |
| OCR | Convert to images, run pytesseract |

Excel Processing

Excel Tools

| Task | Tool |
|------|------|
| Data analysis | pandas |
| Formulas & formatting | openpyxl |
| Simple CSV | pandas |
| Financial models | openpyxl |

Critical Rule: Use Formulas

| Approach | Result |
|----------|--------|
| Wrong: Calculate in Python, write value | Static number, breaks when data changes |
| Right: Write Excel formula | Dynamic, recalculates automatically |
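A short sketch of the "right" approach with openpyxl: the total is written as a formula string, so the spreadsheet application recalculates it whenever the inputs change. The sheet layout and the name `build_sales_sheet` are assumptions made for illustration, and `rows` is assumed to be non-empty.

```python
from openpyxl import Workbook


def build_sales_sheet(rows):
    """Build a workbook whose total is an Excel formula, not a Python sum."""
    wb = Workbook()
    ws = wb.active
    ws.title = "Sales"
    ws.append(["Item", "Amount"])
    for item, amount in rows:
        ws.append([item, amount])

    last_data_row = ws.max_row
    total_row = last_data_row + 1
    ws.cell(row=total_row, column=1, value="Total")
    # Write the formula text; do not compute sum(...) in Python.
    ws.cell(row=total_row, column=2, value=f"=SUM(B2:B{last_data_row})")
    return wb
```

openpyxl stores the formula but does not evaluate it. The value appears once the file is opened (or recalculated) in Excel or LibreOffice.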

Financial Model Standards

| Convention | Meaning |
|------------|---------|
| Blue text | Hardcoded inputs |
| Black text | Formulas |
| Green text | Links to other sheets |
| Yellow fill | Needs attention |
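These color conventions can be applied programmatically. The sketch below uses openpyxl styles. The exact RGB values and the rule "a formula containing `!` is a cross-sheet link" are simplifying assumptions, and `style_cell` and `flag` are hypothetical helper names.

```python
from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill

INPUT_FONT = Font(color="0000FF")    # blue: hardcoded inputs
FORMULA_FONT = Font(color="000000")  # black: formulas
LINK_FONT = Font(color="008000")     # green: links to other sheets
ATTENTION_FILL = PatternFill(fill_type="solid", start_color="FFFF00")


def style_cell(cell):
    """Color a cell's text according to what kind of value it holds."""
    value = cell.value
    if isinstance(value, str) and value.startswith("="):
        # Heuristic: a sheet reference such as =Inputs!B2 contains "!".
        cell.font = LINK_FONT if "!" in value else FORMULA_FONT
    elif isinstance(value, (int, float)):
        cell.font = INPUT_FONT
    return cell


def flag(cell):
    """Mark a cell as needing attention with a yellow fill."""
    cell.fill = ATTENTION_FILL
    return cell
```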

Common Formula Errors

| Error | Cause |
|-------|-------|
| #REF! | Invalid cell reference |
| #DIV/0! | Division by zero |
| #VALUE! | Wrong data type |
| #NAME? | Unknown function name |
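A workbook can be scanned for these error values before it is delivered. This is a sketch under one important assumption: the file was last saved by a spreadsheet application and is loaded with `load_workbook(path, data_only=True)`, so cells hold cached results rather than formula text. The function name `find_formula_errors` is illustrative.

```python
from openpyxl import Workbook

ERROR_VALUES = {"#REF!", "#DIV/0!", "#VALUE!", "#NAME?"}


def find_formula_errors(workbook):
    """Return (sheet, coordinate, error) for every cell holding an error value."""
    found = []
    for ws in workbook.worksheets:
        for row in ws.iter_rows():
            for cell in row:
                if isinstance(cell.value, str) and cell.value in ERROR_VALUES:
                    found.append((ws.title, cell.coordinate, cell.value))
    return found
```

A file written only by openpyxl has no cached results, so this check finds nothing there until the workbook has been recalculated and saved by Excel or LibreOffice.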

Word Processing

Word Tools

| Task | Tool |
|------|------|
| Text extraction | pandoc |
| Create new | python-docx or docx-js |
| Simple edits | python-docx |
| Tracked changes | Direct XML editing |

Document Structure

| Path | Contents |
|------|----------|
| word/document.xml | Main content |
| word/comments.xml | Comments |
| word/media/ | Images |
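Because the main content lives in `word/document.xml`, paragraph text can be read with the standard library alone. This is a minimal sketch: it ignores tabs, line breaks, and deleted (tracked) text, and the name `docx_paragraphs` is illustrative.

```python
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(path):
    """Return the visible text of each <w:p> paragraph in a .docx file."""
    with zipfile.ZipFile(path) as package:
        root = ET.fromstring(package.read("word/document.xml"))

    paragraphs = []
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is split across runs; join every <w:t> node.
        pieces = (node.text or "" for node in paragraph.iter(f"{{{W_NS}}}t"))
        paragraphs.append("".join(pieces))
    return paragraphs
```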

Tracked Changes (Redlining)

| Change | XML |
|--------|-----|
| Deletion | `<w:del><w:r><w:delText>...</w:delText></w:r></w:del>` |
| Insertion | `<w:ins><w:r><w:t>...</w:t></w:r></w:ins>` |

Key concept: For professional/legal documents, use tracked changes XML rather than replacing text directly.

PowerPoint Processing

PowerPoint Tools

| Task | Tool |
|------|------|
| Text extraction | markitdown |
| Create new | pptxgenjs (JS) or python-pptx |
| Edit existing | Direct XML or python-pptx |

Slide Structure

| Path | Contents |
|------|----------|
| ppt/slides/slide{N}.xml | Slide content |
| ppt/notesSlides/ | Speaker notes |
| ppt/slideMasters/ | Master templates |
| ppt/media/ | Images |
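Given that layout, slide text can be pulled straight from the `ppt/slides/slide{N}.xml` parts. This is a standard-library sketch. The name `slide_texts` is illustrative, and it orders slides by the number in the file name, which is an assumption: the displayed order is defined in `ppt/presentation.xml` and can differ.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
SLIDE_RE = re.compile(r"^ppt/slides/slide(\d+)\.xml$")


def slide_texts(path):
    """Map slide file number -> list of text runs found on that slide."""
    with zipfile.ZipFile(path) as package:
        numbered = []
        for name in package.namelist():
            match = SLIDE_RE.match(name)
            if match:
                numbered.append((int(match.group(1)), name))

        result = {}
        # Sort numerically so slide10 comes after slide2.
        for number, name in sorted(numbered):
            root = ET.fromstring(package.read(name))
            result[number] = [
                node.text for node in root.iter(f"{{{A_NS}}}t") if node.text
            ]
    return result
```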

Design Principles

| Aspect | Guideline |
|--------|-----------|
| Fonts | Use web-safe: Arial, Helvetica, Georgia |
| Layout | Two-column preferred, avoid vertical stacking |
| Hierarchy | Size, weight, color for emphasis |
| Consistency | Repeat patterns across slides |

Converting Between Formats

| Conversion | Tool |
|------------|------|
| Any → PDF | LibreOffice headless |
| PDF → Images | pdftoppm |
| DOCX → Markdown | pandoc |
| Any → Text | Appropriate extractor |
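The first three conversions can be driven from Python by shelling out to the command-line tools. This sketch only builds the argument lists and offers a thin `run` wrapper. It assumes the LibreOffice binary is available as `soffice` (on some systems it is `libreoffice`), and the 150 dpi default is an arbitrary choice.

```python
import subprocess


def to_pdf_command(source, outdir):
    """LibreOffice headless: convert an office document to PDF in outdir."""
    return ["soffice", "--headless", "--convert-to", "pdf",
            "--outdir", outdir, source]


def pdf_to_png_command(source, prefix, dpi=150):
    """pdftoppm: render each page to <prefix>-<page>.png at the given DPI."""
    return ["pdftoppm", "-png", "-r", str(dpi), source, prefix]


def docx_to_markdown_command(source, target):
    """pandoc: convert a .docx file to Markdown."""
    return ["pandoc", source, "-t", "markdown", "-o", target]


def run(command):
    """Run a command; raise CalledProcessError if the tool exits non-zero."""
    return subprocess.run(command, check=True, capture_output=True, text=True)
```

A call such as `run(to_pdf_command("report.docx", "out"))` requires the tool to be installed and on `PATH`.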

Best Practices

| Practice | Why |
|----------|-----|
| Use formulas in Excel | Dynamic calculations |
| Preserve formatting on edit | Don't lose styles |
| Test output opens correctly | Catch corruption early |
| Use tracked changes for contracts | Audit trail |
| Extract to markdown for analysis | Easier to process |
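For the "test output opens correctly" practice, a cheap first check on any .xlsx, .docx, or .pptx output is that it is still a readable ZIP containing `[Content_Types].xml`. This is a sketch with an illustrative function name. Passing it does not guarantee the file opens in Office, only that the package is not grossly corrupt.

```python
import zipfile


def check_ooxml_package(path):
    """Return a list of problems found; an empty list means basic checks pass."""
    if not zipfile.is_zipfile(path):
        return ["not a ZIP archive"]

    problems = []
    with zipfile.ZipFile(path) as package:
        # testzip() returns the first member with a bad CRC, or None.
        bad_member = package.testzip()
        if bad_member is not None:
            problems.append(f"corrupt member: {bad_member}")
        if "[Content_Types].xml" not in package.namelist():
            problems.append("missing [Content_Types].xml")
    return problems
```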

Common Packages

| Ecosystem | Packages |
|-----------|----------|
| Python | pypdf, pdfplumber, openpyxl, python-docx, python-pptx |
| JavaScript | docx, pptxgenjs |
| CLI | pandoc, qpdf, pdftotext, libreoffice |
