Work with office documents: PDF, Excel, Word, and PowerPoint.
Format Overview
| PDF | .pdf | Binary/text | Reports, forms, archives
| Excel | .xlsx | XML in ZIP | Data, calculations, models
| Word | .docx | XML in ZIP | Text documents, contracts
| PowerPoint | .pptx | XML in ZIP | Presentations, slides
Key concept: XLSX, DOCX, and PPTX are all ZIP archives containing XML files. You can unzip them to access raw content.
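A minimal sketch of that idea, using only the standard-library zipfile module. The helper names list_parts and read_part are hypothetical, chosen for illustration.

```python
import zipfile


def list_parts(path):
    """Return the names of the files stored inside an OOXML archive."""
    with zipfile.ZipFile(path) as archive:
        return archive.namelist()


def read_part(path, part_name):
    """Return one part (for example word/document.xml) as decoded text."""
    with zipfile.ZipFile(path) as archive:
        return archive.read(part_name).decode("utf-8")
```

Both helpers accept a file path or an open binary file object, so they work the same way on .xlsx, .docx, and .pptx files.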
PDF Processing
PDF Tools
| Basic read/write | pypdf
| Text extraction | pdfplumber
| Table extraction | pdfplumber
| Create PDFs | reportlab
| OCR scanned PDFs | pytesseract + pdf2image
| Command line | qpdf, pdftotext
Common Operations
| Merge | Loop through files, add pages to writer
| Split | Create new writer per page
| Extract tables | Use pdfplumber, convert to DataFrame
| Rotate | Call .rotate(degrees) on page
| Encrypt | Use writer's .encrypt() method
| OCR | Convert to images, run pytesseract
Excel Processing
Excel Tools
| Data analysis | pandas
| Formulas & formatting | openpyxl
| Simple CSV | pandas
| Financial models | openpyxl
Critical Rule: Use Formulas
| Wrong: Calculate in Python, write value | Static number, breaks when data changes
| Right: Write Excel formula | Dynamic, recalculates automatically
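A short sketch of the rule with openpyxl. The function name build_sales_sheet and the single-column layout are assumptions made for the example.

```python
from openpyxl import Workbook


def build_sales_sheet(amounts):
    """Write the inputs, then a SUM formula instead of a precomputed total."""
    workbook = Workbook()
    sheet = workbook.active
    sheet["A1"] = "Amount"
    for offset, amount in enumerate(amounts):
        sheet.cell(row=2 + offset, column=1, value=amount)
    last_row = 1 + len(amounts)
    total_row = last_row + 1
    # Right: the cell stores the formula text, so Excel recalculates it.
    sheet.cell(row=total_row, column=1, value=f"=SUM(A2:A{last_row})")
    # Wrong (for comparison): value=sum(amounts) would store a static number.
    return workbook
```

build_sales_sheet([10, 20, 30]) stores the text =SUM(A2:A4) in A5. openpyxl does not evaluate formulas itself; the total appears once the file is opened in a spreadsheet application.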
Financial Model Standards
| Blue text | Hardcoded inputs
| Black text | Formulas
| Green text | Links to other sheets
| Yellow fill | Needs attention
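One way these color conventions might be applied with openpyxl. The exact hex shades, the helper names, and the "!" heuristic for detecting cross-sheet links are all assumptions, not part of the standard itself.

```python
from openpyxl.styles import Font, PatternFill

# aRGB hex codes chosen for illustration; the exact shades are an assumption.
BLUE_INPUT = Font(color="FF0000FF")
BLACK_FORMULA = Font(color="FF000000")
GREEN_LINK = Font(color="FF008000")
YELLOW_FLAG = PatternFill(fill_type="solid", start_color="FFFFFF00")


def style_cell(cell):
    """Pick a font color from the kind of content the cell holds."""
    value = cell.value
    if isinstance(value, str) and value.startswith("="):
        # Rough heuristic: "!" in a formula usually marks a sheet reference.
        cell.font = GREEN_LINK if "!" in value else BLACK_FORMULA
    elif isinstance(value, (int, float)):
        cell.font = BLUE_INPUT  # hardcoded numeric input


def flag_for_review(cell):
    """Apply the yellow fill that marks a cell as needing attention."""
    cell.fill = YELLOW_FLAG
```

Text labels are left unstyled on purpose, since the convention covers inputs and formulas only.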
Common Formula Errors
| #REF! | Invalid cell reference
| #DIV/0! | Division by zero
| #VALUE! | Wrong data type
| #NAME? | Unknown function name
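A sketch of how a workbook could be scanned for the four error codes above with openpyxl. The function names are hypothetical. The scan relies on cached results, so it assumes the file was last saved by a spreadsheet application that calculated the formulas.

```python
from openpyxl import load_workbook

ERROR_CODES = {"#REF!", "#DIV/0!", "#VALUE!", "#NAME?"}


def find_formula_errors(workbook):
    """Return (sheet title, cell coordinate, error code) for each error cell."""
    found = []
    for sheet in workbook.worksheets:
        for row in sheet.iter_rows():
            for cell in row:
                if isinstance(cell.value, str) and cell.value in ERROR_CODES:
                    found.append((sheet.title, cell.coordinate, cell.value))
    return found


def scan_file(path):
    """Load cached values (not formula text) and report error cells."""
    # data_only=True returns the values stored at the last save.
    return find_formula_errors(load_workbook(path, data_only=True))
```

An empty list means none of the listed codes were found. It does not prove the formulas are correct.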
Word Processing
Word Tools
| Text extraction | pandoc
| Create new | python-docx or docx-js
| Simple edits | python-docx
| Tracked changes | Direct XML editing
Document Structure
| word/document.xml | Main content
| word/comments.xml | Comments
| word/media/ | Images
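As an illustration of reading word/document.xml directly, the sketch below pulls paragraph text out with the standard library. The function name docx_paragraphs is hypothetical. Deleted tracked text (w:delText) is deliberately not included, and paragraphs nested inside text boxes may be reported more than once.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, bound to the "w" prefix in document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(path):
    """Return the visible text of each paragraph in word/document.xml."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        runs = paragraph.iter(f"{{{W_NS}}}t")
        paragraphs.append("".join(run.text or "" for run in runs))
    return paragraphs
```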
Tracked Changes (Redlining)
| Deletion | <w:del><w:r><w:delText>...</w:delText></w:r></w:del>
| Insertion | <w:ins><w:r><w:t>...</w:t></w:r></w:ins>
Key concept: For professional/legal documents, use tracked changes XML rather than replacing text directly.
PowerPoint Processing
PowerPoint Tools
| Text extraction | markitdown
| Create new | pptxgenjs (JS) or python-pptx
| Edit existing | Direct XML or python-pptx
Slide Structure
| ppt/slides/slide{N}.xml | Slide content
| ppt/notesSlides/ | Speaker notes
| ppt/slideMasters/ | Master templates
| ppt/media/ | Images
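The slide parts listed above can be read without any PowerPoint library. The sketch below collects the text runs (a:t elements) from each ppt/slides/slide{N}.xml part; the function name slide_texts is hypothetical.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# DrawingML namespace, which holds the <a:t> text runs inside slides.
A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
SLIDE_PART = re.compile(r"ppt/slides/slide(\d+)\.xml")


def slide_texts(path):
    """Map each slide part number to the text runs found in that part."""
    texts = {}
    with zipfile.ZipFile(path) as archive:
        for name in archive.namelist():
            match = SLIDE_PART.fullmatch(name)
            if match is None:
                continue  # skip relationship files, media, masters, etc.
            root = ET.fromstring(archive.read(name))
            runs = [node.text or "" for node in root.iter(f"{{{A_NS}}}t")]
            texts[int(match.group(1))] = runs
    return dict(sorted(texts.items()))
```

The number in the file name is not guaranteed to match the on-screen slide order, which is recorded in ppt/presentation.xml.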
Design Principles
| Fonts | Use web-safe: Arial, Helvetica, Georgia
| Layout | Two-column preferred, avoid vertical stacking
| Hierarchy | Size, weight, color for emphasis
| Consistency | Repeat patterns across slides
Converting Between Formats
| Any → PDF | LibreOffice headless
| PDF → Images | pdftoppm
| DOCX → Markdown | pandoc
| Any → Text | Format-specific extractor: pdftotext or pdfplumber (PDF), pandoc (DOCX), markitdown (PPTX), pandas (XLSX)
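The first three conversions could be scripted roughly as below. The function names are hypothetical. The sketch assumes the LibreOffice binary is on the PATH as soffice (some installs expose it as libreoffice) and that pdftoppm and pandoc are installed.

```python
import subprocess


def to_pdf_command(source, out_dir):
    """Any -> PDF with headless LibreOffice."""
    return ["soffice", "--headless", "--convert-to", "pdf",
            "--outdir", str(out_dir), str(source)]


def pdf_to_png_command(pdf, prefix, dpi=150):
    """PDF -> one PNG per page; pdftoppm appends the page number to prefix."""
    return ["pdftoppm", "-png", "-r", str(dpi), str(pdf), str(prefix)]


def docx_to_markdown_command(docx, markdown):
    """DOCX -> Markdown with pandoc."""
    return ["pandoc", str(docx), "-t", "markdown", "-o", str(markdown)]


def run(command):
    """Run a conversion and raise CalledProcessError if the tool fails."""
    subprocess.run(command, check=True)
```

Building the argument list separately from running it keeps each command easy to inspect or log before it is executed.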
Best Practices
| Use formulas in Excel | Dynamic calculations
| Preserve formatting on edit | Don't lose styles
| Test output opens correctly | Catch corruption early
| Use tracked changes for contracts | Audit trail
| Extract to markdown for analysis | Easier to process
Common Packages
| Python | pypdf, pdfplumber, openpyxl, python-docx, python-pptx
| JavaScript | docx, pptxgenjs
| CLI | pandoc, qpdf, pdftotext, libreoffice