# Office to Markdown Skill

## Overview

This skill converts Office documents and related formats to Markdown using markitdown, Microsoft's open-source document-to-Markdown converter. The result is plain text, which makes Office content searchable, version-controllable, and easy to feed to AI tools.

## How to Use

1. Provide the Office file (Word, Excel, PowerPoint, PDF, etc.)
2. Optionally specify conversion options
3. I'll convert it to clean Markdown

**Example prompts:**

- "Convert this Word document to Markdown"
- "Turn this PowerPoint into Markdown notes"
- "Extract content from this PDF as Markdown"
- "Convert this Excel file to Markdown tables"

## Domain Knowledge

### markitdown Fundamentals

```python
from markitdown import MarkItDown

# Initialize converter
md = MarkItDown()

# Convert file
result = md.convert("document.docx")
print(result.text_content)

# Save to file (explicit UTF-8 so non-ASCII text survives on every platform)
with open("output.md", "w", encoding="utf-8") as f:
    f.write(result.text_content)
```

### Supported Formats

| Format | Extension | Notes |
|--------|-----------|-------|
| Word | .docx | Full text, tables, basic formatting |
| Excel | .xlsx | Converts to Markdown tables |
| PowerPoint | .pptx | Slides as sections |
| PDF | .pdf | Text extraction |
| HTML | .html | Clean markdown |
| Images | .jpg, .png | OCR with vision model |
| Audio | .mp3, .wav | Transcription |
| ZIP | .zip | Processes contained files |

### Basic Usage

#### Python API

```python
from markitdown import MarkItDown

# Simple conversion
md = MarkItDown()
result = md.convert("document.docx")

# Access content
markdown_text = result.text_content

# With options
md = MarkItDown(
    llm_client=None,  # Optional LLM for enhanced processing
    llm_model=None,   # Model name if using LLM
)
```

#### Command Line

```bash
# Install
pip install markitdown

# Convert file (redirect stdout to a file)
markitdown document.docx > output.md

# Or with output file
markitdown document.docx -o output.md
```

### Word Document Conversion

```python
from markitdown import MarkItDown

md = MarkItDown()

# Convert Word document
result = md.convert("report.docx")

# Output preserves:
# - Headings (as # headers)
# - Bold/italic formatting
# - Lists (bulleted and numbered)
# - Tables (as markdown tables)
# - Hyperlinks
print(result.text_content)
```

**Example Output:**
```markdown
# Annual Report 2024

## Executive Summary

This report summarizes the key achievements and challenges...

## Key Metrics

| Metric | 2023 | 2024 | Change |
|--------|------|------|--------|
| Revenue | $10M | $12M | +20% |
| Users | 50K | 75K | +50% |

## Detailed Analysis

The following sections provide...
```

### Excel Conversion

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("data.xlsx")

# Each sheet becomes a section
# Data becomes markdown tables
print(result.text_content)
```

**Example Output:**
```markdown
## Sheet1

| Name | Department | Salary |
|------|------------|--------|
| John | Engineering | $80,000 |
| Jane | Marketing | $75,000 |

## Sheet2

| Product | Q1 | Q2 | Q3 | Q4 |
|---------|----|----|----|----|
| Widget A | 100 | 120 | 150 | 180 |
```

### PowerPoint Conversion

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("presentation.pptx")

# Each slide becomes a section
# Speaker notes included if present
print(result.text_content)
```

**Example Output:**
```markdown
# Slide 1: Company Overview

Our mission is to...

## Key Points

- Innovation first
- Customer focused
- Global reach

# Slide 2: Market Analysis

The market opportunity is significant...

**Notes:** Mention the competitor analysis here
```

### PDF Conversion

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("document.pdf")

# Extracts text content
# Tables converted where detected
print(result.text_content)
```

### Image Conversion (with Vision Model)

```python
from markitdown import MarkItDown
import anthropic

# Use Claude for image description
# Note: markitdown's own documentation shows an OpenAI client here, and the
# library appears to call the client through an OpenAI-style chat-completions
# interface. Verify that an Anthropic client works with your markitdown
# version before relying on this.
client = anthropic.Anthropic()
md = MarkItDown(llm_client=client, llm_model="claude-sonnet-4-20250514")

result = md.convert("diagram.png")
print(result.text_content)
# Output: Description of the image content
```
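The conversion sections above all go through the same `convert` call; what changes is the kind of input and whether a vision model is involved. As a rough pre-flight check, the sketch below maps a file extension to the format labels in the Supported Formats table. `FORMAT_BY_EXTENSION` and `describe_input` are hypothetical names for illustration, not part of markitdown, and the mapping simply mirrors the table rather than querying the library.

```python
from pathlib import Path

# Extension -> format label, copied from the Supported Formats table.
# Illustrative only: markitdown does not expose this mapping.
FORMAT_BY_EXTENSION = {
    ".docx": "Word",
    ".xlsx": "Excel",
    ".pptx": "PowerPoint",
    ".pdf": "PDF",
    ".html": "HTML",
    ".jpg": "Images",
    ".png": "Images",
    ".mp3": "Audio",
    ".wav": "Audio",
    ".zip": "ZIP",
}


def describe_input(path):
    """Return (format_label, needs_llm); (None, False) if the extension is unlisted."""
    label = FORMAT_BY_EXTENSION.get(Path(path).suffix.lower())
    # Images are flagged because this document pairs them with a vision model.
    return label, label == "Images"
```

A caller might use the second value to decide whether to construct `MarkItDown` with an `llm_client` before converting.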
### Batch Conversion

```python
from markitdown import MarkItDown
from pathlib import Path

def batch_convert(input_dir, output_dir):
    """Convert all Office files to Markdown."""
    md = MarkItDown()
    input_path = Path(input_dir)
    output_path = Path(output_dir)
    output_path.mkdir(exist_ok=True)

    extensions = ['.docx', '.xlsx', '.pptx', '.pdf']

    for ext in extensions:
        for file in input_path.glob(f'*{ext}'):
            try:
                result = md.convert(str(file))
                output_file = output_path / f"{file.stem}.md"
                # Explicit UTF-8 so non-ASCII text survives on every platform
                with open(output_file, 'w', encoding='utf-8') as f:
                    f.write(result.text_content)
                print(f"Converted: {file.name}")
            except Exception as e:
                # Keep going so one bad file does not stop the batch
                print(f"Error converting {file.name}: {e}")

batch_convert('./documents', './markdown')
```
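One thing to watch in `batch_convert`: the output name is built from `file.stem`, so `report.docx` and `report.pdf` both map to `report.md` and the later conversion overwrites the earlier one. A possible way around this, sketched below, is to plan the output paths first and keep the source extension in the name. `plan_outputs` and `OFFICE_EXTENSIONS` are hypothetical helpers, and the `name.ext.md` naming scheme is just one option.

```python
from pathlib import Path

OFFICE_EXTENSIONS = {".docx", ".xlsx", ".pptx", ".pdf"}


def plan_outputs(input_dir, output_dir):
    """Map each source file to a unique Markdown path without converting anything."""
    input_path = Path(input_dir)
    output_path = Path(output_dir)
    plan = {}
    for file in sorted(input_path.iterdir()):
        if file.is_file() and file.suffix.lower() in OFFICE_EXTENSIONS:
            # report.docx -> report.docx.md, report.pdf -> report.pdf.md
            plan[file] = output_path / f"{file.name}.md"
    return plan
```

The conversion loop can then iterate over `plan.items()` and write each result to its planned path.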
## Best Practices

1. **Check Output Quality**: Review converted Markdown for accuracy
2. **Handle Tables**: Complex tables may need manual adjustment
3. **Preserve Structure**: Use consistent heading levels in source docs
4. **Image Handling**: Consider using vision models for important images
5. **Version Control**: Store converted Markdown in Git for tracking

## Common Patterns

### Document Archive

```python
import os
from datetime import datetime
from markitdown import MarkItDown

def archive_document(doc_path, archive_dir):
    """Convert and archive Office document to Markdown."""
    md = MarkItDown()
    result = md.convert(doc_path)

    # Create archive structure
    os.makedirs(archive_dir, exist_ok=True)
    date_str = datetime.now().strftime('%Y-%m-%d')
    filename = os.path.basename(doc_path)
    base_name = os.path.splitext(filename)[0]

    # Save with metadata (YAML front matter, then the converted text)
    output_content = f"""---
source: {filename}
converted: {date_str}
---

{result.text_content}
"""

    output_path = os.path.join(archive_dir, f"{base_name}.md")
    with open(output_path, 'w', encoding='utf-8') as f:
        f.write(output_content)

    return output_path
```

### AI-Ready Corpus

```python
from markitdown import MarkItDown
from pathlib import Path
import json

def create_ai_corpus(doc_folder, output_file):
    """Convert documents to JSON corpus for AI training/RAG."""
    md = MarkItDown()
    corpus = []

    # '**/*' walks the folder recursively
    for doc in Path(doc_folder).glob('**/*'):
        if doc.suffix in ['.docx', '.pdf', '.pptx', '.xlsx']:
            try:
                result = md.convert(str(doc))
                corpus.append({
                    'source': str(doc),
                    'filename': doc.name,
                    'content': result.text_content,
                    'type': doc.suffix[1:]
                })
            except Exception as e:
                print(f"Skipped {doc.name}: {e}")

    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(corpus, f, indent=2)

    print(f"Created corpus with {len(corpus)} documents")
    return corpus
```

## Examples

### Example 1: Convert Documentation Suite

```python
from markitdown import MarkItDown
from pathlib import Path

def convert_docs_to_wiki(docs_folder, wiki_folder):
    """Convert all Office docs to markdown wiki structure."""
    md = MarkItDown()
    docs_path = Path(docs_folder)
    wiki_path = Path(wiki_folder)

    # Create wiki structure
    wiki_path.mkdir(exist_ok=True)

    # Create index
    index_content = "# Documentation Index\n\n"

    for doc in sorted(docs_path.glob('**/*.docx')):
        try:
            result = md.convert(str(doc))

            # Create relative path in wiki
            rel_path = doc.relative_to(docs_path)
            output_file = wiki_path / rel_path.with_suffix('.md')
            output_file.parent.mkdir(parents=True, exist_ok=True)

            # Write markdown
            with open(output_file, 'w', encoding='utf-8') as f:
                f.write(result.text_content)

            # Add to index (forward slashes so links also work from Windows paths)
            link = str(rel_path.with_suffix('.md')).replace('\\', '/')
            index_content += f"- [{doc.stem}]({link})\n"

            print(f"Converted: {doc.name}")
        except Exception as e:
            print(f"Error: {doc.name} - {e}")

    # Write index
    with open(wiki_path / 'index.md', 'w', encoding='utf-8') as f:
        f.write(index_content)

convert_docs_to_wiki('./company_docs', './wiki')
```

### Example 2: Meeting Notes Processor

```python
from markitdown import MarkItDown
from datetime import datetime

def process_meeting_notes(pptx_path):
    """Extract and structure meeting notes from PowerPoint."""
    md = MarkItDown()
    result = md.convert(pptx_path)

    # Parse the markdown
    content = result.text_content

    # Extract sections
    sections = {
        'attendees': [],
        'agenda': [],
        'decisions': [],
        'action_items': []
    }

    current_section = None
    for line in content.split('\n'):
        line_lower = line.lower()
        # A line mentioning a keyword switches the active section;
        # bullet lines that follow are collected under it.
        if 'attendee' in line_lower or 'participant' in line_lower:
            current_section = 'attendees'
        elif 'agenda' in line_lower:
            current_section = 'agenda'
        elif 'decision' in line_lower:
            current_section = 'decisions'
        elif 'action' in line_lower:
            current_section = 'action_items'
        elif line.strip().startswith(('-', '*', '•')) and current_section:
            sections[current_section].append(line.strip()[1:].strip())

    # Generate structured output
    output = f"""# Meeting Notes

Date: {datetime.now().strftime('%Y-%m-%d')}
Source: {pptx_path}

## Attendees
{chr(10).join('- ' + a for a in sections['attendees'])}

## Agenda
{chr(10).join('- ' + a for a in sections['agenda'])}

## Decisions Made
{chr(10).join('- ' + d for d in sections['decisions'])}

## Action Items
{chr(10).join('- [ ] ' + a for a in sections['action_items'])}
"""

    return output

notes = process_meeting_notes('team_meeting.pptx')
print(notes)
```

### Example 3: Excel to Documentation

```python
from datetime import datetime
from markitdown import MarkItDown

def excel_to_data_dictionary(xlsx_path):
    """Convert Excel data model to data dictionary documentation."""
    md = MarkItDown()
    result = md.convert(xlsx_path)

    # Add documentation structure
    doc = f"""# Data Dictionary

Generated from: {xlsx_path}

{result.text_content}

## Usage Notes

- All tables are derived from the source Excel file
- Review data types and constraints before use
- Contact data team for clarifications

## Change Log

| Date | Change | Author |
|---|---|---|
| {datetime.now().strftime('%Y-%m-%d')} | Initial generation | Auto |
"""

    return doc

documentation = excel_to_data_dictionary('data_model.xlsx')
with open('data_dictionary.md', 'w', encoding='utf-8') as f:
    f.write(documentation)
```
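The examples in this section post-process `result.text_content` as plain text. A reusable step along the same lines is to split the converted Markdown at its headings, for instance to handle each Excel sheet section on its own. The sketch below assumes headings are ATX-style lines starting with `#`; `split_sections` is a hypothetical helper, and it does not account for `#` lines inside fenced code blocks.

```python
import re

# ATX heading: one to six '#' characters, whitespace, then the title
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_sections(markdown_text):
    """Split Markdown into (heading, body) pairs.

    Text before the first heading is returned with heading None.
    """
    sections = []
    heading, body = None, []

    def flush():
        # Skip an empty preamble, but keep headed sections even if their body is empty
        if heading is not None or any(line.strip() for line in body):
            sections.append((heading, "\n".join(body).strip()))

    for line in markdown_text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            heading, body = match.group(2), []
        else:
            body.append(line)
    flush()
    return sections
```

Applied to converted Excel output, each returned pair would hold a sheet name and that sheet's table text.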
## Limitations

- Complex formatting may be simplified
- Images are not embedded (use vision model for descriptions)
- Some table structures may not convert perfectly
- Track changes in Word are not preserved
- Comments may not be extracted
## Installation

```bash
pip install markitdown

# For image/audio processing
# (quotes keep shells such as zsh from expanding the brackets)
pip install "markitdown[all]"

# For specific features
# (extra names can differ between markitdown releases; check the PyPI page
# if pip reports that an extra is not provided)
pip install "markitdown[images]"  # Image OCR
pip install "markitdown[audio]"   # Audio transcription
```
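After installing, it can help to confirm that the package is importable from the interpreter that will run the conversions. A minimal sketch using only the standard library follows; `is_installed` and `missing_modules` are hypothetical helper names, and the check says nothing about which optional extras are present.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


def missing_modules(module_names):
    """Return the subset of module names that cannot be located."""
    return [name for name in module_names if not is_installed(name)]
```

For example, `missing_modules(["markitdown"])` returns an empty list when the package is available in the current environment.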
## Resources

- GitHub Repository
- PyPI Package
- Supported Formats