Office to Markdown Skill

Overview

This skill converts various Office formats to Markdown using markitdown, Microsoft's open-source tool for converting documents to Markdown. It is well suited to making Office content searchable, version-controllable, and AI-friendly.

How to Use

1. Provide the Office file (Word, Excel, PowerPoint, PDF, etc.)
2. Optionally specify conversion options
3. I'll convert it to clean Markdown

Example prompts:

- "Convert this Word document to Markdown"
- "Turn this PowerPoint into Markdown notes"
- "Extract content from this PDF as Markdown"
- "Convert this Excel file to Markdown tables"

Domain Knowledge

markitdown Fundamentals

    from markitdown import MarkItDown

    # Initialize converter
    md = MarkItDown()

    # Convert file
    result = md.convert("document.docx")
    print(result.text_content)

    # Save to file (explicit UTF-8 avoids platform-dependent default encodings)
    with open("output.md", "w", encoding="utf-8") as f:
        f.write(result.text_content)

Supported Formats

| Format     | Extension  | Notes                               |
|------------|------------|-------------------------------------|
| Word       | .docx      | Full text, tables, basic formatting |
| Excel      | .xlsx      | Converts to Markdown tables         |
| PowerPoint | .pptx      | Slides as sections                  |
| PDF        | .pdf       | Text extraction                     |
| HTML       | .html      | Clean markdown                      |
| Images     | .jpg, .png | OCR with vision model               |
| Audio      | .mp3, .wav | Transcription                       |
| ZIP        | .zip       | Processes contained files           |

Basic Usage

Python API

    from markitdown import MarkItDown

    # Simple conversion
    md = MarkItDown()
    result = md.convert("document.docx")

    # Access content
    markdown_text = result.text_content

    # With options
    md = MarkItDown(
        llm_client=None,  # Optional LLM for enhanced processing
        llm_model=None    # Model name if using LLM
    )

Command Line

    # Install
    pip install markitdown

    # Convert file
    markitdown document.docx > output.md

    # Or with output file
    markitdown document.docx -o output.md

Word Document Conversion

    from markitdown import MarkItDown

    md = MarkItDown()

    # Convert Word document
    result = md.convert("report.docx")

    # Output preserves:
    # - Headings (as # headers)
    # - Bold/italic formatting
    # - Lists (bulleted and numbered)
    # - Tables (as markdown tables)
    # - Hyperlinks

    print(result.text_content)

Example Output:

    # Annual Report 2024

    ## Executive Summary
    This report summarizes the key achievements and challenges...

    | Revenue | $10M | $12M | +20% |
    | Users | 50K | 75K | +50% |

    ## Detailed Analysis
    The following sections provide...

Excel Conversion

    from markitdown import MarkItDown

    md = MarkItDown()
    result = md.convert("data.xlsx")

    # Each sheet becomes a section
    # Data becomes markdown tables

    print(result.text_content)

Example Output:

    ## Sheet2

    | Product | Q1 | Q2 | Q3 | Q4 |
    |---------|----|----|----|----|
    | Widget A | 100 | 120 | 150 | 180 |

PowerPoint Conversion

    from markitdown import MarkItDown

    md = MarkItDown()
    result = md.convert("presentation.pptx")

    # Each slide becomes a section
    # Speaker notes included if present

    print(result.text_content)

Example Output:

    ## Slide 1: Company Overview
    Our mission is to...

    ### Key Points
    - Innovation first
    - Customer focused
    - Global reach

    ## Slide 2: Market Analysis
    The market opportunity is significant...

    **Notes:** Mention the competitor analysis here

PDF Conversion

    from markitdown import MarkItDown

    md = MarkItDown()
    result = md.convert("document.pdf")

    # Extracts text content
    # Tables converted where detected

    print(result.text_content)

Image Conversion (with Vision Model)

    from markitdown import MarkItDown
    import anthropic

    # Use Claude for image description.
    # Note: markitdown's documented examples pass an OpenAI-style client
    # (one exposing chat.completions.create). If the Anthropic client is not
    # accepted directly, wrap it in an OpenAI-compatible adapter.
    client = anthropic.Anthropic()

    md = MarkItDown(
        llm_client=client,
        llm_model="claude-sonnet-4-20250514"
    )

    result = md.convert("diagram.png")
    print(result.text_content)

    # Output: Description of the image content
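Because every format above goes through the same `convert` call, it can help to check a file's extension before converting. The sketch below is illustrative only: `EXPECTED_OUTPUT` and `describe_conversion` are hypothetical names, not part of markitdown, and the mapping simply restates the Supported Formats table rather than querying the library.

```python
from pathlib import Path

# Restates the Supported Formats table; this is not read from markitdown itself.
EXPECTED_OUTPUT = {
    ".docx": "Full text, tables, basic formatting",
    ".xlsx": "Markdown tables",
    ".pptx": "Slides as sections",
    ".pdf": "Text extraction",
    ".html": "Clean markdown",
    ".jpg": "OCR with vision model",
    ".png": "OCR with vision model",
    ".mp3": "Transcription",
    ".wav": "Transcription",
    ".zip": "Processes contained files",
}


def describe_conversion(path):
    """Return the expected result for a file, or None if its extension is not listed."""
    return EXPECTED_OUTPUT.get(Path(path).suffix.lower())
```

A caller could skip or log files for which `describe_conversion` returns `None` instead of passing them to `convert`.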

Batch Conversion

    from markitdown import MarkItDown
    from pathlib import Path

    def batch_convert(input_dir, output_dir):
        """Convert all Office files to Markdown."""
        md = MarkItDown()
        input_path = Path(input_dir)
        output_path = Path(output_dir)
        output_path.mkdir(exist_ok=True)

        extensions = ['.docx', '.xlsx', '.pptx', '.pdf']

        for ext in extensions:
            # Only the top level of input_dir is searched
            for file in input_path.glob(f'*{ext}'):
                try:
                    result = md.convert(str(file))
                    output_file = output_path / f"{file.stem}.md"
                    with open(output_file, 'w', encoding='utf-8') as f:
                        f.write(result.text_content)
                    print(f"Converted: {file.name}")
                except Exception as e:
                    # Report the failure and continue with the next file
                    print(f"Error converting {file.name}: {e}")

    batch_convert('./documents', './markdown')
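`batch_convert` searches only the top level of `input_dir` and names each output after the file stem, so `report.docx` and `report.pdf` would both be written to `report.md`. One way to plan output paths recursively while mirroring the folder layout is sketched below. `plan_outputs` is a hypothetical helper, and it deliberately performs no conversion, so it can be checked without markitdown installed.

```python
from pathlib import Path


def plan_outputs(input_dir, output_dir, extensions=(".docx", ".xlsx", ".pptx", ".pdf")):
    """Map each matching source file to a unique Markdown path under output_dir."""
    input_path = Path(input_dir)
    output_path = Path(output_dir)
    plan = {}
    for source in sorted(input_path.rglob("*")):
        if source.is_file() and source.suffix.lower() in extensions:
            rel = source.relative_to(input_path)
            # Keep the original extension in the name so that report.docx and
            # report.pdf do not both map to report.md.
            plan[source] = output_path / rel.with_name(rel.name + ".md")
    return plan
```

Each source and target pair can then be fed to the conversion loop, creating `target.parent` with `mkdir(parents=True, exist_ok=True)` before writing.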

Best Practices

- Check Output Quality: Review converted Markdown for accuracy
- Handle Tables: Complex tables may need manual adjustment
- Preserve Structure: Use consistent heading levels in source docs
- Image Handling: Consider using vision models for important images
- Version Control: Store converted Markdown in Git for tracking

Common Patterns

Document Archive

    import os
    from datetime import datetime
    from markitdown import MarkItDown

    def archive_document(doc_path, archive_dir):
        """Convert and archive Office document to Markdown."""
        md = MarkItDown()
        result = md.convert(doc_path)

        # Create archive structure
        os.makedirs(archive_dir, exist_ok=True)
        date_str = datetime.now().strftime('%Y-%m-%d')
        filename = os.path.basename(doc_path)
        base_name = os.path.splitext(filename)[0]

        # Save with metadata (front matter followed by the converted text)
        output_content = f"""---
    source: {filename}
    converted: {date_str}
    ---

    {result.text_content}
    """

        output_path = os.path.join(archive_dir, f"{base_name}.md")
        with open(output_path, 'w', encoding='utf-8') as f:
            f.write(output_content)

        return output_path

AI-Ready Corpus

    from markitdown import MarkItDown
    from pathlib import Path
    import json

    def create_ai_corpus(doc_folder, output_file):
        """Convert documents to JSON corpus for AI training/RAG."""
        md = MarkItDown()
        corpus = []

        # Walk the folder recursively and keep only supported document types
        for doc in Path(doc_folder).glob('**/*'):
            if doc.suffix in ['.docx', '.pdf', '.pptx', '.xlsx']:
                try:
                    result = md.convert(str(doc))
                    corpus.append({
                        'source': str(doc),
                        'filename': doc.name,
                        'content': result.text_content,
                        'type': doc.suffix[1:]
                    })
                except Exception as e:
                    print(f"Skipped {doc.name}: {e}")

        with open(output_file, 'w', encoding='utf-8') as f:
            json.dump(corpus, f, indent=2)

        print(f"Created corpus with {len(corpus)} documents")
        return corpus

Examples

Example 1: Convert Documentation Suite

    from markitdown import MarkItDown
    from pathlib import Path

    def convert_docs_to_wiki(docs_folder, wiki_folder):
        """Convert all Office docs to markdown wiki structure."""
        md = MarkItDown()
        docs_path = Path(docs_folder)
        wiki_path = Path(wiki_folder)

        # Create wiki structure
        wiki_path.mkdir(exist_ok=True)

        # Create index
        index_content = "# Documentation Index\n\n"

        for doc in sorted(docs_path.glob('**/*.docx')):
            try:
                result = md.convert(str(doc))

                # Create relative path in wiki
                rel_path = doc.relative_to(docs_path)
                output_file = wiki_path / rel_path.with_suffix('.md')
                output_file.parent.mkdir(parents=True, exist_ok=True)

                # Write markdown
                with open(output_file, 'w', encoding='utf-8') as f:
                    f.write(result.text_content)

                # Add to index (forward slashes so links also work on Windows)
                link = str(rel_path.with_suffix('.md')).replace('\\', '/')
                index_content += f"- [{doc.stem}]({link})\n"

                print(f"Converted: {doc.name}")
            except Exception as e:
                print(f"Error: {doc.name} - {e}")

        # Write index
        with open(wiki_path / 'index.md', 'w', encoding='utf-8') as f:
            f.write(index_content)

    convert_docs_to_wiki('./company_docs', './wiki')

Example 2: Meeting Notes Processor

    from markitdown import MarkItDown
    from datetime import datetime

    def process_meeting_notes(pptx_path):
        """Extract and structure meeting notes from PowerPoint."""
        md = MarkItDown()
        result = md.convert(pptx_path)

        # Parse the markdown
        content = result.text_content

        # Extract sections
        sections = {
            'attendees': [],
            'agenda': [],
            'decisions': [],
            'action_items': []
        }

        current_section = None
        for line in content.split('\n'):
            line_lower = line.lower()
            # A line mentioning a section keyword switches the current section
            if 'attendee' in line_lower or 'participant' in line_lower:
                current_section = 'attendees'
            elif 'agenda' in line_lower:
                current_section = 'agenda'
            elif 'decision' in line_lower:
                current_section = 'decisions'
            elif 'action' in line_lower:
                current_section = 'action_items'
            elif line.strip().startswith(('-', '*', '•')) and current_section:
                # Bullet lines are collected under the current section
                sections[current_section].append(line.strip()[1:].strip())

        # Generate structured output
        output = f"""# Meeting Notes

    Date: {datetime.now().strftime('%Y-%m-%d')}
    Source: {pptx_path}

    ## Attendees
    {chr(10).join('- ' + a for a in sections['attendees'])}

    ## Agenda
    {chr(10).join('- ' + a for a in sections['agenda'])}

    ## Decisions Made
    {chr(10).join('- ' + d for d in sections['decisions'])}

    ## Action Items
    {chr(10).join('- [ ] ' + a for a in sections['action_items'])}
    """

        return output

    notes = process_meeting_notes('team_meeting.pptx')
    print(notes)

Example 3: Excel to Documentation

    from datetime import datetime
    from markitdown import MarkItDown

    def excel_to_data_dictionary(xlsx_path):
        """Convert Excel data model to data dictionary documentation."""
        md = MarkItDown()
        result = md.convert(xlsx_path)

        # Add documentation structure
        doc = f"""# Data Dictionary

    Generated from: {xlsx_path}

    {result.text_content}

    ## Usage Notes

    - All tables are derived from the source Excel file
    - Review data types and constraints before use
    - Contact data team for clarifications

    ## Change Log

    | Date | Change | Author |
    |------|--------|--------|
    | {datetime.now().strftime('%Y-%m-%d')} | Initial generation | Auto |
    """

        return doc

    documentation = excel_to_data_dictionary('data_model.xlsx')
    with open('data_dictionary.md', 'w', encoding='utf-8') as f:
        f.write(documentation)

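If the data dictionary should document each sheet separately, the converted text can be split at its headings first. The sketch below assumes every sheet starts with a level-2 heading such as `## Sheet1`; that is an assumption about the converter's output, not something this document guarantees, and `split_sheets` is a hypothetical helper name.

```python
def split_sheets(markdown_text):
    """Split converted workbook text into {sheet_name: body}, keyed on '## ' headings."""
    sheets = {}
    current = None
    for line in markdown_text.splitlines():
        if line.startswith("## "):
            # A new level-2 heading starts a new sheet section
            current = line[3:].strip()
            sheets[current] = []
        elif current is not None:
            sheets[current].append(line)
    # Join each body and trim surrounding blank lines
    return {name: "\n".join(body).strip() for name, body in sheets.items()}
```

Text that appears before the first heading is ignored, which suits output where each table sits directly under its sheet name.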
Limitations

- Complex formatting may be simplified
- Images are not embedded (use vision model for descriptions)
- Some table structures may not convert perfectly
- Track changes in Word are not preserved
- Comments may not be extracted
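Since some tables may not convert cleanly, a quick structural check can flag rows worth reviewing by hand. This is a minimal sketch that assumes pipe-delimited tables and does not handle escaped pipes (`\|`) inside cells; `find_ragged_rows` is a hypothetical helper name.

```python
def find_ragged_rows(markdown_text):
    """Return 1-based line numbers of table rows whose cell count differs
    from the first row of the table they belong to."""
    ragged = []
    expected = None  # cell count of the current table's first row
    for number, line in enumerate(markdown_text.splitlines(), start=1):
        stripped = line.strip()
        is_row = len(stripped) > 1 and stripped.startswith("|") and stripped.endswith("|")
        if not is_row:
            expected = None  # any non-table line ends the current table
            continue
        cells = stripped[1:-1].split("|")
        if expected is None:
            expected = len(cells)
        elif len(cells) != expected:
            ragged.append(number)
    return ragged
```

An empty result only means the column counts are consistent; merged cells or misplaced values still need a visual review.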
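Installation

    # Base install
    pip install markitdown

    # With all optional dependencies (Office formats, PDF, audio)
    pip install 'markitdown[all]'

    # For specific features. Extra names depend on the release; the name below
    # is the one used by markitdown 0.1.x, so check the package metadata for
    # your version.
    pip install 'markitdown[audio-transcription]'  # Audio transcription

    # Image descriptions do not appear to have a dedicated extra in 0.1.x;
    # they rely on the llm_client passed to MarkItDown.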
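After installing, it can be useful to confirm from Python which version is present, since optional extras can differ between releases. The sketch below uses only the standard library; `installed_version` is a hypothetical helper name.

```python
from importlib import metadata


def installed_version(package="markitdown"):
    """Return the installed version string for a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None
```

Calling `installed_version()` before a batch run gives a clear early failure instead of an import error partway through.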
Resources
GitHub Repository
PyPI Package
Supported Formats
