table-extractor

安装量: 180
排名: #4777

安装

npx skills add https://github.com/claude-office-skills/skills --skill table-extractor

Table Extractor Skill Overview This skill enables precise extraction of tables from PDF documents using camelot - the gold standard for PDF table extraction. Handle complex tables with merged cells, borderless tables, and multi-page layouts with high accuracy. How to Use Provide the PDF containing tables Optionally specify pages or table detection method I'll extract tables as pandas DataFrames Example prompts: "Extract all tables from this PDF" "Get the table on page 5 of this report" "Extract borderless tables from this document" "Convert PDF tables to Excel format" Domain Knowledge camelot Fundamentals import camelot

Extract tables from PDF

tables

camelot . read_pdf ( 'document.pdf' )

Access results

print ( f"Found { len ( tables ) } tables" )

Get first table as DataFrame

df

tables [ 0 ] . df print ( df ) Extraction Methods Method Use Case Description lattice Bordered tables Detects table by lines/borders stream Borderless tables Uses text positioning

Lattice method (default) - for tables with visible borders

tables

camelot . read_pdf ( 'document.pdf' , flavor = 'lattice' )

Stream method - for borderless tables

tables

camelot . read_pdf ( 'document.pdf' , flavor = 'stream' ) Page Selection

Single page

tables

camelot . read_pdf ( 'document.pdf' , pages = '1' )

Multiple pages

tables

camelot . read_pdf ( 'document.pdf' , pages = '1,3,5' )

Page range

tables

camelot . read_pdf ( 'document.pdf' , pages = '1-5' )

All pages

tables

camelot . read_pdf ( 'document.pdf' , pages = 'all' ) Advanced Options Lattice Options tables = camelot . read_pdf ( 'document.pdf' , flavor = 'lattice' , line_scale = 40 ,

Line detection sensitivity

copy_text

[ 'h' , 'v' ] ,

Copy text across merged cells

shift_text

[ 'l' , 't' ] ,

Shift text alignment

split_text

True ,

Split text at newlines

flag_size

True ,

Flag super/subscripts

strip_text

'\n' ,

Characters to strip

process_background

False ,

Process background lines

) Stream Options tables = camelot . read_pdf ( 'document.pdf' , flavor = 'stream' , edge_tol = 500 ,

Edge tolerance

row_tol

10 ,

Row tolerance

column_tol

0 ,

Column tolerance

strip_text

'\n' ,

Characters to strip

) Table Area Specification

Extract from specific area (x1, y1, x2, y2)

Coordinates from bottom-left, in PDF points (72 points = 1 inch)

tables

camelot . read_pdf ( 'document.pdf' , table_areas = [ '72,720,540,400' ] ,

One area

)

Multiple areas

tables

camelot . read_pdf ( 'document.pdf' , table_areas = [ '72,720,540,400' , '72,380,540,200' ] , ) Column Specification

Manually specify column positions (for stream method)

tables

camelot . read_pdf ( 'document.pdf' , flavor = 'stream' , columns = [ '100,200,300,400' ] ,

X positions of column separators

) Working with Results import camelot tables = camelot . read_pdf ( 'document.pdf' ) for i , table in enumerate ( tables ) :

Access DataFrame

df

table . df

Table metadata

print ( f"Table { i + 1 } :" ) print ( f" Page: { table . page } " ) print ( f" Accuracy: { table . accuracy } " ) print ( f" Whitespace: { table . whitespace } " ) print ( f" Order: { table . order } " ) print ( f" Shape: { df . shape } " )

Parsing report

report

table . parsing_report print ( f" Report: { report } " ) Export Options import camelot tables = camelot . read_pdf ( 'document.pdf' )

Export to CSV

tables [ 0 ] . to_csv ( 'table.csv' )

Export to Excel

tables [ 0 ] . to_excel ( 'table.xlsx' )

Export to JSON

tables [ 0 ] . to_json ( 'table.json' )

Export to HTML

tables [ 0 ] . to_html ( 'table.html' )

Export all tables

for i , table in enumerate ( tables ) : table . to_excel ( f'table_ { i + 1 } .xlsx' ) Visual Debugging import camelot

Enable visual debugging

tables

camelot . read_pdf ( 'document.pdf' )

Plot detected table areas

camelot . plot ( tables [ 0 ] , kind = 'contour' ) . show ( )

Plot text on table

camelot . plot ( tables [ 0 ] , kind = 'text' ) . show ( )

Plot detected lines (lattice only)

camelot . plot ( tables [ 0 ] , kind = 'joint' ) . show ( ) camelot . plot ( tables [ 0 ] , kind = 'line' ) . show ( )

Save plot

fig

camelot . plot ( tables [ 0 ] ) fig . savefig ( 'debug.png' ) Handling Multi-page Tables import camelot import pandas as pd def extract_multipage_table ( pdf_path , pages = 'all' ) : """Extract and combine tables that span multiple pages.""" tables = camelot . read_pdf ( pdf_path , pages = pages )

Group tables by similar structure (columns)

table_groups

{ } for table in tables : cols = tuple ( table . df . columns ) if cols not in table_groups : table_groups [ cols ] = [ ] table_groups [ cols ] . append ( table . df )

Combine similar tables

combined

[ ] for cols , dfs in table_groups . items ( ) : if len ( dfs )

1 :

Combine and deduplicate header rows

combined_df

pd
.
concat
(
dfs
,
ignore_index
=
True
)
combined
.
append
(
combined_df
)
else
:
combined
.
append
(
dfs
[
0
]
)
return
combined
Best Practices
Try Both Methods
Lattice for bordered, stream for borderless
Check Accuracy Score
Above 90% is usually good
Use Visual Debugging
Understand extraction results
Specify Areas
For PDFs with multiple table types
Handle Headers
First row often needs special treatment Common Patterns Batch Table Extraction import camelot from pathlib import Path import pandas as pd def batch_extract_tables ( input_dir , output_dir ) : """Extract tables from all PDFs in directory.""" input_path = Path ( input_dir ) output_path = Path ( output_dir ) output_path . mkdir ( exist_ok = True ) results = [ ] for pdf_file in input_path . glob ( '*.pdf' ) : try : tables = camelot . read_pdf ( str ( pdf_file ) , pages = 'all' ) for i , table in enumerate ( tables ) :

Skip low accuracy tables

if table . accuracy < 80 : continue output_file = output_path / f" { pdf_file . stem } table { i + 1 } .xlsx" table . to_excel ( str ( output_file ) ) results . append ( { 'source' : str ( pdf_file ) , 'table' : i + 1 , 'page' : table . page , 'accuracy' : table . accuracy , 'output' : str ( output_file ) } ) except Exception as e : results . append ( { 'source' : str ( pdf_file ) , 'error' : str ( e ) } ) return results Auto-detect Table Method import camelot def smart_extract_tables ( pdf_path , pages = '1' ) : """Try both methods and return best results."""

Try lattice first

lattice_tables

camelot . read_pdf ( pdf_path , pages = pages , flavor = 'lattice' )

Try stream

stream_tables

camelot . read_pdf ( pdf_path , pages = pages , flavor = 'stream' )

Compare and return best

results

[ ] if lattice_tables and lattice_tables [ 0 ] . accuracy

70 : results . extend ( lattice_tables ) elif stream_tables : results . extend ( stream_tables ) return results Examples Example 1: Financial Statement Extraction import camelot import pandas as pd def extract_financial_tables ( pdf_path ) : """Extract financial tables from annual report."""

Extract all tables

tables

camelot . read_pdf ( pdf_path , pages = 'all' , flavor = 'lattice' ) financial_data = { 'income_statement' : None , 'balance_sheet' : None , 'cash_flow' : None , 'other_tables' : [ ] } for table in tables : df = table . df text = df . to_string ( ) . lower ( )

Identify table type

if 'revenue' in text or 'sales' in text : if 'operating income' in text or 'net income' in text : financial_data [ 'income_statement' ] = df elif 'asset' in text and 'liabilities' in text : financial_data [ 'balance_sheet' ] = df elif 'cash flow' in text or 'operating activities' in text : financial_data [ 'cash_flow' ] = df else : financial_data [ 'other_tables' ] . append ( { 'page' : table . page , 'data' : df , 'accuracy' : table . accuracy } ) return financial_data financials = extract_financial_tables ( 'annual_report.pdf' ) if financials [ 'income_statement' ] is not None : print ( "Income Statement found:" ) print ( financials [ 'income_statement' ] ) Example 2: Scientific Data Extraction import camelot import pandas as pd def extract_research_data ( pdf_path , pages = 'all' ) : """Extract data tables from research paper."""

Try lattice for bordered tables

tables

camelot . read_pdf ( pdf_path , pages = pages , flavor = 'lattice' ) if not tables or all ( t . accuracy < 70 for t in tables ) :

Fall back to stream for borderless

tables

camelot . read_pdf ( pdf_path , pages = pages , flavor = 'stream' ) extracted_data = [ ] for table in tables : df = table . df

Clean up the DataFrame

Set first row as header if it looks like one

if not df . iloc [ 0 ] . str . contains ( r'\d' ) . any ( ) : df . columns = df . iloc [ 0 ] df = df [ 1 : ] df = df . reset_index ( drop = True ) extracted_data . append ( { 'page' : table . page , 'accuracy' : table . accuracy , 'data' : df } ) return extracted_data data = extract_research_data ( 'research_paper.pdf' ) for i , item in enumerate ( data ) : print ( f"Table { i + 1 } (Page { item [ 'page' ] } , Accuracy: { item [ 'accuracy' ] } %):" ) print ( item [ 'data' ] . head ( ) ) Example 3: Invoice Line Items import camelot def extract_invoice_items ( pdf_path ) : """Extract line items from invoice."""

Usually invoices have bordered tables

tables

camelot . read_pdf ( pdf_path , flavor = 'lattice' ) line_items = [ ] for table in tables : df = table . df

Look for table with typical invoice columns

header_text

' ' . join ( df . iloc [ 0 ] . astype ( str ) ) . lower ( ) if any ( term in header_text for term in [ 'quantity' , 'qty' , 'amount' , 'price' , 'description' ] ) :

This looks like a line items table

df . columns = df . iloc [ 0 ] df = df [ 1 : ] for _ , row in df . iterrows ( ) : item = { } for col in df . columns : col_lower = str ( col ) . lower ( ) value = row [ col ] if 'desc' in col_lower or 'item' in col_lower : item [ 'description' ] = value elif 'qty' in col_lower or 'quantity' in col_lower : item [ 'quantity' ] = value elif 'price' in col_lower or 'rate' in col_lower : item [ 'unit_price' ] = value elif 'amount' in col_lower or 'total' in col_lower : item [ 'amount' ] = value if item : line_items . append ( item ) return line_items items = extract_invoice_items ( 'invoice.pdf' ) for item in items : print ( item ) Example 4: Table Comparison import camelot import pandas as pd def compare_pdf_tables ( pdf1_path , pdf2_path ) : """Compare tables between two PDF versions.""" tables1 = camelot . read_pdf ( pdf1_path ) tables2 = camelot . read_pdf ( pdf2_path ) comparisons = [ ]

Match tables by shape and position

for t1 in tables1 : best_match = None best_score = 0 for t2 in tables2 : if t1 . df . shape == t2 . df . shape :

Calculate similarity

try : similarity = ( t1 . df == t2 . df ) . mean ( ) . mean ( ) if similarity

best_score : best_score = similarity best_match = t2 except : pass if best_match : comparisons . append ( { 'page1' : t1 . page , 'page2' : best_match . page , 'similarity' : best_score , 'identical' : best_score == 1.0 , 'diff' : pd . DataFrame ( t1 . df != best_match . df ) } ) return comparisons comparison = compare_pdf_tables ( 'report_v1.pdf' , 'report_v2.pdf' ) Limitations Encrypted PDFs not supported Image-based PDFs need OCR preprocessing Very complex merged cells may need tuning Rotated tables require preprocessing Large PDFs may need page-by-page processing Installation pip install camelot-py [ cv ]

Additional dependencies

macOS

brew install ghostscript tcl-tk

Ubuntu

apt-get install ghostscript python3-tk Resources camelot Documentation GitHub Repository Comparison with Other Tools

返回排行榜