Table Extractor Skill Overview This skill enables precise extraction of tables from PDF documents using camelot - the gold standard for PDF table extraction. Handle complex tables with merged cells, borderless tables, and multi-page layouts with high accuracy. How to Use Provide the PDF containing tables Optionally specify pages or table detection method I'll extract tables as pandas DataFrames Example prompts: "Extract all tables from this PDF" "Get the table on page 5 of this report" "Extract borderless tables from this document" "Convert PDF tables to Excel format" Domain Knowledge camelot Fundamentals import camelot
Extract tables from PDF
tables
camelot . read_pdf ( 'document.pdf' )
Access results
print ( f"Found { len ( tables ) } tables" )
Get first table as DataFrame
df
tables [ 0 ] . df print ( df ) Extraction Methods Method Use Case Description lattice Bordered tables Detects table by lines/borders stream Borderless tables Uses text positioning
Lattice method (default) - for tables with visible borders
tables
camelot . read_pdf ( 'document.pdf' , flavor = 'lattice' )
Stream method - for borderless tables
tables
camelot . read_pdf ( 'document.pdf' , flavor = 'stream' ) Page Selection
Single page
tables
camelot . read_pdf ( 'document.pdf' , pages = '1' )
Multiple pages
tables
camelot . read_pdf ( 'document.pdf' , pages = '1,3,5' )
Page range
tables
camelot . read_pdf ( 'document.pdf' , pages = '1-5' )
All pages
tables
camelot . read_pdf ( 'document.pdf' , pages = 'all' ) Advanced Options Lattice Options tables = camelot . read_pdf ( 'document.pdf' , flavor = 'lattice' , line_scale = 40 ,
Line detection sensitivity
copy_text
[ 'h' , 'v' ] ,
Copy text across merged cells
shift_text
[ 'l' , 't' ] ,
Shift text alignment
split_text
True ,
Split text at newlines
flag_size
True ,
Flag super/subscripts
strip_text
'\n' ,
Characters to strip
process_background
False ,
Process background lines
) Stream Options tables = camelot . read_pdf ( 'document.pdf' , flavor = 'stream' , edge_tol = 500 ,
Edge tolerance
row_tol
10 ,
Row tolerance
column_tol
0 ,
Column tolerance
strip_text
'\n' ,
Characters to strip
) Table Area Specification
Extract from specific area (x1, y1, x2, y2)
Coordinates from bottom-left, in PDF points (72 points = 1 inch)
tables
camelot . read_pdf ( 'document.pdf' , table_areas = [ '72,720,540,400' ] ,
One area
)
Multiple areas
tables
camelot . read_pdf ( 'document.pdf' , table_areas = [ '72,720,540,400' , '72,380,540,200' ] , ) Column Specification
Manually specify column positions (for stream method)
tables
camelot . read_pdf ( 'document.pdf' , flavor = 'stream' , columns = [ '100,200,300,400' ] ,
X positions of column separators
) Working with Results import camelot tables = camelot . read_pdf ( 'document.pdf' ) for i , table in enumerate ( tables ) :
Access DataFrame
df
table . df
Table metadata
print ( f"Table { i + 1 } :" ) print ( f" Page: { table . page } " ) print ( f" Accuracy: { table . accuracy } " ) print ( f" Whitespace: { table . whitespace } " ) print ( f" Order: { table . order } " ) print ( f" Shape: { df . shape } " )
Parsing report
report
table . parsing_report print ( f" Report: { report } " ) Export Options import camelot tables = camelot . read_pdf ( 'document.pdf' )
Export to CSV
tables [ 0 ] . to_csv ( 'table.csv' )
Export to Excel
tables [ 0 ] . to_excel ( 'table.xlsx' )
Export to JSON
tables [ 0 ] . to_json ( 'table.json' )
Export to HTML
tables [ 0 ] . to_html ( 'table.html' )
Export all tables
for i , table in enumerate ( tables ) : table . to_excel ( f'table_ { i + 1 } .xlsx' ) Visual Debugging import camelot
Enable visual debugging
tables
camelot . read_pdf ( 'document.pdf' )
Plot detected table areas
camelot . plot ( tables [ 0 ] , kind = 'contour' ) . show ( )
Plot text on table
camelot . plot ( tables [ 0 ] , kind = 'text' ) . show ( )
Plot detected lines (lattice only)
camelot . plot ( tables [ 0 ] , kind = 'joint' ) . show ( ) camelot . plot ( tables [ 0 ] , kind = 'line' ) . show ( )
Save plot
fig
camelot . plot ( tables [ 0 ] ) fig . savefig ( 'debug.png' ) Handling Multi-page Tables import camelot import pandas as pd def extract_multipage_table ( pdf_path , pages = 'all' ) : """Extract and combine tables that span multiple pages.""" tables = camelot . read_pdf ( pdf_path , pages = pages )
Group tables by similar structure (columns)
table_groups
{ } for table in tables : cols = tuple ( table . df . columns ) if cols not in table_groups : table_groups [ cols ] = [ ] table_groups [ cols ] . append ( table . df )
Combine similar tables
combined
[ ] for cols , dfs in table_groups . items ( ) : if len ( dfs )
1 :
Combine and deduplicate header rows
combined_df
- pd
- .
- concat
- (
- dfs
- ,
- ignore_index
- =
- True
- )
- combined
- .
- append
- (
- combined_df
- )
- else
- :
- combined
- .
- append
- (
- dfs
- [
- 0
- ]
- )
- return
- combined
- Best Practices
- Try Both Methods
-
- Lattice for bordered, stream for borderless
- Check Accuracy Score
-
- Above 90% is usually good
- Use Visual Debugging
-
- Understand extraction results
- Specify Areas
-
- For PDFs with multiple table types
- Handle Headers
- First row often needs special treatment Common Patterns Batch Table Extraction import camelot from pathlib import Path import pandas as pd def batch_extract_tables ( input_dir , output_dir ) : """Extract tables from all PDFs in directory.""" input_path = Path ( input_dir ) output_path = Path ( output_dir ) output_path . mkdir ( exist_ok = True ) results = [ ] for pdf_file in input_path . glob ( '*.pdf' ) : try : tables = camelot . read_pdf ( str ( pdf_file ) , pages = 'all' ) for i , table in enumerate ( tables ) :
Skip low accuracy tables
if table . accuracy < 80 : continue output_file = output_path / f" { pdf_file . stem } table { i + 1 } .xlsx" table . to_excel ( str ( output_file ) ) results . append ( { 'source' : str ( pdf_file ) , 'table' : i + 1 , 'page' : table . page , 'accuracy' : table . accuracy , 'output' : str ( output_file ) } ) except Exception as e : results . append ( { 'source' : str ( pdf_file ) , 'error' : str ( e ) } ) return results Auto-detect Table Method import camelot def smart_extract_tables ( pdf_path , pages = '1' ) : """Try both methods and return best results."""
Try lattice first
lattice_tables
camelot . read_pdf ( pdf_path , pages = pages , flavor = 'lattice' )
Try stream
stream_tables
camelot . read_pdf ( pdf_path , pages = pages , flavor = 'stream' )
Compare and return best
results
[ ] if lattice_tables and lattice_tables [ 0 ] . accuracy
70 : results . extend ( lattice_tables ) elif stream_tables : results . extend ( stream_tables ) return results Examples Example 1: Financial Statement Extraction import camelot import pandas as pd def extract_financial_tables ( pdf_path ) : """Extract financial tables from annual report."""
Extract all tables
tables
camelot . read_pdf ( pdf_path , pages = 'all' , flavor = 'lattice' ) financial_data = { 'income_statement' : None , 'balance_sheet' : None , 'cash_flow' : None , 'other_tables' : [ ] } for table in tables : df = table . df text = df . to_string ( ) . lower ( )
Identify table type
if 'revenue' in text or 'sales' in text : if 'operating income' in text or 'net income' in text : financial_data [ 'income_statement' ] = df elif 'asset' in text and 'liabilities' in text : financial_data [ 'balance_sheet' ] = df elif 'cash flow' in text or 'operating activities' in text : financial_data [ 'cash_flow' ] = df else : financial_data [ 'other_tables' ] . append ( { 'page' : table . page , 'data' : df , 'accuracy' : table . accuracy } ) return financial_data financials = extract_financial_tables ( 'annual_report.pdf' ) if financials [ 'income_statement' ] is not None : print ( "Income Statement found:" ) print ( financials [ 'income_statement' ] ) Example 2: Scientific Data Extraction import camelot import pandas as pd def extract_research_data ( pdf_path , pages = 'all' ) : """Extract data tables from research paper."""
Try lattice for bordered tables
tables
camelot . read_pdf ( pdf_path , pages = pages , flavor = 'lattice' ) if not tables or all ( t . accuracy < 70 for t in tables ) :
Fall back to stream for borderless
tables
camelot . read_pdf ( pdf_path , pages = pages , flavor = 'stream' ) extracted_data = [ ] for table in tables : df = table . df
Clean up the DataFrame
Set first row as header if it looks like one
if not df . iloc [ 0 ] . str . contains ( r'\d' ) . any ( ) : df . columns = df . iloc [ 0 ] df = df [ 1 : ] df = df . reset_index ( drop = True ) extracted_data . append ( { 'page' : table . page , 'accuracy' : table . accuracy , 'data' : df } ) return extracted_data data = extract_research_data ( 'research_paper.pdf' ) for i , item in enumerate ( data ) : print ( f"Table { i + 1 } (Page { item [ 'page' ] } , Accuracy: { item [ 'accuracy' ] } %):" ) print ( item [ 'data' ] . head ( ) ) Example 3: Invoice Line Items import camelot def extract_invoice_items ( pdf_path ) : """Extract line items from invoice."""
Usually invoices have bordered tables
tables
camelot . read_pdf ( pdf_path , flavor = 'lattice' ) line_items = [ ] for table in tables : df = table . df
Look for table with typical invoice columns
header_text
' ' . join ( df . iloc [ 0 ] . astype ( str ) ) . lower ( ) if any ( term in header_text for term in [ 'quantity' , 'qty' , 'amount' , 'price' , 'description' ] ) :
This looks like a line items table
df . columns = df . iloc [ 0 ] df = df [ 1 : ] for _ , row in df . iterrows ( ) : item = { } for col in df . columns : col_lower = str ( col ) . lower ( ) value = row [ col ] if 'desc' in col_lower or 'item' in col_lower : item [ 'description' ] = value elif 'qty' in col_lower or 'quantity' in col_lower : item [ 'quantity' ] = value elif 'price' in col_lower or 'rate' in col_lower : item [ 'unit_price' ] = value elif 'amount' in col_lower or 'total' in col_lower : item [ 'amount' ] = value if item : line_items . append ( item ) return line_items items = extract_invoice_items ( 'invoice.pdf' ) for item in items : print ( item ) Example 4: Table Comparison import camelot import pandas as pd def compare_pdf_tables ( pdf1_path , pdf2_path ) : """Compare tables between two PDF versions.""" tables1 = camelot . read_pdf ( pdf1_path ) tables2 = camelot . read_pdf ( pdf2_path ) comparisons = [ ]
Match tables by shape and position
for t1 in tables1 : best_match = None best_score = 0 for t2 in tables2 : if t1 . df . shape == t2 . df . shape :
Calculate similarity
try : similarity = ( t1 . df == t2 . df ) . mean ( ) . mean ( ) if similarity
best_score : best_score = similarity best_match = t2 except : pass if best_match : comparisons . append ( { 'page1' : t1 . page , 'page2' : best_match . page , 'similarity' : best_score , 'identical' : best_score == 1.0 , 'diff' : pd . DataFrame ( t1 . df != best_match . df ) } ) return comparisons comparison = compare_pdf_tables ( 'report_v1.pdf' , 'report_v2.pdf' ) Limitations Encrypted PDFs not supported Image-based PDFs need OCR preprocessing Very complex merged cells may need tuning Rotated tables require preprocessing Large PDFs may need page-by-page processing Installation pip install camelot-py [ cv ]
Additional dependencies
macOS
brew install ghostscript tcl-tk
Ubuntu
apt-get install ghostscript python3-tk Resources camelot Documentation GitHub Repository Comparison with Other Tools