SQLite Truncated Database Recovery This skill provides systematic approaches for recovering data from SQLite database files that have been corrupted through binary truncation. It emphasizes understanding the SQLite file format before attempting recovery and avoiding common pitfalls that lead to multiple failed iterations. When to Use This Skill This skill applies when: A SQLite database file cannot be opened with standard sqlite3 commands The database error indicates corruption or malformed data File size is smaller than expected (suggesting truncation) Standard recovery tools like .recover command fail Manual binary parsing of SQLite page structure is required Initial Assessment Strategy Before writing any recovery code, perform a thorough analysis of the corrupted file: Step 1: Examine File Characteristics

Check file size and basic properties

ls -lh database.db file database.db

Create hex dump for analysis

hexdump

-C

database.db

|

head

-100

Key observations to make:

File size

SQLite pages are typically 4096 bytes. Check if size aligns with page boundaries

Magic bytes

Valid SQLite files start with "SQLite format 3\000" (16 bytes)

First byte after header

Identifies page type (0x0d = table leaf page with actual data)

Step 2: Identify Corruption Pattern

Common truncation scenarios:

Header-only file

Only the 100-byte header remains

Missing header

File starts with a data page (first byte is 0x0d, 0x05, 0x0a, or 0x02)
Partial page: File ends mid-page, truncating some cells If the file lacks the standard "SQLite format 3" magic header but starts with 0x0d, this indicates the file contains only a table leaf page without the database header. Step 3: Try Standard Tools First Always attempt standard recovery before manual parsing:

Check if sqlite3 can read the file

sqlite3 database.db ".schema" 2

&1 sqlite3 database.db "SELECT * FROM sqlite_master" 2

&1

Try built-in recovery

sqlite3 database.db ".recover"

recovered.sql 2

&1

Try integrity check

sqlite3 database.db "PRAGMA integrity_check;" If these fail with "database disk image is malformed" or similar errors, proceed to manual binary parsing. SQLite Page Structure Overview Understanding the page structure is essential before writing recovery code. Table Leaf Page Layout (Page Type 0x0d) Offset Size Description ------ ---- ----------- 0 1 Page type (0x0d for table leaf) 1 2 First freeblock offset (big-endian) 3 2 Number of cells on page (big-endian) 5 2 Cell content area start offset (big-endian) 7 1 Fragmented free bytes count 8+ varies Cell pointer array (2 bytes per cell, big-endian) ... [Gap/free space] End Cell data (grows backward from page end) Cell Structure Each cell contains a database row: [Payload size: varint] [Row ID: varint] [Header size: varint] [Serial type 1: varint] [Serial type 2: varint] ... [Column 1 value] [Column 2 value] ... Varint Encoding SQLite uses variable-length integers (varints): Bytes 1-8: Use 7 bits for data, high bit (0x80) indicates continuation Byte 9: Uses all 8 bits (no continuation) Serial Types Serial types indicate how to interpret column data: Type Size Meaning 0 0 NULL 1 1 8-bit signed integer 2 2 16-bit big-endian signed integer 3 3 24-bit big-endian signed integer 4 4 32-bit big-endian signed integer 7 8 IEEE 754 64-bit float (big-endian) 8 0 Integer constant 0 9 0 Integer constant 1 N >= 12, even (N-12)/2 BLOB N >= 13, odd (N-13)/2 Text string (UTF-8) Example: Serial type 0x21 (33) = text string of length (33-13)/2 = 10 bytes. Recovery Approach Build a Single, Modular Script Avoid creating multiple separate recovery scripts. Instead, build one script iteratively with clear debug output: import struct import json DEBUG = True def read_varint ( data , offset ) : """Read SQLite variable-length integer.""" value = 0 for i in range ( 9 ) : if offset + i

= len ( data ) : return None , offset byte = data [ offset + i ] if i == 8 : value = ( value << 8 ) | byte return value , offset + i + 1 value = ( value << 7 ) | ( byte & 0x7f ) if ( byte & 0x80 ) == 0 : return value , offset + i + 1 return value , offset def decode_value ( data , offset , serial_type ) : """Decode value based on serial type.""" if serial_type == 0 : return None , offset elif serial_type == 1 : return struct . unpack ( '>b' , data [ offset : offset + 1 ] ) [ 0 ] , offset + 1 elif serial_type == 2 : return struct . unpack ( '>h' , data [ offset : offset + 2 ] ) [ 0 ] , offset + 2 elif serial_type == 4 : return struct . unpack ( '>i' , data [ offset : offset + 4 ] ) [ 0 ] , offset + 4 elif serial_type == 7 : return struct . unpack ( '>d' , data [ offset : offset + 8 ] ) [ 0 ] , offset + 8 elif serial_type == 8 : return 0 , offset elif serial_type == 9 : return 1 , offset elif serial_type = 12 : if serial_type % 2 == 0 : length = ( serial_type - 12 ) // 2 return data [ offset : offset + length ] , offset + length else : length = ( serial_type - 13 ) // 2 return data [ offset : offset + length ] . decode ( 'utf-8' , errors = 'replace' ) , offset + length return None , offset Parse Incrementally with Debug Output Parse one cell completely and verify before processing all cells: def parse_cell ( data , cell_offset , debug = DEBUG ) : """Parse a single cell with detailed debug output.""" if debug : print ( f"\nParsing cell at offset { cell_offset } (0x { cell_offset : 04x } )" )

Read payload size

payload_size , offset = read_varint ( data , cell_offset ) if debug : print ( f" Payload size: { payload_size } " )

Read row ID

row_id , offset = read_varint ( data , offset ) if debug : print ( f" Row ID: { row_id } " )

Read header size

header_size , header_start = read_varint ( data , offset ) if debug : print ( f" Header size: { header_size } " )

Parse serial types

serial_types

[ ] current = header_start header_end = offset + header_size while current < header_end : st , current = read_varint ( data , current ) serial_types . append ( st ) if debug : print ( f" Serial types: { serial_types } " )

Parse values

values

[

]

for

st

in

serial_types

:

val

,

current

=

decode_value

(

data

,

current

,

st

)

values

.

append

(

val

)

if

debug

:

print

(

f" Values:

{

values

}

"

)

return

{

'row_id'

:

row_id

,

'values'

:

values

}

Common Pitfalls and Prevention

Pitfall 1: Not Understanding the Corruption Pattern

Mistake

Assuming the file has a standard SQLite header when it may only contain a data page.

Prevention

Always examine the first few bytes with hexdump. If the file starts with 0x0d instead of "SQLite format 3", the header is missing. Adjust parsing offsets accordingly (no 100-byte header offset needed).

Pitfall 2: Multiple Script Iterations

Mistake

Creating many separate recovery scripts (recover1.py, recover2.py, etc.) based on trial and error.

Prevention

:

Read the hex dump thoroughly first and annotate the structure manually

Build one script with debug flags

Reference the SQLite file format specification before coding

Pitfall 3: Reading Strings Beyond Their Boundaries

Mistake

Reading string data without checking the serial type length, resulting in incorrect strings (e.g., "testword052" instead of "testword05").

Prevention

Always calculate string length from serial type:

length = (serial_type - 13) // 2

. Read exactly that many bytes.

Pitfall 4: Syntax Errors in Generated Code

Mistake

Missing spaces in operators like

if48

instead of

if 48

, or

12and

instead of

12 and

.

Prevention

Validate syntax before running:

python3

-m

py_compile recovery_script.py

Pitfall 5: Wrong Byte Order

Mistake

Reading multi-byte integers with little-endian instead of big-endian.

Prevention

SQLite uses big-endian for all multi-byte integers. Always use

struct.unpack('>...', data)

with the

>

prefix.

Pitfall 6: Not Handling Truncation Gracefully

Mistake

Script crashes when encountering truncated data at end of file.
Prevention: Check bounds before every read operation: def safe_read ( data , offset , length ) : if offset + length

len ( data ) : return None return data [ offset : offset + length ] Verification Strategy Step 1: Validate Cell Count Compare the number of cells reported in the page header (offset 3-4) with actual cells found. Step 2: Validate Data Patterns If expected patterns are known (e.g., words matching "testwordXY"), verify extracted strings match the pattern. Step 3: Check Value Ranges Verify extracted numeric values are within expected ranges. Watch for: Unexpected negative numbers (sign bit interpretation) Very large numbers (byte order issues) NaN or infinity for floats Step 4: Compare with Expected Output Format Before finalizing output, ensure JSON structure matches requirements:

Validate output structure

for record in recovered_data : assert 'word' in record and 'value' in record assert isinstance ( record [ 'word' ] , str ) assert isinstance ( record [ 'value' ] , ( int , float ) ) Output Generation Format recovered data according to the required output specification: def generate_output ( recovered_rows , output_path ) : """Format and save recovered data.""" results = [ ] for row in recovered_rows : if len ( row [ 'values' ] )

= 2 : results . append ( { 'word' : row [ 'values' ] [ 0 ] , 'value' : row [ 'values' ] [ 1 ] } ) with open ( output_path , 'w' ) as f : json . dump ( results , f , indent = 2 ) print ( f"Recovered { len ( results ) } records to { output_path } " ) return results Summary Checklist Before writing recovery code: Examined file with hexdump to understand corruption extent Identified whether header is present or missing Tried standard SQLite tools first Reviewed SQLite file format specification During implementation: Using a single script with debug output (not multiple scripts) Validated Python syntax before running Using big-endian byte order for all multi-byte integers Calculating string lengths from serial types Handling truncation with bounds checking After recovery: Verified cell count matches expectation Validated string patterns if known Checked numeric value ranges Confirmed output format matches requirements

sqlite-db-truncate

安装

Check file size and basic properties

Create hex dump for analysis

Check if sqlite3 can read the file

Try built-in recovery

Try integrity check

Read payload size

Read row ID

Read header size

Parse serial types

serial_types

Parse values

values

Validate output structure