- ELF Binary Data Extraction
- This skill provides guidance for tasks involving extraction of data from ELF binary files, including reading headers, parsing segments, and converting binary content to structured output formats.
- Approach Overview
- ELF extraction tasks typically require:
- Parsing the ELF header to understand file structure
- Reading program headers to identify LOAD segments
- Extracting data from segments at correct virtual addresses
- Converting binary data to the required output format
- Implementation Steps
- Step 1: Validate ELF Header
- Before processing, verify the file is a valid ELF binary:
- Check magic bytes at offset 0: 0x7F 'E' 'L' 'F' (hex: 7f 45 4c 46)
- Identify ELF class at offset 4 (1 = 32-bit, 2 = 64-bit)
- Identify endianness at offset 5 (1 = little-endian, 2 = big-endian)
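The checks in Step 1 can be sketched as a small Python helper. This is a minimal sketch, not a definitive implementation: the function name `read_elf_ident` and the shape of the returned dictionary are illustrative assumptions.

```python
def read_elf_ident(data: bytes) -> dict:
    """Validate the first six identification bytes of an ELF file.

    Hypothetical helper: the name and return shape are assumptions.
    """
    # Bytes 0-3 must be 0x7F 'E' 'L' 'F'.
    if len(data) < 6 or data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file: bad magic bytes")
    ei_class, ei_data = data[4], data[5]
    # Byte 4: 1 = 32-bit, 2 = 64-bit.
    if ei_class not in (1, 2):
        raise ValueError(f"unknown ELF class: {ei_class}")
    # Byte 5: 1 = little-endian, 2 = big-endian.
    if ei_data not in (1, 2):
        raise ValueError(f"unknown data encoding: {ei_data}")
    return {
        "bits": 32 if ei_class == 1 else 64,
        "byteorder": "little" if ei_data == 1 else "big",
    }
```

Rejecting unknown class or encoding values early avoids silently parsing the remaining fields with the wrong sizes.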
- Step 2: Parse ELF Header Fields
- Extract key header fields based on ELF class:
- For 32-bit ELF:
- Program header offset: bytes 28-31
- Program header entry size: bytes 42-43
- Number of program headers: bytes 44-45
- For 64-bit ELF:
- Program header offset: bytes 32-39
- Program header entry size: bytes 54-55
- Number of program headers: bytes 56-57
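One way to read the fields listed in Step 2 is with Python's `struct` module, using the byte offsets given above. The function name `parse_ph_table_info` and the returned keys are hypothetical and chosen only for illustration.

```python
import struct


def parse_ph_table_info(data: bytes) -> dict:
    """Read program header table location, entry size, and entry count.

    Hypothetical helper: assumes `data` holds at least the full ELF header.
    """
    is_64 = data[4] == 2
    end = "<" if data[5] == 1 else ">"  # byte order prefix for struct
    if is_64:
        (phoff,) = struct.unpack_from(end + "Q", data, 32)       # bytes 32-39
        phentsize, phnum = struct.unpack_from(end + "HH", data, 54)  # bytes 54-57
    else:
        (phoff,) = struct.unpack_from(end + "I", data, 28)       # bytes 28-31
        phentsize, phnum = struct.unpack_from(end + "HH", data, 42)  # bytes 42-45
    return {"phoff": phoff, "phentsize": phentsize, "phnum": phnum}
```

All three fields are read with unsigned format codes (`I`, `Q`, `H`), since offsets, sizes, and counts are never negative.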
- Step 3: Process Program Headers
- Iterate through program headers and identify LOAD segments (type = 1):
- Extract virtual address (p_vaddr)
- Extract file offset (p_offset)
- Extract file size (p_filesz)
- Extract memory size (p_memsz)
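The loop in Step 3 might look like the sketch below. The field order inside each program header entry is not spelled out in this guide; the layouts used here follow the standard `Elf32_Phdr` and `Elf64_Phdr` definitions, where the 64-bit form moves `p_flags` to directly after `p_type`. The function name `load_segments` and the dictionary keys are illustrative assumptions.

```python
import struct

PT_LOAD = 1  # program header type for loadable segments


def load_segments(data, phoff, phentsize, phnum, is_64, little_endian=True):
    """Return the LOAD segments described by the program header table.

    Hypothetical helper: parameter names and result keys are assumptions.
    """
    end = "<" if little_endian else ">"
    segments = []
    for i in range(phnum):
        base = phoff + i * phentsize
        if is_64:
            # Elf64_Phdr: type, flags, offset, vaddr, paddr, filesz, memsz, align
            (p_type, _flags, p_offset, p_vaddr, _paddr,
             p_filesz, p_memsz, _align) = struct.unpack_from(end + "IIQQQQQQ", data, base)
        else:
            # Elf32_Phdr: type, offset, vaddr, paddr, filesz, memsz, flags, align
            (p_type, p_offset, p_vaddr, _paddr,
             p_filesz, p_memsz, _flags, _align) = struct.unpack_from(end + "IIIIIIII", data, base)
        if p_type == PT_LOAD:
            segments.append({"vaddr": p_vaddr, "offset": p_offset,
                             "filesz": p_filesz, "memsz": p_memsz})
    return segments
```

Stepping by `phentsize` rather than a hard-coded entry size keeps the loop correct even if the file declares a larger entry than the standard structure.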
- Step 4: Extract Segment Data
- For each LOAD segment:
- Read data from file at p_offset
- Map data to virtual addresses starting at p_vaddr
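- Handle alignment and padding as the task specifies; when p_memsz exceeds p_filesz, the extra bytes are zero-filled in memory and have no data in the file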
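As a sketch of Step 4, the helper below maps each word in a segment to its virtual address. It assumes the task wants 4-byte unsigned words and that a trailing partial word should be skipped; both are assumptions to adjust per task. The name `segment_words` and the `seg` dictionary keys are hypothetical.

```python
def segment_words(data, seg, word_size=4, byteorder="little"):
    """Map virtual addresses to unsigned integers read from one segment.

    Hypothetical helper: `seg` is assumed to be a dict with
    "vaddr", "offset", and "filesz" keys.
    """
    raw = data[seg["offset"]:seg["offset"] + seg["filesz"]]
    words = {}
    # Stop before a trailing partial word instead of reading past the segment.
    for i in range(0, len(raw) - word_size + 1, word_size):
        # int.from_bytes is unsigned unless signed=True is passed.
        words[seg["vaddr"] + i] = int.from_bytes(raw[i:i + word_size], byteorder)
    return words


# Usage: 8 bytes of padding, two full words, then 2 leftover bytes.
data = b"\x00" * 8 + bytes.fromhex("41424344" "f30f1efa" "aabb")
seg = {"vaddr": 0x1000, "offset": 8, "filesz": 10}
words = segment_words(data, seg)
```

Only the bytes backed by the file (`filesz`) are read here; whether zero-filled memory beyond that should appear in the output depends on the task.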
- Critical Data Type Considerations
- Signed vs Unsigned Integers
- This is one of the most common sources of errors in binary extraction tasks.
- When reading multi-byte integer values from binary data:
- Memory addresses are always unsigned
- Size fields are always unsigned
- Data values should typically be read as unsigned unless the task explicitly requires signed interpretation
- Common API distinctions:
- Node.js Buffer: readUInt32LE vs readInt32LE
- Python struct: 'I' (unsigned) vs 'i' (signed)
- C/C++: uint32_t vs int32_t
- Verification: If output contains negative numbers but the expected output shows only positive integers, the wrong signedness was used.
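To make the signedness difference concrete, the snippet below decodes the same four bytes with both Python `struct` format codes. The byte sequence is an illustrative choice whose high bit is set when read as little-endian.

```python
import struct

raw = bytes.fromhex("f30f1efa")  # little-endian bytes of 0xFA1E0FF3

(unsigned_value,) = struct.unpack("<I", raw)  # 'I' = unsigned 32-bit
(signed_value,) = struct.unpack("<i", raw)    # 'i' = signed 32-bit

print(unsigned_value)  # 4196274163
print(signed_value)    # -98693133

# A signed 32-bit result can be converted back by masking to 32 bits.
recovered = signed_value & 0xFFFFFFFF
```

The two results differ by exactly 2^32, which is the signature of a signed read applied to unsigned data.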
- Endianness
- Match the endianness specified in the ELF header:
- Little-endian (most common on x86/x64): Use LE variants
- Big-endian: Use BE variants
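In Python, the equivalent of choosing an LE or BE variant is the byte order prefix in the `struct` format string. The sketch below picks that prefix from ELF header byte 5; the function name `read_u32` is a hypothetical helper.

```python
import struct


def read_u32(data: bytes, offset: int, ei_data: int) -> int:
    """Read an unsigned 32-bit value using the ELF header's byte order.

    ei_data is header byte 5: 1 = little-endian, 2 = big-endian.
    """
    prefix = "<" if ei_data == 1 else ">"
    return struct.unpack_from(prefix + "I", data, offset)[0]


sample = bytes.fromhex("41424344")
little = read_u32(sample, 0, 1)  # 0x44434241
big = read_u32(sample, 0, 2)     # 0x41424344
```

The same four bytes give 1145258561 as little-endian and 1094861636 as big-endian, so a wrong byte order produces plausible-looking but incorrect values rather than an obvious failure.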
- Integer Sizes
- ELF fields vary by class:
- 32-bit ELF: addresses and offsets are 4 bytes
- 64-bit ELF: addresses and offsets are 8 bytes
- Verification Strategies
- Before Declaring Success
- Validate output format: Ensure JSON is well-formed and keys are the correct types
- Check address ranges: Verify addresses fall within expected segment boundaries
- Sample value verification: Manually compute expected values for a few addresses using hex inspection tools
- Manual Verification Commands
- Use these tools to verify extracted values:
- View ELF header information: readelf -h <binary>
- View program headers (segments): readelf -l <binary>
- Dump section contents in hex: objdump -s <binary>
- View raw hex bytes at a specific offset: xxd -s <offset> -l <length> <binary>
- Calculate the expected value from hex bytes (little-endian example): for bytes 41 42 43 44, value = 0x44434241 = 1145258561
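When checking a sample value with xxd, the virtual address first has to be translated to a file offset. A minimal sketch of that translation follows, assuming each segment is a dict with hypothetical "vaddr", "offset", and "filesz" keys; the function name `vaddr_to_file_offset` is also an assumption.

```python
def vaddr_to_file_offset(vaddr, segments):
    """Translate a virtual address to a file offset, or None if unbacked.

    Hypothetical helper: only addresses inside the file-backed part of a
    LOAD segment (the first filesz bytes) have a file offset.
    """
    for seg in segments:
        if seg["vaddr"] <= vaddr < seg["vaddr"] + seg["filesz"]:
            return seg["offset"] + (vaddr - seg["vaddr"])
    return None


segments = [
    {"vaddr": 0x400000, "offset": 0x0, "filesz": 0x100},
    {"vaddr": 0x601000, "offset": 0x1000, "filesz": 0x20},
]
offset = vaddr_to_file_offset(0x601008, segments)
```

The resulting offset can be passed to xxd as the -s argument to inspect the raw bytes behind a given address.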
- Value Sanity Checks
- If the example output shows only positive integers, verify output contains no negative values
- Compare a few computed values against manual hex calculation
- Verify address coverage matches expected segment ranges
- Common Pitfalls
- Using signed integer reads for unsigned data: Results in negative numbers for values with high bit set (e.g., -98693133 instead of 4196274163)
- Incorrect endianness handling: Produces completely wrong values; verify against ELF header byte 5
- Off-by-one errors in segment boundaries: Carefully track whether sizes are inclusive/exclusive
- Assuming 4-byte alignment: Check if segment sizes are multiples of the read size; handle partial reads at boundaries
- Mixing 32-bit and 64-bit field sizes: Always check ELF class and use appropriate field sizes
- Overconfidence without verification: Never assume "values are read directly from binary, so they should match"; always verify sample values manually
- Output Format Considerations
- When producing structured output (e.g., JSON):
- Use string keys for addresses if they need to be JSON object keys (JSON requires string keys)
- Ensure integer values are within JavaScript/JSON safe integer range (2^53 - 1 for full precision)
- Consider whether addresses should be decimal or hexadecimal strings based on task requirements
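The output format points can be sketched as follows. This assumes the task wants decimal string keys in ascending address order, which is an assumption to confirm against the expected output; the function name `words_to_json` is hypothetical.

```python
import json

MAX_SAFE_INTEGER = 2**53 - 1  # largest integer JavaScript represents exactly


def words_to_json(words: dict) -> str:
    """Serialize an address-to-value mapping as a JSON object.

    Hypothetical helper: keys become decimal strings (an assumption;
    some tasks expect hexadecimal strings instead).
    """
    for addr, value in words.items():
        if value < 0:
            raise ValueError(f"negative value at {addr}: wrong signedness?")
        if value > MAX_SAFE_INTEGER:
            raise ValueError(f"value at {addr} exceeds the safe integer range")
    # Sort by numeric address before converting keys to strings.
    return json.dumps({str(addr): value for addr, value in sorted(words.items())})


out = words_to_json({4100: 4196274163, 4096: 1145258561})
```

Sorting before stringifying matters because string keys sort lexicographically, which would place "10000" before "4096".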