- Filter JavaScript from HTML
- Overview
- This skill provides guidance for tasks that require removing JavaScript and XSS attack vectors from HTML content while preserving the original formatting exactly. The key challenge is balancing comprehensive security filtering with format preservation.
- Critical Requirements Analysis
- Before implementation, identify and prioritize these requirements:
- Security completeness: all XSS vectors must be removed
- Format preservation: output must be functionally identical to input except for harmful content removal
- Clean content handling: files without XSS content should remain completely unchanged
- These requirements often conflict: comprehensive parsing may alter formatting, while simple string replacement may miss attack vectors.
- Approach Selection
- Option 1: Regex-Based Surgical Removal (Recommended for Format Preservation)
- When the task explicitly requires preserving original formatting, prefer regex-based approaches that surgically remove only the dangerous content.
- Advantages:
  - Preserves whitespace, attribute ordering, and quote styles exactly
  - Does not reconstruct or reformat HTML
  - Output matches input character-for-character except for removed content
- Considerations:
  - Requires careful pattern construction to avoid partial matches
  - Must handle various encodings and obfuscation techniques
  - Test patterns against comprehensive XSS vector lists
- Option 2: HTML Parser-Based Filtering
- Use this approach when format preservation is less critical or when dealing with malformed HTML.
- Considerations:
  - HTML parsers inherently reconstruct output, changing formatting
  - May normalize attribute quotes, whitespace, and tag casing
  - Better for malformed HTML that regex cannot reliably parse
  - If using this approach, verify that clean HTML files remain unchanged
- Comprehensive XSS Vector Checklist
- Before implementing, research and account for ALL of these attack categories:
- 1. Script Execution Tags
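As a minimal sketch of the regex-based surgical removal described under Option 1, the following Python example removes only `<script>` elements and inline `on*` event-handler attributes and leaves every other character untouched. The function name `strip_active_content` and both patterns are illustrative assumptions, not a complete filter: they ignore encodings, obfuscation, `javascript:` URLs, and attribute values that contain `>`, so the other checklist categories would each need their own patterns.

```python
import re

# Illustrative patterns only (assumptions, not an exhaustive XSS filter).

# A whole <script>...</script> element, in any letter case, spanning lines.
SCRIPT_BLOCK = re.compile(
    r"<script\b[^>]*>.*?</script\s*>", re.IGNORECASE | re.DOTALL
)

# An opening tag. Simplification: assumes no '>' inside attribute values.
OPEN_TAG = re.compile(r"<[a-zA-Z][^>]*>")

# An inline event handler such as onclick="..." together with the
# whitespace in front of it, so no stray gap is left in the tag.
EVENT_ATTR = re.compile(
    r"""\s+on[a-z]+\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)""", re.IGNORECASE
)


def _strip_handlers(match: re.Match) -> str:
    # Only text inside a tag is rewritten; text content is never touched.
    return EVENT_ATTR.sub("", match.group(0))


def strip_active_content(html: str) -> str:
    """Remove script elements and on* attributes, preserving all else."""
    html = SCRIPT_BLOCK.sub("", html)
    return OPEN_TAG.sub(_strip_handlers, html)


dirty = (
    "<div  class='a'   onclick=\"alert(1)\">Hi</div>\n"
    "<SCRIPT type=\"text/javascript\">alert(1)</SCRIPT>\n"
    "<p>ok</p>"
)
clean = "<p  CLASS=x>Turn on = off</p>\n"

filtered = strip_active_content(dirty)
unchanged = strip_active_content(clean)
```

Comparing `unchanged` against `clean` is a cheap check for the clean content handling requirement: the odd spacing, quote style, and attribute casing in both inputs should survive exactly.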
- data:text/html;base64,... encoded payloads
- 5. CSS-Based Attacks