Semgrep Static Analysis

When to Use Semgrep

Ideal scenarios:

- Quick security scans (minutes, not hours)
- Pattern-based bug detection
- Enforcing coding standards and best practices
- Finding known vulnerability patterns
- Single-file analysis without complex data flow
- First-pass analysis before deeper tools

Consider CodeQL instead when:

- You need interprocedural taint tracking across files
- Complex data flow analysis is required
- You are analyzing custom proprietary frameworks

When NOT to Use

Do NOT use this skill for:

- Complex interprocedural data flow analysis (use CodeQL instead)
- Binary analysis or compiled code without source
- Custom deep semantic analysis requiring AST/CFG traversal
- Tracking taint across many function boundaries

Installation

pip

python3 -m pip install semgrep

Homebrew

brew install semgrep

Docker

docker run --rm -v "${PWD}:/src" returntocorp/semgrep semgrep --config auto /src

Update

pip install --upgrade semgrep

Core Workflow

1. Quick Scan

semgrep --config auto .                      # Auto-select rules for the project
semgrep --config p/default --metrics=off .   # Disable telemetry for proprietary code (auto config requires metrics, so name a ruleset)

2. Use Rulesets

semgrep --config p/ .                                        # Single ruleset
semgrep --config p/security-audit --config p/trailofbits .   # Multiple

| Ruleset | Description |
|---|---|
| `p/default` | General security and code quality |
| `p/security-audit` | Comprehensive security rules |
| `p/owasp-top-ten` | OWASP Top 10 vulnerabilities |
| `p/cwe-top-25` | CWE Top 25 vulnerabilities |
| `p/r2c-security-audit` | r2c security audit rules |
| `p/trailofbits` | Trail of Bits security rules |
| `p/python` | Python-specific |
| `p/javascript` | JavaScript-specific |
| `p/golang` | Go-specific |

3. Output Formats

semgrep --config p/security-audit --sarif -o results.sarif .   # SARIF
semgrep --config p/security-audit --json -o results.json .     # JSON
semgrep --config p/security-audit --dataflow-traces .          # Show data flow
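A JSON report is easier to triage once it is grouped by severity and rule. The sketch below is a minimal example, assuming the report has a top-level `results` list whose entries carry `check_id` and `extra.severity` (the layout Semgrep's `--json` output is understood to use; verify against your version). The function name `summarize_findings` is illustrative, not part of Semgrep.

```python
from collections import Counter


def summarize_findings(report: dict) -> Counter:
    """Count findings per (severity, rule id) in a parsed Semgrep JSON report.

    Assumes each entry in report["results"] has "check_id" and
    "extra" -> "severity"; missing fields fall back to placeholders.
    """
    counts = Counter()
    for finding in report.get("results", []):
        severity = finding.get("extra", {}).get("severity", "UNKNOWN")
        rule_id = finding.get("check_id", "unknown-rule")
        counts[(severity, rule_id)] += 1
    return counts


# Hypothetical, hand-written report fragment used only for illustration.
sample_report = {
    "results": [
        {"check_id": "demo.sql-injection", "path": "app.py",
         "start": {"line": 10}, "extra": {"severity": "ERROR"}},
        {"check_id": "demo.sql-injection", "path": "db.py",
         "start": {"line": 42}, "extra": {"severity": "ERROR"}},
        {"check_id": "demo.weak-hash", "path": "auth.py",
         "start": {"line": 7}, "extra": {"severity": "WARNING"}},
    ]
}

for (severity, rule_id), count in sorted(summarize_findings(sample_report).items()):
    print(f"{severity:8} {count:3} {rule_id}")
```

To use it on a real scan, load `results.json` with `json.load` and pass the resulting dict to `summarize_findings`.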

4. Scan Specific Paths

semgrep --config p/python app.py                  # Single file
semgrep --config p/javascript src/                # Directory
semgrep --config auto --include='**/test/**' .    # Include tests (excluded by default)

Writing Custom Rules

Basic Structure

rules:
  - id: hardcoded-password
    languages: [python]
    message: "Hardcoded password detected: $PASSWORD"
    severity: ERROR
    pattern: password = "$PASSWORD"

Pattern Syntax

| Syntax | Description | Example |
|---|---|---|
| `...` | Match anything | `func(...)` |
| `$VAR` | Capture metavariable | `$FUNC($INPUT)` |
| `<... ...>` | Deep expression match | `<... user_input ...>` |

Pattern Operators

| Operator | Description |
|---|---|
| `pattern` | Match exact pattern |
| `patterns` | All must match (AND) |
| `pattern-either` | Any matches (OR) |
| `pattern-not` | Exclude matches |
| `pattern-inside` | Match only inside context |
| `pattern-not-inside` | Match only outside context |
| `pattern-regex` | Regex matching |
| `metavariable-regex` | Regex on captured value |
| `metavariable-comparison` | Compare values |

Combining Patterns

rules:
  - id: sql-injection
    languages: [python]
    message: "Potential SQL injection"
    severity: ERROR
    patterns:
      - pattern-either:
          - pattern: cursor.execute($QUERY)
          - pattern: db.execute($QUERY)
      - pattern-not: cursor.execute("...", (...))
      - metavariable-regex:
          metavariable: $QUERY
          regex: '.*\+.*|.*\.format\(.*|.*%.*'
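Before putting a `metavariable-regex` into a rule, it can help to try it against sample expressions. The sketch below uses Python's `re` module and assumes the regex for `$QUERY` is meant to flag `+` concatenation, `.format(` calls, and `%` formatting. Semgrep's regex engine is not Python's `re`, so treat this as an approximation; `re.match` is used because Semgrep's documentation describes `metavariable-regex` as left-anchored. The helper name `looks_dynamic` is illustrative.

```python
import re

# Assumed intent of the rule's regex: string concatenation with "+",
# a ".format(" call, or "%" formatting anywhere in the expression.
QUERY_REGEX = re.compile(r".*\+.*|.*\.format\(.*|.*%.*")


def looks_dynamic(query_expr: str) -> bool:
    """Return True if the expression text matches the regex from its start."""
    return QUERY_REGEX.match(query_expr) is not None


samples = [
    '"SELECT * FROM users WHERE id = " + user_id',
    '"SELECT * FROM users WHERE id = {}".format(user_id)',
    '"SELECT * FROM users WHERE id = %s" % user_id',
    '"SELECT * FROM users"',
    "query",
]

for expr in samples:
    print(looks_dynamic(expr), expr)
```

The last sample is worth noting: a bare variable such as `query` does not match, so a query string built on an earlier line slips past this kind of textual check.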

Taint Mode (Data Flow)

Simple pattern matching finds obvious cases:

Pattern `os.system($CMD)` catches this:

os.system(user_input) # Found

But the same pattern misses indirect flows:

cmd = user_input
processed = cmd.strip()
os.system(processed)  # Missed - no direct match

Taint mode tracks data through assignments and transformations:

- Source: where untrusted data enters (`user_input`)
- Propagators: how it flows (`cmd = ...`, `processed = ...`)
- Sanitizers: what makes it safe (`shlex.quote()`)
- Sink: where it becomes dangerous (`os.system()`)

rules:
  - id: command-injection
    languages: [python]
    message: "User input flows to command execution"
    severity: ERROR
    mode: taint
    pattern-sources:
      - pattern: request.args.get(...)
      - pattern: request.form[...]
      - pattern: request.json
    pattern-sinks:
      - pattern: os.system($SINK)
      - pattern: subprocess.call($SINK, shell=True)
      - pattern: subprocess.run($SINK, shell=True, ...)
    pattern-sanitizers:
      - pattern: shlex.quote(...)
      - pattern: int(...)
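The source, propagator, sanitizer, and sink roles can be illustrated with a toy tracker over a straight-line list of statements. This is only a conceptual sketch and not how Semgrep implements taint mode; the `Stmt` type, the `find_tainted_sinks` function, and the three name sets are hypothetical and exist only for this example.

```python
from typing import Iterable, List, NamedTuple, Optional, Set, Tuple


class Stmt(NamedTuple):
    target: Optional[str]   # variable assigned to, or None for a bare call
    func: Optional[str]     # function called, or None for a plain copy
    args: Tuple[str, ...]   # variable names the statement reads


# Hypothetical role assignments mirroring the rule above.
SOURCES = {"request.args.get"}
SANITIZERS = {"shlex.quote", "int"}
SINKS = {"os.system"}


def find_tainted_sinks(program: Iterable[Stmt]) -> List[int]:
    """Return indexes of sink calls that read a tainted variable."""
    tainted: Set[str] = set()
    hits: List[int] = []
    for index, stmt in enumerate(program):
        reads_taint = any(arg in tainted for arg in stmt.args)
        if stmt.func in SINKS and reads_taint:
            hits.append(index)
        if stmt.target is None:
            continue
        if stmt.func in SOURCES:
            tainted.add(stmt.target)          # untrusted data enters
        elif stmt.func in SANITIZERS or not reads_taint:
            tainted.discard(stmt.target)      # sanitized or clean value
        else:
            tainted.add(stmt.target)          # taint propagates
    return hits


vulnerable = [
    Stmt("user_input", "request.args.get", ()),
    Stmt("cmd", None, ("user_input",)),            # cmd = user_input
    Stmt("processed", "str.strip", ("cmd",)),      # processed = cmd.strip()
    Stmt(None, "os.system", ("processed",)),       # sink reached
]
sanitized = vulnerable[:2] + [
    Stmt("processed", "shlex.quote", ("cmd",)),    # sanitizer clears taint
    Stmt(None, "os.system", ("processed",)),
]

print(find_tainted_sinks(vulnerable))  # [3]
print(find_tainted_sinks(sanitized))   # []
```

The tracker flags the indirect flow that a plain `os.system($CMD)` pattern would only catch when the tainted name appears directly in the call, and it stays quiet once a sanitizer sits on the path.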

Full Rule with Metadata

rules:
  - id: flask-sql-injection
    languages: [python]
    message: "SQL injection: user input flows to query without parameterization"
    severity: ERROR
    metadata:
      cwe: "CWE-89: SQL Injection"
      owasp: "A03:2021 - Injection"
      confidence: HIGH
    mode: taint
    pattern-sources:
      - pattern: request.args.get(...)
      - pattern: request.form[...]
      - pattern: request.json
    pattern-sinks:
      - pattern: cursor.execute($QUERY)
      - pattern: db.execute($QUERY)
    pattern-sanitizers:
      - pattern: int(...)
    fix: cursor.execute($QUERY, (params,))

Testing Rules

Test File Format

test_rule.py

def test_vulnerable():
    user_input = request.args.get("id")
    # ruleid: flask-sql-injection
    cursor.execute("SELECT * FROM users WHERE id = " + user_input)

def test_safe():
    user_input = request.args.get("id")
    # ok: flask-sql-injection
    cursor.execute("SELECT * FROM users WHERE id = ?", (user_input,))

semgrep --test rules/
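When a test fails, it helps to see exactly which lines the annotations point at. The sketch below lists the expectations encoded in a test file, assuming each `# ruleid:` or `# ok:` comment applies to the line directly after it and names a single rule id. The helper `expected_results` is illustrative and is not part of Semgrep.

```python
import re
from typing import List, Tuple

# Matches "# ruleid: <rule-id>" or "# ok: <rule-id>" (single rule id assumed).
ANNOTATION = re.compile(r"#\s*(ruleid|ok):\s*([\w.\-]+)")


def expected_results(source: str) -> List[Tuple[int, str, str]]:
    """Return (line_number, kind, rule_id) for every annotated line.

    The reported line number is the line after the annotation comment.
    """
    expectations = []
    for number, line in enumerate(source.splitlines(), start=1):
        match = ANNOTATION.search(line)
        if match:
            expectations.append((number + 1, match.group(1), match.group(2)))
    return expectations


TEST_FILE = '''\
def test_vulnerable():
    user_input = request.args.get("id")
    # ruleid: flask-sql-injection
    cursor.execute("SELECT * FROM users WHERE id = " + user_input)

def test_safe():
    user_input = request.args.get("id")
    # ok: flask-sql-injection
    cursor.execute("SELECT * FROM users WHERE id = ?", (user_input,))
'''

for line_number, kind, rule_id in expected_results(TEST_FILE):
    print(line_number, kind, rule_id)
```

A `ruleid` entry means the rule must report that line; an `ok` entry means it must not.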

CI/CD Integration (GitHub Actions)

name: Semgrep

on:
  push:
    branches: [main]
  pull_request:
  schedule:
    - cron: '0 0 1 * *'  # Monthly

jobs:
  semgrep:
    runs-on: ubuntu-latest
    container:
      image: returntocorp/semgrep

    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # Required for diff-aware scanning

      - name: Run Semgrep
        run: |
          if [ "${{ github.event_name }}" = "pull_request" ]; then
            semgrep ci --baseline-commit ${{ github.event.pull_request.base.sha }}
          else
            semgrep ci
          fi
        env:
          SEMGREP_RULES: >-
            p/security-audit
            p/owasp-top-ten
            p/trailofbits

Configuration

.semgrepignore

tests/fixtures/
**/testdata/
generated/
vendor/
node_modules/

Suppress False Positives

password = get_from_vault()  # nosemgrep: hardcoded-password
dangerous_but_safe()  # nosemgrep
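Suppressions are easy to add and easy to forget, so it can be worth listing them periodically for review. The sketch below scans source text for `nosemgrep` comments and reports whether each one names a rule or silences everything on the line. The function `list_suppressions` is a hypothetical helper and handles only the single-rule form shown above.

```python
import re
from typing import List, Optional, Tuple

# "nosemgrep" optionally followed by ": <rule-id>" (single rule id assumed).
NOSEMGREP = re.compile(r"nosemgrep(?::\s*([\w.\-]+))?")


def list_suppressions(source: str) -> List[Tuple[int, Optional[str]]]:
    """Return (line_number, rule_id) pairs; rule_id is None for a blanket suppression."""
    found = []
    for number, line in enumerate(source.splitlines(), start=1):
        match = NOSEMGREP.search(line)
        if match:
            found.append((number, match.group(1)))
    return found


SAMPLE = '''\
password = get_from_vault()  # nosemgrep: hardcoded-password
dangerous_but_safe()  # nosemgrep
normal_call()
'''

for line_number, rule_id in list_suppressions(SAMPLE):
    scope = rule_id if rule_id else "ALL RULES"
    print(f"line {line_number}: suppresses {scope}")
```

Blanket suppressions (those reported with `None`) deserve the closest look, since they hide findings from every rule on that line.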

Performance

semgrep --config rules/ --time .   # Check rule performance
ulimit -n 4096                     # Increase file descriptors for large codebases

Path Filtering in Rules

rules:
  - id: my-rule
    paths:
      include: [src/]
      exclude: [src/generated/]

Third-Party Rules

pip install semgrep-rules-manager
semgrep-rules-manager --dir ~/semgrep-rules download
semgrep -f ~/semgrep-rules .

Rationalizations to Reject

| Shortcut | Why It's Wrong |
|---|---|
| "Semgrep found nothing, code is clean" | Semgrep is pattern-based; it can't track complex data flow across functions |
| "I wrote a rule, so we're covered" | Rules need testing with `semgrep --test`; false negatives are silent |
| "Taint mode catches injection" | Only if you defined all sources, sinks, AND sanitizers correctly |
| "Pro rules are comprehensive" | Pro rules are good but not exhaustive; supplement with custom rules for your codebase |
| "Too many findings = noisy tool" | High finding count often means real problems; tune rules, don't disable them |

Resources

- Registry: https://semgrep.dev/explore
- Playground: https://semgrep.dev/playground
- Docs: https://semgrep.dev/docs/
- Trail of Bits Rules: https://github.com/trailofbits/semgrep-rules
- Blog: https://semgrep.dev/blog/
