Guides configuration and auditing of robots.txt for search engine and AI crawler control.

## When Invoking

On first use, if helpful, open with 1–2 sentences on what this skill covers and why it matters, then provide the main output. On subsequent use, or when the user asks to skip, go directly to the main output.
## Scope (Technical SEO)

| Area | Guidance |
|---|---|
| Robots.txt | Review Disallow/Allow; avoid blocking important pages |
| Crawler access | Ensure crawlers (including AI crawlers) can access key pages |
| Indexing | Misconfigured robots.txt can block indexing; verify no accidental blocks |
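To verify that key pages are not accidentally blocked, Python's standard-library `urllib.robotparser` can evaluate rules against sample URLs. A minimal sketch; the rules and URLs below are illustrative:

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules; in practice, load the live file with
# parser.set_url("https://example.com/robots.txt") and parser.read()
parser = RobotFileParser()
parser.parse([
    "User-agent: *",
    "Disallow: /admin/",
])

# Confirm important pages remain crawlable
print(parser.can_fetch("*", "https://example.com/products/"))   # True
print(parser.can_fetch("*", "https://example.com/admin/users")) # False
```

Note that `urllib.robotparser` applies rules in file order rather than Google's longest-match precedence, so treat it as a sanity check, not an exact simulation of Googlebot.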
## Initial Assessment

Check for product marketing context first: if `.claude/product-marketing-context.md` or `.cursor/product-marketing-context.md` exists, read it for the site URL and indexing goals.

Identify:

- **Site URL**: base domain (e.g., `https://example.com`)
- **Indexing scope**: full site, partial, or specific paths to exclude
- **AI crawler strategy**: allow search/indexing vs. block training-data crawlers
## Best Practices

### Purpose and Limitations

| Point | Note |
|---|---|
| Purpose | Controls crawler access; does NOT prevent indexing (disallowed URLs may still appear in search results without a snippet) |
| No-index | Use a `noindex` meta tag or authentication for sensitive content; robots.txt is publicly readable |
| Indexed vs. non-indexed | Not all content should be indexed. robots.txt and `noindex` complement each other: robots.txt for path-level crawl control, `noindex` for page-level indexing. See indexing |
| Advisory | Rules are advisory; malicious crawlers may ignore them |
## Location and Format

| Item | Requirement |
|---|---|
| Path | Site root: `https://example.com/robots.txt` |
| Encoding | UTF-8 plain text |
| Standard | RFC 9309 (Robots Exclusion Protocol) |
## Core Directives

| Directive | Purpose | Example |
|---|---|---|
| `User-agent:` | Target crawler | `User-agent: Googlebot`, `User-agent: *` |
| `Disallow:` | Block a path prefix | `Disallow: /admin/` |
| `Allow:` | Allow a path (can override Disallow) | `Allow: /public/` |
| `Sitemap:` | Declare the sitemap's absolute URL | `Sitemap: https://example.com/sitemap.xml` |
| `Clean-param:` | Strip query params (Yandex) | See below |
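Put together, a minimal file using these directives might look like this (domain and paths are illustrative):

```txt
# Applies to all crawlers
User-agent: *
Disallow: /admin/
# Allow overrides the broader Disallow for this subtree
Allow: /admin/help/

# Sitemap is declared outside any group, as an absolute URL
Sitemap: https://example.com/sitemap.xml
```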
## Critical: Do Not Block Rendering Resources

- **Do not** block CSS, JS, or images; Google needs them to render pages
- **Only** block paths that don't need crawling: admin, API, temp files
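An illustrative contrast (paths are hypothetical; the "bad" rules are shown commented out):

```txt
User-agent: *
# Bad: blocks assets Google needs to render pages
# Disallow: /assets/
# Disallow: /*.js$

# Good: block only non-content paths
Disallow: /admin/
Disallow: /api/
Disallow: /tmp/
```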
## AI Crawler Strategy

| User-agent | Purpose | Typical setting |
|---|---|---|
| OAI-SearchBot | ChatGPT search | Allow |
| GPTBot | OpenAI training | Disallow |
| Claude-SearchBot | Claude search | Allow |
| ClaudeBot | Anthropic training | Disallow |
| PerplexityBot | Perplexity search | Allow |
| Google-Extended | Gemini training | Disallow |
| CCBot | Common Crawl | Disallow |
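A sketch implementing the table above: search bots allowed, training-data crawlers blocked. Adjust to the site's actual policy; RFC 9309 permits multiple `User-agent` lines per group:

```txt
# Search crawlers: allow
User-agent: OAI-SearchBot
User-agent: Claude-SearchBot
User-agent: PerplexityBot
Allow: /

# Training-data crawlers: block
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Google-Extended
User-agent: CCBot
Disallow: /
```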
## Clean-param (Yandex)

```txt
Clean-param: utm_source&utm_medium&utm_campaign&utm_term&utm_content&ref&fbclid&gclid
```
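Only Yandex interprets `Clean-param`; other crawlers ignore the line. An optional second field restricts the rule to a path prefix (the parameters and path here are illustrative):

```txt
User-agent: Yandex
Clean-param: ref&fbclid /blog/
```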
## Output Format

1. **Current state** (if auditing)
2. **Recommended robots.txt** (full file)
3. **Compliance checklist**

References: Google robots.txt