Is your site visible to AI crawlers? The 5-minute audit
A 2023 copy-pasted robots.txt block silently deletes sites from ChatGPT and Perplexity answers in 2026. The five-minute audit, the 12 bots that matter, and the allow-search, block-training strategy.
In 2023, half the internet pasted the same "block all AI bots" robots.txt snippet during the great scraping panic. Reasonable at the time. But the bots in that snippet now decide whether your site appears in ChatGPT and Perplexity answers, and a growing slice of buyers reads those answers instead of clicking ten blue links. Plenty of site owners deleted themselves from that channel and never noticed.
The audit takes five minutes. Here is the whole thing.
Step one: read your own robots.txt
Open yourdomain.com/robots.txt in a browser. Three outcomes are possible.
It does not exist, and the request returns a 404. That means every compliant crawler assumes full access. For most product and content sites this is correct, and your audit is nearly done.
It exists and never mentions an AI bot. Then AI crawlers fall under your wildcard rules, whatever applies to User-agent: *. Usually fine, worth confirming.
It exists and contains a wall of User-agent: GPTBot, Disallow: / blocks. This is where the 2023 snippet lives, and where the audit earns its five minutes, because the question is whether you still mean it.
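The three outcomes can be told apart mechanically. The following is a minimal sketch, not a finished tool: the function names `fetch_robots` and `classify_robots` and the outcome labels are made up for illustration, and the token set is simply the twelve bots named in this article. It only detects whether an AI bot is mentioned; it does not evaluate the rules.

```python
import urllib.error
import urllib.request

# The twelve user-agent tokens named in this article, lowercased for matching.
AI_TOKENS = {
    "gptbot", "claudebot", "ccbot", "bytespider", "meta-externalagent",
    "applebot-extended", "google-extended", "oai-searchbot", "chatgpt-user",
    "perplexitybot", "perplexity-user", "claude-user",
}


def fetch_robots(domain):
    """Return (status, body) for https://<domain>/robots.txt."""
    url = f"https://{domain}/robots.txt"
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        return err.code, ""  # e.g. 404 when the file does not exist


def classify_robots(status, body):
    """Map a status code and file body to one of the three outcomes."""
    if status == 404:
        return "missing"  # compliant crawlers assume full access
    if status != 200:
        return "unclear"  # other error statuses are outside this sketch
    named = set()
    for raw in body.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if line.lower().startswith("user-agent:"):
            named.add(line.split(":", 1)[1].strip().lower())
    return "names-ai-bots" if named & AI_TOKENS else "wildcard-only"
```

Calling `classify_robots(*fetch_robots("example.com"))` would run the check against a live domain; `example.com` is a placeholder. Some servers reject the default Python user agent, so a failed fetch here says nothing about how real crawlers are treated.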
Evaluating the rules by hand gets fiddly: group selection, longest-match precedence, Allow-beats-Disallow ties. The AI crawler checker does the evaluation for any domain and shows the exact rule behind each verdict, so you can confirm every result against your own file.
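To make those three rules concrete, here is a simplified evaluator. It is a sketch under stated assumptions: it handles plain path prefixes only (no `*` or `$` wildcards), the function names and the `SAMPLE` file are hypothetical, and it is not the implementation behind any particular checker. As far as I know, Python's built-in `urllib.robotparser` applies rules in file order rather than by longest match, which is why the matching is written out by hand here.

```python
# Hypothetical robots.txt used only to exercise the evaluator.
SAMPLE = """
User-agent: *
Disallow: /private/

User-agent: GPTBot
Disallow: /
Allow: /blog/
"""


def parse_groups(text):
    """Return a list of (agents, rules) groups; rules are (kind, path) pairs."""
    groups, agents, rules = [], [], []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if rules:  # a user-agent line after rules starts a new group
                groups.append((agents, rules))
                agents, rules = [], []
            agents.append(value.lower())
        elif field in ("allow", "disallow") and agents:
            rules.append((field, value))
    if agents:
        groups.append((agents, rules))
    return groups


def rules_for(groups, bot):
    """Group selection: a group naming the bot wins; otherwise use '*'."""
    bot = bot.lower()
    named = [rules for agents, rules in groups if bot in agents]
    chosen = named or [rules for agents, rules in groups if "*" in agents]
    return [rule for rules in chosen for rule in rules]


def evaluate(text, bot, path):
    """Return (allowed, winning_rule); winning_rule is None if nothing matched."""
    best = None
    for kind, pattern in rules_for(parse_groups(text), bot):
        if not pattern or not path.startswith(pattern):
            continue  # an empty pattern matches nothing
        longer = best is None or len(pattern) > len(best[1])
        tie_for_allow = (best is not None and len(pattern) == len(best[1])
                         and kind == "allow")
        if longer or tie_for_allow:
            best = (kind, pattern)
    return (best is None or best[0] == "allow"), best
```

With `SAMPLE`, GPTBot is judged only by its own group, so `/private/` never enters into it: `/blog/post` is allowed because `Allow: /blog/` is a longer match than `Disallow: /`, while `/pricing` is blocked. A bot with no group of its own, such as PerplexityBot here, falls back to the wildcard group.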
Step two: know which of the 12 bots does what
The crawler list breaks into two camps, and the entire strategy lives in the difference.
Training bots collect text to train future models: GPTBot (OpenAI), ClaudeBot (Anthropic), CCBot (Common Crawl, whose corpus many models train on), Bytespider (ByteDance), meta-externalagent (Meta), Applebot-Extended (Apple), and Google-Extended (Gemini training only, with zero effect on Search rankings). Blocking these keeps your words out of future models. It does not hide you from AI search.
Search and fetch bots retrieve pages to answer live questions, with citations and the referral traffic those citations carry: OAI-SearchBot and ChatGPT-User (OpenAI), PerplexityBot and Perplexity-User, Claude-User (Anthropic). Blocking these removes you from AI answers. That is the lever that costs traffic.
The blanket 2023 snippet blocked both camps at once, which is exactly why it needs revisiting. The decision was never one decision.
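The two camps fit in a pair of lookup tables, which makes the cost of a given block list easy to read off. This is an illustrative sketch: the bot names and owners come from the lists above, while `camp_report` and its output keys are hypothetical names.

```python
# Training bots: blocking these keeps content out of future training sets.
TRAINING_BOTS = {
    "GPTBot": "OpenAI",
    "ClaudeBot": "Anthropic",
    "CCBot": "Common Crawl",
    "Bytespider": "ByteDance",
    "meta-externalagent": "Meta",
    "Applebot-Extended": "Apple",
    "Google-Extended": "Google",
}

# Search and fetch bots: blocking these removes the site from AI answers.
SEARCH_BOTS = {
    "OAI-SearchBot": "OpenAI",
    "ChatGPT-User": "OpenAI",
    "PerplexityBot": "Perplexity",
    "Perplexity-User": "Perplexity",
    "Claude-User": "Anthropic",
}


def camp_report(blocked_agents):
    """Split a list of fully blocked user agents into the two camps."""
    blocked = {agent.lower() for agent in blocked_agents}
    training = sorted(b for b in TRAINING_BOTS if b.lower() in blocked)
    search = sorted(b for b in SEARCH_BOTS if b.lower() in blocked)
    return {
        "training_blocked": training,
        "search_blocked": search,
        # Any blocked search bot costs visibility in that vendor's answers.
        "loses_ai_answers": bool(search),
    }
```

Feeding it the agents from a 2023-style snippet shows at a glance whether the block list reaches past training into the search camp.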
Step three: pick the two-list strategy on purpose
Allow the search bots, decide separately about the training bots. That is the whole strategy for most sites.
For our shop the call was easy. We allow all twelve, because an AI assistant recommending our toolkits is free distribution, and our content earns by being found, not by being scarce. The full reasoning, bot by bot, is documented in our article "robots.txt and llms.txt, explained".
The opposite call is just as rational for a different business. A newsroom or a paid-course shop selling the content itself can sensibly allow citation bots and block training bots, keeping the traffic channel while declining to feed the corpus. What is never rational is the third state most sites are actually in: a robots.txt that encodes a decision nobody remembers making.
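Either call can be written down as a deliberate robots.txt fragment. The sketch below generates one; `build_ai_rules` is a hypothetical helper, the output is meant to be merged into an existing file rather than replace it, and it relies on the standard behavior that several `User-agent` lines may share one group of rules.

```python
SEARCH_BOTS = [
    "OAI-SearchBot", "ChatGPT-User", "PerplexityBot",
    "Perplexity-User", "Claude-User",
]
TRAINING_BOTS = [
    "GPTBot", "ClaudeBot", "CCBot", "Bytespider",
    "meta-externalagent", "Applebot-Extended", "Google-Extended",
]


def build_ai_rules(block_training):
    """Return a robots.txt fragment for the two-list strategy."""
    allowed = list(SEARCH_BOTS)
    blocked = []
    if block_training:
        blocked = list(TRAINING_BOTS)   # newsroom or paid-course choice
    else:
        allowed += TRAINING_BOTS        # allow all twelve

    lines = [f"User-agent: {bot}" for bot in allowed]
    lines += ["Allow: /", ""]
    if blocked:
        lines += [f"User-agent: {bot}" for bot in blocked]
        lines += ["Disallow: /", ""]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_ai_rules(block_training=True))
```

One caveat with this design: a bot that has its own group no longer reads the `User-agent: *` group, so any paths disallowed there (an admin area, for example) would need to be repeated inside the AI groups.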
Step four, the bonus minute: hand them a map
Once crawlers can reach you, the next question is what they read first. An llms.txt file at your site root is a curated markdown map: site name, one-sentence summary, and annotated links to the pages that answer "what is this site and can I trust it".
Honesty requires saying what it does not do. Google has stated Search ignores the file, and no platform treats it as a ranking input. What it does: AI coding assistants fetch it, some answer engines read it when summarizing a domain, and it is the one document where you control the exact wording an AI encounters first. Ten minutes of work for a maybe is a fair trade. The llms.txt generator builds a spec-valid file in the browser and validates it as you type.
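For a sense of what such a file looks like, here is a small sketch that assembles one. It assumes the commonly described llms.txt layout (a title heading, a one-line blockquote summary, then sections of annotated links); `build_llms_txt` and every name and URL in the example call are placeholders, not the output of any particular generator.

```python
def build_llms_txt(name, summary, sections):
    """Assemble llms.txt text from a site name, summary, and link sections.

    sections maps a section heading to a list of (title, url, note) tuples.
    """
    out = [f"# {name}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        out.append(f"## {heading}")
        out.append("")
        for title, url, note in links:
            out.append(f"- [{title}]({url}): {note}")
        out.append("")
    return "\n".join(out).rstrip() + "\n"


if __name__ == "__main__":
    # Placeholder site used only for illustration.
    print(build_llms_txt(
        "Example Shop",
        "Toolkits and guides for technical SEO.",
        {"Start here": [
            ("About", "https://example.com/about", "Who runs the site"),
            ("Pricing", "https://example.com/pricing", "Plans and refunds"),
        ]},
    ))
```

The result would be saved as `llms.txt` at the site root, next to robots.txt.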
What this audit does not cover
Access is step zero, not the strategy. Whether AI engines actually cite you once they can read you depends on citability: answer-shaped content blocks, quotable facts with numbers attached, schema markup, and entity signals. That is a content discipline, not a config file, and it is the gap where most "AI SEO" effort should actually go. The AI Search Visibility Toolkit packages our full audit workflow for exactly that next layer: eleven skills, citability scoring included.
But run the five minutes first. There is no point optimizing for engines you blocked in 2023 and forgot about.
Frequently asked
- Does blocking GPTBot remove my site from ChatGPT answers?
  Not directly. GPTBot collects training data; ChatGPT's live answers with citations come through OAI-SearchBot and ChatGPT-User. Blocking GPTBot keeps you out of future training corpora while leaving you citable. Blocking the search bots is what removes you from answers, along with the referral traffic those answers send.
- Does blocking Google-Extended hurt Google rankings?
  No. Google has documented that Google-Extended only controls Gemini model training. It is not a Search ranking signal, and AI Overviews use Googlebot. You can block Google-Extended and lose nothing in classic search results.
- My site has no robots.txt at all. Is that a problem?
  No file means every compliant crawler assumes full access, which for most product and content sites is exactly the right state. It is the sites that copy-pasted a block-everything snippet during the 2023 panic that need the audit most, because they opted out of AI answers without deciding to.
- Do AI companies actually respect robots.txt?
  The documented major bots do: OpenAI, Anthropic, Google, and Apple publish their user agents and state that they honor the protocol. Reported violations cluster around smaller undisclosed scrapers. robots.txt is a published preference, not a wall; for hard enforcement you need CDN or WAF rules.
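Hard enforcement means refusing the request itself instead of publishing a preference. CDN and WAF rule syntax differs by vendor, so the sketch below swaps in an application-level stand-in: a small WSGI middleware that returns 403 when the User-Agent header contains a blocked substring. The names `BLOCKED_SUBSTRINGS`, `is_blocked`, and `block_bots` are hypothetical, and the block list is only an example.

```python
# Example block list; substrings are matched case-insensitively.
BLOCKED_SUBSTRINGS = ("bytespider", "ccbot")


def is_blocked(user_agent):
    """True if the User-Agent header contains a blocked substring."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in BLOCKED_SUBSTRINGS)


def block_bots(app):
    """Wrap a WSGI app so blocked user agents receive 403 Forbidden."""
    def middleware(environ, start_response):
        if is_blocked(environ.get("HTTP_USER_AGENT")):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return middleware
```

User-Agent headers are trivial to spoof, so this only stops bots that identify themselves honestly; an undisclosed scraper would pass straight through, and catching those takes network-level signals that a CDN or WAF is better placed to see.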
Written by
İsmail Günaydın
Software Engineer · SEO/GEO/AEO Strategist · Digital Entrepreneur
Software engineer and digital entrepreneur with 15+ years building SEO-driven products. Founder of ModernWebSEO and ToolGenX. Focused on developer experience, web performance, and making technical content accessible. Builds customer-generating digital infrastructure through SEO, AEO, and GEO strategies.