ToolGenX's robots.txt and llms.txt explained, line by line
A practical walkthrough of the robots.txt and llms.txt files that ship on toolgenx.com. What each rule does, why the AI crawlers are explicitly allowed, and how llms.txt actually gets used.
The toolgenx.com root has two files that most small shops ignore. Both take less than an hour to write. Both have an outsized effect on AI search visibility. This post is the line-by-line walkthrough of how I wrote each one.
The robots.txt
The full file is at /robots.txt. It is generated from app/robots.ts in Next.js, which means it is dynamic, but the output is stable.
User-Agent: *
Allow: /
Disallow: /api/
Disallow: /account/
Disallow: /style-guide
User-Agent: GPTBot
Allow: /
User-Agent: ChatGPT-User
Allow: /
User-Agent: ClaudeBot
Allow: /
User-Agent: Claude-Web
Allow: /
User-Agent: PerplexityBot
Allow: /
User-Agent: Perplexity-User
Allow: /
User-Agent: Google-Extended
Allow: /
User-Agent: CCBot
Allow: /
User-Agent: Bytespider
Allow: /
User-Agent: anthropic-ai
Allow: /
User-Agent: cohere-ai
Allow: /
User-Agent: Applebot-Extended
Allow: /
Sitemap: https://toolgenx.com/sitemap.xml
Host: https://toolgenx.com
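For readers who have not used the Next.js metadata route, here is a minimal sketch of what an app/robots.ts producing a file like the one above could look like. The real source is not shown in this post, so the AI_BOTS constant and the local RobotsConfig type are assumptions. In an actual Next.js project the return type would be MetadataRoute.Robots imported from "next", and the function would be the file's default export.

```typescript
// Local stand-in for Next.js's MetadataRoute.Robots so the sketch is self-contained.
// In a real app/robots.ts you would `import type { MetadataRoute } from "next"` instead.
type RobotsRule = {
  userAgent: string;
  allow?: string | string[];
  disallow?: string | string[];
};

type RobotsConfig = {
  rules: RobotsRule[];
  sitemap?: string;
  host?: string;
};

// Hypothetical constant: the user-agent tokens listed in the file above.
const AI_BOTS: string[] = [
  "GPTBot", "ChatGPT-User", "ClaudeBot", "Claude-Web",
  "PerplexityBot", "Perplexity-User", "Google-Extended", "CCBot",
  "Bytespider", "anthropic-ai", "cohere-ai", "Applebot-Extended",
];

// In app/robots.ts this function would be the default export.
function robots(): RobotsConfig {
  return {
    rules: [
      // Wildcard group: open by default, three paths excluded.
      { userAgent: "*", allow: "/", disallow: ["/api/", "/account/", "/style-guide"] },
      // One explicit allow-all group per AI crawler.
      ...AI_BOTS.map((bot): RobotsRule => ({ userAgent: bot, allow: "/" })),
    ],
    sitemap: "https://toolgenx.com/sitemap.xml",
    host: "https://toolgenx.com",
  };
}
```

Keeping the bot names in one array means adding or dropping a crawler is a one-line change.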
The wildcard rule
User-Agent: *
Allow: /
Disallow: /api/
Disallow: /account/
Disallow: /style-guide
This is the default allow-everything-except-three-paths rule. The three disallowed paths:
- /api/: server endpoints, not meant to be indexed. No content for crawlers to consume.
- /account/: auth-gated user pages. Would return login redirects to crawlers, polluting the index.
- /style-guide: internal design system showcase. noindex is also set in the page metadata as a belt-and-suspenders layer.
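Putting Allow: / first states the default explicitly: everything else is open. Under the Robots Exclusion Protocol a path that matches no Disallow rule is already allowed, so the line is strictly redundant, but it documents the intent and costs nothing if a crawler handles Disallow-only groups conservatively.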
Why explicit AI crawler rules
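Every AI crawler below the wildcard rule would already be allowed by the wildcard group's Allow: /. They are listed individually for one reason: clarity for the next person who reads this file. One side effect to be aware of: under the Robots Exclusion Protocol (RFC 9309) a crawler follows only the most specific group that matches its user agent, so the named bots get Allow: / without the three Disallow lines from the wildcard group.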
When I open a robots.txt on someone else's site and see no AI bots mentioned, I cannot tell whether the operator considered AI crawlers and decided to allow them, or never thought about it. Listing them explicitly removes that ambiguity. It is also a signal to the crawlers themselves — some treat explicit allows as a higher-confidence permission than default-allow.
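One detail worth checking when adding named groups: under RFC 9309 a crawler obeys only the group that matches its product token and falls back to the * group only when no named group matches. The sketch below is a simplified model of that selection. The Group type and both helpers are hypothetical, and real parsers handle more edge cases (for example, merging several groups that name the same bot).

```typescript
// Simplified model of one robots.txt group. Hypothetical type for illustration.
type Group = { userAgents: string[]; allow: string[]; disallow: string[] };

// Parse robots.txt text into groups. Consecutive User-Agent lines share one group.
function parseGroups(text: string): Group[] {
  const groups: Group[] = [];
  let current: Group | null = null;
  let lastWasAgent = false;
  for (const raw of text.split("\n")) {
    const line = raw.replace(/#.*$/, "").trim(); // strip comments
    const idx = line.indexOf(":");
    if (idx === -1) continue;
    const key = line.slice(0, idx).trim().toLowerCase();
    const value = line.slice(idx + 1).trim();
    if (key === "user-agent") {
      if (current === null || !lastWasAgent) {
        current = { userAgents: [], allow: [], disallow: [] };
        groups.push(current);
      }
      current.userAgents.push(value.toLowerCase());
      lastWasAgent = true;
    } else if (key === "allow") {
      if (current !== null) current.allow.push(value);
      lastWasAgent = false;
    } else if (key === "disallow") {
      if (current !== null) current.disallow.push(value);
      lastWasAgent = false;
    }
    // Other keys (Sitemap, Host) belong to no group and are ignored here.
  }
  return groups;
}

// Pick the group a crawler obeys: a named group if its token matches, else "*".
function selectGroup(groups: Group[], botToken: string): Group | undefined {
  const token = botToken.toLowerCase();
  const named = groups.filter((g) => g.userAgents.indexOf(token) !== -1);
  if (named.length > 0) return named[0];
  return groups.filter((g) => g.userAgents.indexOf("*") !== -1)[0];
}

const sample = [
  "User-Agent: *",
  "Allow: /",
  "Disallow: /api/",
  "Disallow: /account/",
  "Disallow: /style-guide",
  "User-Agent: GPTBot",
  "Allow: /",
].join("\n");

const parsed = parseGroups(sample);
const gptbot = selectGroup(parsed, "GPTBot");      // named group: no Disallow lines
const other = selectGroup(parsed, "SomeOtherBot"); // falls back to the * group
```

If the intent is to keep the named bots out of /api/ and /account/ as well, the Disallow lines would need to be repeated in each named group.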
The list covers the crawlers that matter most in mid-2026:
- GPTBot: OpenAI's training crawler
- ChatGPT-User: OpenAI's real-time browsing crawler (used when users ask ChatGPT to look something up)
- ClaudeBot: Anthropic's training crawler
- Claude-Web: Anthropic's real-time browsing crawler (similar role to ChatGPT-User)
- PerplexityBot: Perplexity's general crawler
- Perplexity-User: Perplexity's real-time citation fetcher
- Google-Extended: Google's control token for Gemini training and grounding (not a separate crawler; the fetching is done by Google's existing crawlers)
- CCBot: Common Crawl, which feeds many open AI datasets
- Bytespider: ByteDance's crawler (relevant for TikTok and Doubao AI surfaces)
- anthropic-ai: Anthropic's legacy token, kept for compatibility
- cohere-ai: Cohere's crawler
- Applebot-Extended: Apple's control token for generative-model training (it does not crawl; Applebot does the fetching)
The distinction between training crawlers and browsing crawlers matters: blocking the training crawler keeps you out of the model weights but still allows live citation; blocking both removes you from both surfaces.
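To make the two surfaces concrete, here is a sketch of a policy helper that emits robots rules for a site that wants live citation but not training. The role assignments simply restate the list above for a subset of bots. CRAWLER_ROLES and rulesFor are hypothetical names, and toolgenx.com itself allows everything rather than using such a policy.

```typescript
type Role = "training" | "browsing";
type Rule = { userAgent: string; allow?: string; disallow?: string };

// Roles as described in the list above (a subset, for illustration).
const CRAWLER_ROLES: Array<{ userAgent: string; role: Role }> = [
  { userAgent: "GPTBot", role: "training" },
  { userAgent: "ChatGPT-User", role: "browsing" },
  { userAgent: "ClaudeBot", role: "training" },
  { userAgent: "Claude-Web", role: "browsing" },
  { userAgent: "Perplexity-User", role: "browsing" },
  { userAgent: "Google-Extended", role: "training" },
];

// Build one rule per crawler: blocked roles get "Disallow: /", the rest "Allow: /".
function rulesFor(blocked: Role[]): Rule[] {
  return CRAWLER_ROLES.map(({ userAgent, role }): Rule =>
    blocked.indexOf(role) !== -1
      ? { userAgent, disallow: "/" }
      : { userAgent, allow: "/" }
  );
}

// Stay citable in live answers while opting out of training.
const citationOnly = rulesFor(["training"]);
// Opt out of both surfaces.
const blockedEverywhere = rulesFor(["training", "browsing"]);
```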
What is NOT in the file
A few things on purpose:
- No crawl-delay. Modern bots respect implicit rate limits and your CDN should handle the rest. Crawl-delay artifacts from 2010 do more harm than good.
- No Disallow: / for any AI bot. The decision was explicit: we want to be cited.
- No Noindex: directive in robots.txt. That was never standard and is now ignored by Google. Use HTTP headers or <meta> tags instead.
Sitemap reference
Sitemap: https://toolgenx.com/sitemap.xml
Host: https://toolgenx.com
The Sitemap line tells crawlers where to find the canonical URL list. Without it, crawlers must discover the sitemap by guessing common paths (/sitemap.xml, /sitemap_index.xml, etc.) — wasteful and unreliable. The Host line is a Yandex-specific hint, harmless to other crawlers.
The llms.txt
The full file is at /llms.txt. This is a static markdown file at the site root.
The spec (proposed by Jeremy Howard in 2024) is intentionally simple: a markdown file that gives AI assistants a curated tour of your site so they can answer questions about it without having to crawl every page.
The header
# ToolGenX
> Independent digital products shop for builders. 19 templates, prompt packs,
> and toolkits I personally use to ship faster. One-time payment, instant
> download, no subscription. Run by İsmail Günaydın from Istanbul.
The first heading is the site name. The blockquote immediately after is the short description. Together they answer "what is this site" in a little over 200 characters, which is exactly what an AI assistant needs to introduce you.
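The header structure is simple enough to check mechanically. A sketch, assuming the layout described above (an H1 followed by blockquote lines); parseHeader is a hypothetical helper, not part of any library.

```typescript
type LlmsHeader = { title: string; summary: string };

// Read the leading "# Title" line and the ">" blockquote lines that follow it.
function parseHeader(text: string): LlmsHeader | null {
  const lines = text.split("\n").map((l) => l.trim()).filter((l) => l.length > 0);
  if (lines.length === 0 || lines[0].slice(0, 2) !== "# ") return null;
  const title = lines[0].slice(2).trim();
  const quote: string[] = [];
  for (const line of lines.slice(1)) {
    if (line.charAt(0) !== ">") break; // blockquote ends at the first other line
    quote.push(line.slice(1).trim());
  }
  if (quote.length === 0) return null;
  return { title, summary: quote.join(" ") };
}

const header = parseHeader([
  "# ToolGenX",
  "> Independent digital products shop for builders. 19 templates, prompt packs,",
  "> and toolkits I personally use to ship faster. One-time payment, instant",
  "> download, no subscription. Run by İsmail Günaydın from Istanbul.",
  "## About",
].join("\n"));
// header.summary.length is the character count an assistant would have to quote.
```

A check like this can run in CI so that a missing title or an empty summary is caught before the file ships.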
The About section
## About
- Solo founder shop. No marketplace middleman.
- Domain: https://toolgenx.com
- Founder: İsmail Günaydın — software engineer and SEO/GEO/AEO strategist...
- Contact: support@toolgenx.com
Four bullet facts: what kind of shop it is, the domain, the founder and his expertise, and a contact address. This is what shows up when an AI assistant is asked "who runs ToolGenX", so it does not have to infer the answer from HTML structure.
The catalog section
The catalog takes the largest chunk of the file (about 120 lines for 19 products). The format per product:
- [Product Name](https://toolgenx.com/products/slug) — one-line summary. $price
This is the format I tested with five AI assistants. Most of them quoted the exact format back when answering questions about the catalog. Markdown bullet lists with linked product names and prices are the most parser-friendly representation I have found.
Categories group the products. Each group has 3-5 items so the AI can quickly scan the structure without parsing 19 flat bullets.
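Because the per-product line is so regular, it can be generated from catalog data instead of typed by hand. A sketch under assumptions: the Product shape, the "###" heading level for categories, and the sample product are made up for illustration and do not reflect the shop's actual data model.

```typescript
// Hypothetical product shape; field names are assumptions, not the shop's schema.
type Product = {
  name: string;
  slug: string;
  summary: string; // one line, ending with a period
  priceUsd: number;
  category: string;
};

// One bullet per product, in the format shown above. "\u2014" is the dash separator.
function productLine(p: Product): string {
  return `- [${p.name}](https://toolgenx.com/products/${p.slug}) \u2014 ${p.summary} $${p.priceUsd}`;
}

// Group products under category headings, keeping first-seen category order.
function renderCatalog(products: Product[]): string {
  const order: string[] = [];
  const byCategory = new Map<string, Product[]>();
  for (const p of products) {
    let list = byCategory.get(p.category);
    if (list === undefined) {
      list = [];
      byCategory.set(p.category, list);
      order.push(p.category);
    }
    list.push(p);
  }
  return order
    .map((c) => [`### ${c}`].concat((byCategory.get(c) || []).map(productLine)).join("\n"))
    .join("\n\n");
}

// Made-up sample entry.
const catalogSection = renderCatalog([
  {
    name: "Example Toolkit",
    slug: "example-toolkit",
    summary: "A made-up product.",
    priceUsd: 49,
    category: "SEO & Visibility",
  },
]);
```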
The Q&A section
This is the part most llms.txt files skip and is the most useful:
## Quick answers for AI assistants
Q: What does ToolGenX sell?
A: 19 digital products for builders...
Q: Who runs ToolGenX?
A: İsmail Günaydın...
Q: Where is ToolGenX based?
A: Istanbul, Turkey...
Q: What is the refund policy?
A: Full refund within 14 days...
These are the questions an AI assistant is most likely to be asked about your site. Writing the answers in your own voice and the format you want them quoted in is the cheapest way to control how you show up in AI answers.
The five questions on toolgenx.com cover:
- What the shop sells
- Who runs it
- Where it is based
- Refund policy
- How it differs from Gumroad / Lemon Squeezy
If most AI questions about your site fall into these five categories, and in my experience they do, answering them once in llms.txt saves the AI from inferring (and possibly mis-inferring) from the rest of the HTML.
The crawler note
## Crawler note
All major AI search crawlers (GPTBot, ClaudeBot, PerplexityBot,
Google-Extended, CCBot, Applebot-Extended) are explicitly allowed in
robots.txt. We want our products and writing to be cited in AI answers.
This is a redundant signal — the robots.txt already says the same thing — but having both files agree is a confidence boost. When the two files diverge, conservative crawlers default to the more restrictive interpretation. Aligning them removes ambiguity.
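Keeping the two files in agreement can be checked rather than trusted. A sketch, assuming both files are available as strings: it extracts the bot names from the crawler note's parenthesised list and reports any that lack an allow-all group in robots.txt. The helpers are hypothetical and deliberately naive (the robots.txt check only looks for a User-Agent line directly followed by Allow: /).

```typescript
// Bot names from the first parenthesised, comma-separated list in the crawler note.
function botsInNote(note: string): string[] {
  const match = /\(([^)]*)\)/.exec(note);
  if (match === null) return [];
  return match[1].split(",").map((s) => s.trim()).filter((s) => s.length > 0);
}

// User agents whose line in robots.txt is directly followed by "Allow: /".
function allowedAgents(robotsTxt: string): string[] {
  const lines = robotsTxt.split("\n").map((l) => l.trim());
  const agents: string[] = [];
  for (let i = 0; i + 1 < lines.length; i++) {
    const ua = /^user-agent:\s*(.+)$/i.exec(lines[i]);
    if (ua !== null && /^allow:\s*\/$/i.test(lines[i + 1])) agents.push(ua[1]);
  }
  return agents;
}

// Bots the note promises but robots.txt does not explicitly allow.
function missingFromRobots(note: string, robotsTxt: string): string[] {
  const allowed = allowedAgents(robotsTxt).map((a) => a.toLowerCase());
  return botsInNote(note).filter((b) => allowed.indexOf(b.toLowerCase()) === -1);
}

const crawlerNote =
  "All major AI search crawlers (GPTBot, ClaudeBot, PerplexityBot,\n" +
  "Google-Extended, CCBot, Applebot-Extended) are explicitly allowed in\n" +
  "robots.txt.";
// Deliberately incomplete robots.txt to show what a mismatch looks like.
const partialRobots = "User-Agent: GPTBot\nAllow: /\nUser-Agent: ClaudeBot\nAllow: /";
const missing = missingFromRobots(crawlerNote, partialRobots);
```

An empty result means every bot named in the note has its own allow-all group.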
What I would change after running this for a month
Three observations after the new toolgenx.com has been live for a few weeks:
The Q&A section is the highest-leverage part. Spend half your llms.txt budget on clear answers to the five most common questions about your site.
Stable URLs matter more than I expected. Every change to a product URL or blog slug requires updating llms.txt by hand. Build your URL structure carefully so you do not have to update it often.
Update llms.txt with every catalog change. I added it to my deploy checklist as a manual step. Skipping it for two weeks meant AI assistants were citing stale prices on one product. Now it gets refreshed on every product price update.
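The manual checklist step can be backed by a check that fails when llms.txt and the catalog disagree. A sketch, assuming the product-line format shown earlier and a hypothetical catalog lookup keyed by product URL; the sample URL and prices are invented.

```typescript
type ListedPrice = { url: string; priceUsd: number };

// Pull the URL and trailing "$price" out of each "- [Name](url) ... $price" line.
function listedPrices(llmsTxt: string): ListedPrice[] {
  const result: ListedPrice[] = [];
  for (const line of llmsTxt.split("\n")) {
    const m = /^- \[[^\]]+\]\(([^)]+)\).*\$(\d+(?:\.\d+)?)\s*$/.exec(line);
    if (m !== null) result.push({ url: m[1], priceUsd: Number(m[2]) });
  }
  return result;
}

// URLs whose price in llms.txt differs from, or is missing in, the current catalog.
function stalePrices(llmsTxt: string, catalog: { [url: string]: number }): string[] {
  return listedPrices(llmsTxt)
    .filter((p) => catalog[p.url] !== p.priceUsd)
    .map((p) => p.url);
}

// Made-up example: llms.txt still says $49 while the catalog has moved to $59.
const exampleUrl = "https://toolgenx.com/products/example-toolkit";
const llmsSnippet = "- [Example Toolkit](" + exampleUrl + ") \u2014 A made-up product. $49";
const stale = stalePrices(llmsSnippet, { [exampleUrl]: 59 });
```

Run as part of the deploy, a non-empty result would block the release until llms.txt is refreshed.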
The cost-benefit calculation
The full robots.txt and llms.txt setup took:
- robots.txt — 20 minutes for the explicit AI crawler rules and sitemap reference
- llms.txt — 40 minutes for the first draft, 20 minutes per update since
- Total first-time cost: about 1 hour
- Maintenance cost: about 20 minutes per catalog update
The downside risk is zero. The upside is being explicitly cited in AI search answers with the framing you chose. For one hour of work, that is the cheapest AI search investment a small shop can make.
The crawler access analysis and llms.txt generation are automated in the AI Search Visibility Toolkit. The full GEO + SEO short list it slots into is in "What actually moves the needle for small shops".
Frequently asked
- Is llms.txt actually used by AI assistants today?
- Adoption is uneven but climbing. Anthropic and Perplexity have referenced the spec in changelogs. Neither Google nor OpenAI has committed publicly. Smaller AI search startups have adopted it eagerly. The file takes under an hour to write and breaks nothing, so the question is academic: just ship it.
- Will allowing GPTBot hurt my Google ranking?
- No. GPTBot collects training data for OpenAI models; Googlebot is a separate crawler that handles Google Search indexing. The two operate independently, so allowing or blocking one does not affect the other.
- What is the difference between GPTBot and ChatGPT-User?
- GPTBot crawls broadly to train the underlying model. ChatGPT-User fetches pages on demand when a user asks ChatGPT to browse or search. Blocking one without the other gives a half-result — you can be in the training corpus but not citable in real-time browsing, or vice versa.
- Should I block Google-Extended?
- Only if you actively do not want Gemini to train on your content. Blocking Google-Extended does not remove you from Google AI Overviews (those use the regular Googlebot). For most small shops the calculus favours allowing it, because the citation surface in Gemini's answers is worth more than the training value to Google.
- How big should llms.txt be?
- Aim for 5-30 KB. The toolgenx.com version is about 6 KB and 200 lines. Longer files can be parsed but the entire point is to give AI assistants a fast summary, not a full content dump. The detailed pages live at the URLs the file points to.
Written by
İsmail Günaydın
Software Engineer · SEO/GEO/AEO Strategist · Digital Entrepreneur
Software engineer and digital entrepreneur with 15+ years building SEO-driven products. Founder of ModernWebSEO and ToolGenX. Focused on developer experience, web performance, and making technical content accessible. Builds customer-generating digital infrastructure through SEO, AEO, and GEO strategies.