// geo-seo · ai-search · solo-founder

ToolGenX's robots.txt and llms.txt explained, line by line

A practical walkthrough of the robots.txt and llms.txt files that ship on toolgenx.com. What each rule does, why the AI crawlers are explicitly allowed, and how llms.txt actually gets used.

by İsmail Günaydın · May 7, 2026 · 8 min read · updated Jun 10, 2026

The toolgenx.com root has two files that most small shops ignore. Both take less than an hour to write. Both have an outsized effect on AI search visibility. This post is the line-by-line walkthrough of how I wrote each one.

The robots.txt

The toolgenx.com robots.txt allows everything except /api/, /account/, and /style-guide, then explicitly allows twelve AI crawlers including GPTBot, ClaudeBot, PerplexityBot, Google-Extended, CCBot, and Applebot-Extended. The explicit allows exist because I want the shop cited in AI search answers, and the file ends with a sitemap reference at https://toolgenx.com/sitemap.xml.

The full file is at /robots.txt. It is generated from app/robots.ts in Next.js, so it is produced by code rather than stored as a static file, but the output is stable.

User-Agent: *
Allow: /
Disallow: /api/
Disallow: /account/
Disallow: /style-guide

User-Agent: GPTBot
Allow: /

User-Agent: ChatGPT-User
Allow: /

User-Agent: ClaudeBot
Allow: /

User-Agent: Claude-Web
Allow: /

User-Agent: PerplexityBot
Allow: /

User-Agent: Perplexity-User
Allow: /

User-Agent: Google-Extended
Allow: /

User-Agent: CCBot
Allow: /

User-Agent: Bytespider
Allow: /

User-Agent: anthropic-ai
Allow: /

User-Agent: cohere-ai
Allow: /

User-Agent: Applebot-Extended
Allow: /

Sitemap: https://toolgenx.com/sitemap.xml
Host: https://toolgenx.com
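
Since the file is generated from app/robots.ts, the configuration behind it might look roughly like the sketch below. This is an assumption about the shape, not the actual toolgenx.com source: the RobotsConfig type is a local stand-in for the MetadataRoute.Robots type that Next.js provides, and buildRobots and AI_CRAWLERS are hypothetical names.

```typescript
// Local stand-in for the shape Next.js expects from app/robots.ts
// (MetadataRoute.Robots); defined here so the sketch runs on its own.
type RobotsRule = { userAgent: string; allow?: string; disallow?: string[] };
type RobotsConfig = { rules: RobotsRule[]; sitemap: string; host: string };

// The twelve AI crawler tokens listed in the robots.txt above.
const AI_CRAWLERS: string[] = [
  "GPTBot", "ChatGPT-User", "ClaudeBot", "Claude-Web",
  "PerplexityBot", "Perplexity-User", "Google-Extended", "CCBot",
  "Bytespider", "anthropic-ai", "cohere-ai", "Applebot-Extended",
];

// Hypothetical builder. In a Next.js app this would be the default export
// of app/robots.ts, e.g. `export default function robots() { ... }`.
function buildRobots(): RobotsConfig {
  return {
    rules: [
      // Wildcard group: everything open except the three private paths.
      { userAgent: "*", allow: "/", disallow: ["/api/", "/account/", "/style-guide"] },
      // One explicit allow group per AI crawler.
      ...AI_CRAWLERS.map((userAgent) => ({ userAgent, allow: "/" })),
    ],
    sitemap: "https://toolgenx.com/sitemap.xml",
    host: "https://toolgenx.com",
  };
}
```

Keeping the crawler tokens in one array means adding or removing a bot is a one-line change, and the rendered file stays in the same order on every build.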

The wildcard rule

User-Agent: *
Allow: /
Disallow: /api/
Disallow: /account/
Disallow: /style-guide

This is the default allow-everything-except-three-paths rule. The three disallowed paths:

/api/ — server endpoints, not meant to be indexed. No content for crawlers to consume.
/account/ — auth-gated user pages. Would return login redirects to crawlers, polluting the index.
/style-guide — internal design system showcase. noindex is also set in the page metadata as a belt-and-suspenders layer.

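Allow: / first makes the default explicit. Under the Robots Exclusion Protocol (RFC 9309) any path that is not disallowed is already allowed, so the line is technically redundant, but it documents the intent and costs nothing in case a crawler treats a Disallow-only group conservatively.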

Why explicit AI crawler rules

Every AI crawler below the wildcard rule would already be allowed by the wildcard group's Allow: /. They are listed individually for one reason: clarity for the next person who reads this file. One side effect to be aware of: a crawler that finds a group naming it follows only that group, so the three Disallow lines in the wildcard group do not apply to the listed bots unless they are repeated in each group.

When I open a robots.txt on someone else's site and see no AI bots mentioned, I cannot tell whether the operator considered AI crawlers and decided to allow them, or never thought about it. Listing them explicitly removes that ambiguity. It is also a signal to the crawlers themselves, since some may treat an explicit allow as a higher-confidence permission than default-allow.
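
To see why the group a crawler matches matters, here is a minimal sketch of group selection as RFC 9309 describes it: a crawler uses the group that names its own token and falls back to the wildcard group only when none matches. The parser is deliberately simplified (exact token match, prefix-only path rules, no wildcard patterns), and parseRobots, groupFor, and isAllowed are illustrative names, not part of any library.

```typescript
type Rule = { allow: boolean; path: string };
type Group = { agents: string[]; rules: Rule[] };

// Split robots.txt text into groups. Consecutive User-Agent lines share a group.
function parseRobots(text: string): Group[] {
  const groups: Group[] = [];
  let current: Group | null = null;
  let lastWasAgent = false;
  for (const raw of text.split("\n")) {
    const line = raw.split("#")[0].trim(); // drop comments
    const idx = line.indexOf(":");
    if (idx === -1) continue;
    const key = line.slice(0, idx).trim().toLowerCase();
    const value = line.slice(idx + 1).trim();
    if (key === "user-agent") {
      if (!current || !lastWasAgent) {
        current = { agents: [], rules: [] };
        groups.push(current);
      }
      current.agents.push(value.toLowerCase());
      lastWasAgent = true;
    } else if ((key === "allow" || key === "disallow") && current) {
      current.rules.push({ allow: key === "allow", path: value });
      lastWasAgent = false;
    }
  }
  return groups;
}

// Pick the group naming this crawler, else the wildcard group.
// Simplification: the first matching group wins; RFC 9309 merges duplicates.
function groupFor(groups: Group[], userAgent: string): Group | undefined {
  const ua = userAgent.toLowerCase();
  return (
    groups.find((g) => g.agents.includes(ua)) ||
    groups.find((g) => g.agents.includes("*"))
  );
}

// Longest matching path wins; on a tie, Allow wins. No match means allowed.
function isAllowed(groups: Group[], userAgent: string, path: string): boolean {
  const group = groupFor(groups, userAgent);
  if (!group) return true;
  let best: Rule | undefined;
  for (const rule of group.rules) {
    if (rule.path === "" || !path.startsWith(rule.path)) continue;
    const longer = !best || rule.path.length > best.path.length;
    const tieAllow = !!best && rule.path.length === best.path.length && rule.allow;
    if (longer || tieAllow) best = rule;
  }
  return best ? best.allow : true;
}

const sample = [
  "User-Agent: *",
  "Allow: /",
  "Disallow: /api/",
  "",
  "User-Agent: GPTBot",
  "Allow: /",
].join("\n");
const sampleGroups = parseRobots(sample);
console.log(isAllowed(sampleGroups, "SomeOtherBot", "/api/orders")); // false
console.log(isAllowed(sampleGroups, "GPTBot", "/api/orders")); // true
```

In the sample, the GPTBot group contains only Allow: /, so the /api/ rule from the wildcard group never reaches it. If the private paths should stay off limits for the listed bots as well, the Disallow lines would need to be repeated in each group.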

The list covers the crawlers that matter most in mid-2026:

GPTBot: OpenAI's training crawler
ChatGPT-User: OpenAI's real-time browsing crawler (used when users ask ChatGPT to look something up)
ClaudeBot: Anthropic's training crawler
Claude-Web: Anthropic's real-time browsing crawler (similar role to ChatGPT-User)
PerplexityBot: Perplexity's general crawler
Perplexity-User: Perplexity's real-time citation fetcher
Google-Extended: Google's control token for Gemini training use. It is not a separate crawler (Google's existing crawlers do the fetching) and it is independent of Googlebot's Search indexing
CCBot: Common Crawl, which feeds many open AI datasets
Bytespider: ByteDance's crawler (relevant for TikTok and Doubao AI surfaces)
anthropic-ai: Anthropic's legacy token, kept for compatibility
cohere-ai: Cohere's crawler
Applebot-Extended: Apple's control token for Apple Intelligence training use. Like Google-Extended, it governs how already-crawled content is used rather than doing any crawling itself

The distinction between training crawlers and browsing crawlers matters: blocking only the training crawler keeps your pages out of future training data but still allows live citation; blocking both removes you from both surfaces.

What is NOT in the file

A few things are left out on purpose:

No crawl-delay. The directive is non-standard and Googlebot ignores it; well-behaved bots rate-limit themselves and your CDN should handle the rest. Crawl-delay lines left over from 2010 do more harm than good.
No Disallow: / for any AI bot. The decision was explicit: we want to be cited.
No Noindex: directive in robots.txt. It was never part of the standard, and Google stopped honoring it in 2019. Use the X-Robots-Tag HTTP header or a robots <meta> tag instead.

Sitemap reference

Sitemap: https://toolgenx.com/sitemap.xml
Host: https://toolgenx.com

The Sitemap line tells crawlers where to find the canonical URL list. Without it, crawlers must discover the sitemap by guessing common paths (/sitemap.xml, /sitemap_index.xml, etc.), which is wasteful and unreliable. The Host line is a legacy Yandex-specific hint (Yandex itself has since dropped it) that other crawlers ignore, so it is harmless.

The llms.txt

llms.txt is a single markdown file that hands AI assistants a curated summary of a site in one fetch instead of a full crawl. The toolgenx.com version is about 6 KB and 200 lines: the shop description, founder details, the 19-product catalog with prices, and a Q&A section written for direct quoting.

The full file is at /llms.txt. This is a static markdown file at the site root.

The spec (proposed by Jeremy Howard in 2024) is intentionally simple: a markdown file that gives AI assistants a curated tour of your site so they can answer questions about it without having to crawl every page.

The header

# ToolGenX

> Independent digital products shop for builders. 19 templates, prompt packs,
> and toolkits I personally use to ship faster. One-time payment, instant
> download, no subscription. Run by İsmail Günaydın from Istanbul.

The first heading is the site name. The blockquote immediately after is the short description. Together they answer "what is this site" in roughly 200 characters, which is exactly what an AI assistant needs to introduce you.

The About section

## About

- Solo founder shop. No marketplace middleman.
- Domain: https://toolgenx.com
- Founder: İsmail Günaydın — software engineer and SEO/GEO/AEO strategist...
- Contact: support@toolgenx.com

Four bullets: business model, domain, founder (with expertise), and contact. This is what shows up when an AI assistant is asked "who runs ToolGenX", so it does not have to infer the answer from HTML structure.

The catalog section

The catalog takes the largest chunk of the file (about 120 lines for 19 products). The format per product:

- [Product Name](https://toolgenx.com/products/slug) — one-line summary. $price

This is the format I tested with five AI assistants. Most of them quoted the exact format back when answering questions about the catalog. Markdown bullet lists with linked product names and prices are the most parser-friendly representation I have found.

Categories group the products. Each group has 3-5 items so the AI can quickly scan the structure without parsing 19 flat bullets.
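
The per-product line is easy to generate from catalog data, which avoids hand-editing 19 bullets. A minimal sketch, assuming a hypothetical Product shape and renderCatalog helper; the "###" category headings, the example.com base URL, and the sample product are placeholders of mine, not the actual toolgenx.com file.

```typescript
// Hypothetical product shape; field names are assumptions for illustration.
type Product = {
  name: string;
  slug: string;
  summary: string; // one-line summary, including its trailing period
  priceUsd: number;
  category: string;
};

// Render products grouped by category, one bullet per product in the
// "- [Name](url) <em dash> summary $price" format described above.
function renderCatalog(products: Product[], baseUrl: string): string {
  const byCategory = new Map<string, Product[]>();
  for (const p of products) {
    const group = byCategory.get(p.category) || [];
    group.push(p);
    byCategory.set(p.category, group);
  }
  const lines: string[] = [];
  byCategory.forEach((items, category) => {
    lines.push(`### ${category}`, ""); // heading level is an assumption
    for (const p of items) {
      // \u2014 is the em dash used in the format line above.
      lines.push(`- [${p.name}](${baseUrl}/products/${p.slug}) \u2014 ${p.summary} $${p.priceUsd}`);
    }
    lines.push("");
  });
  return lines.join("\n").trim();
}

// Placeholder data only; not a real product or price.
const demoCatalog = renderCatalog(
  [
    {
      name: "Example Prompt Pack",
      slug: "example-prompt-pack",
      summary: "Placeholder summary.",
      priceUsd: 19,
      category: "Prompt packs",
    },
  ],
  "https://example.com"
);
console.log(demoCatalog);
```

Generating the section from the same data that drives the product pages also means a price change cannot update one and miss the other.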

The Q&A section

This is the part most llms.txt files skip, and it is the most useful:

## Quick answers for AI assistants

Q: What does ToolGenX sell?
A: 19 digital products for builders...

Q: Who runs ToolGenX?
A: İsmail Günaydın...

Q: Where is ToolGenX based?
A: Istanbul, Turkey...

Q: What is the refund policy?
A: Full refund within 14 days...

These are the questions an AI assistant is most likely to be asked about your site. Writing the answers in your own voice and the format you want them quoted in is the cheapest way to control how you show up in AI answers.

The five questions on toolgenx.com cover:

What the shop sells
Who runs it
Where it is based
Refund policy
How it differs from Gumroad / Lemon Squeezy

If a fifth of all AI questions about your site fall into these five categories — and in my experience they do — answering them once in llms.txt saves the AI from inferring (and possibly mis-inferring) from the rest of the HTML.

The crawler note

## Crawler note

All major AI search crawlers (GPTBot, ClaudeBot, PerplexityBot,
Google-Extended, CCBot, Applebot-Extended) are explicitly allowed in
robots.txt. We want our products and writing to be cited in AI answers.

This is a redundant signal, since robots.txt already says the same thing, but having both files agree is a confidence boost. If the two files diverge, a conservative crawler may fall back to the more restrictive interpretation. Aligning them removes ambiguity.
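
One way to keep the two files in agreement is a small check that every bot named in the llms.txt crawler note also has its own User-Agent group in robots.txt. This is a sketch under assumptions: missingFromRobots is a hypothetical helper, and it only confirms that a group exists, not what that group allows.

```typescript
// Return the bots named in llms.txt that have no User-Agent line in robots.txt.
function missingFromRobots(robotsTxt: string, botsNamedInLlms: string[]): string[] {
  const declared = robotsTxt
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => /^user-agent\s*:/i.test(line))
    .map((line) => line.slice(line.indexOf(":") + 1).trim().toLowerCase());
  return botsNamedInLlms.filter((bot) => !declared.includes(bot.toLowerCase()));
}

// The six bots named in the crawler note above.
const notedBots = [
  "GPTBot", "ClaudeBot", "PerplexityBot",
  "Google-Extended", "CCBot", "Applebot-Extended",
];

// Illustrative robots.txt fragment that forgot CCBot.
const partialRobots = [
  "User-Agent: GPTBot", "Allow: /",
  "User-Agent: ClaudeBot", "Allow: /",
  "User-Agent: PerplexityBot", "Allow: /",
  "User-Agent: Google-Extended", "Allow: /",
  "User-Agent: Applebot-Extended", "Allow: /",
].join("\n");

console.log(missingFromRobots(partialRobots, notedBots)); // [ 'CCBot' ]
```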

What I would change after running this for a month

After a month of running robots.txt and llms.txt on toolgenx.com, the lessons are clear: the Q&A section delivers the most leverage, stable URLs save manual edits, and llms.txt must be refreshed on every catalog change. Skipping updates for two weeks left AI assistants citing a stale price on one product.

Three observations after the new toolgenx.com has been live for about a month:

The Q&A section is the highest-leverage part. Spend half your llms.txt budget on clear answers to the five most common questions about your site.
Stable URLs matter more than I expected. Every change to a product URL or blog slug requires updating llms.txt by hand. Build your URL structure carefully so you do not have to update it often.
Update llms.txt with every catalog change. I added it to my deploy checklist as a manual step. Skipping it for two weeks meant AI assistants were citing stale prices on one product. Now it gets refreshed on every product price update.
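
The manual checklist step could be backed by a small check that compares the prices in llms.txt with the catalog before a deploy. A sketch, assuming the per-product line format shown in the catalog section and a hypothetical CatalogEntry shape; findStalePrices is an illustrative name, not something the site is known to use.

```typescript
// Hypothetical minimal catalog record.
type CatalogEntry = { slug: string; priceUsd: number };

// Return slugs whose llms.txt line is missing or ends with a different price.
function findStalePrices(
  llmsTxt: string,
  catalog: CatalogEntry[],
  baseUrl: string
): string[] {
  const lines = llmsTxt.split("\n");
  const stale: string[] = [];
  for (const item of catalog) {
    // Include the closing ")" so "pack" does not match "pack-pro".
    const link = `${baseUrl}/products/${item.slug})`;
    const line = lines.find((l) => l.includes(link));
    // The format ends each product line with "$price".
    const match = line ? line.match(/\$(\d+(?:\.\d+)?)\s*$/) : null;
    if (!match || Number(match[1]) !== item.priceUsd) stale.push(item.slug);
  }
  return stale;
}
```

Run as part of the build, a non-empty result could fail the deploy, which turns "remember to update llms.txt" into something that cannot be skipped for two weeks.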

The cost-benefit calculation

The complete robots.txt and llms.txt setup for toolgenx.com cost about one hour of first-time work: 20 minutes for the crawler rules and 40 minutes for the llms.txt draft. Ongoing maintenance runs roughly 20 minutes per catalog update, and for that price the shop gets cited in AI search answers with framing I chose.

The full robots.txt and llms.txt setup took:

robots.txt — 20 minutes for the explicit AI crawler rules and sitemap reference
llms.txt — 40 minutes for the first draft, 20 minutes per update since
Total first-time cost: about 1 hour
Maintenance cost: about 20 minutes per catalog update

The downside risk is close to zero as long as the file is kept current. The upside is being explicitly cited in AI search answers with the framing you chose. For one hour of work, that is the cheapest AI search investment a small shop can make.

The crawler access analysis and llms.txt generation are automated in the AI Search Visibility Toolkit. The full GEO + SEO short list it slots into is in "What actually moves the needle for small shops".

// faq

Frequently asked

Is llms.txt actually used by AI assistants today?

Adoption is uneven but climbing. Anthropic and Perplexity have referenced the spec in changelogs. Neither Google nor OpenAI has committed publicly. Smaller AI search startups have adopted it eagerly. The file takes under an hour to write and breaks nothing, so the question is academic: just ship it.

Will allowing GPTBot hurt my Google ranking?

No. GPTBot collects training data for OpenAI models. Googlebot is a separate crawler that handles Google Search indexing. Allowing or blocking one has no effect on the other.

What is the difference between GPTBot and ChatGPT-User?

GPTBot crawls broadly to train the underlying model. ChatGPT-User fetches pages on demand when a user asks ChatGPT to browse or search. Blocking one without the other gives a half-result: you can be in the training corpus but not citable in real-time browsing, or vice versa.

Should I block Google-Extended?

Only if you actively do not want Gemini to train on your content. Blocking Google-Extended does not remove you from Google AI Overviews (those use the regular Googlebot). For most small shops the calculus favours allowing it, because being citable in Gemini's answers is worth more to the shop than withholding training data from Google.

How big should llms.txt be?

Aim for 5-30 KB. The toolgenx.com version is about 6 KB and 200 lines. Longer files can be parsed, but the point is to give AI assistants a fast summary, not a full content dump. The detailed pages live at the URLs the file points to.

Written by

İsmail Günaydın

Software Engineer · SEO/GEO/AEO Strategist · Digital Entrepreneur

Software engineer and digital entrepreneur with 15+ years building SEO-driven products. Founder of ModernWebSEO and ToolGenX. Focused on developer experience, web performance, and making technical content accessible. Builds customer-generating digital infrastructure through SEO, AEO, and GEO strategies.
