What Is llms.txt and Why Does Your Site Need One
When a search engine crawler visits your site, it checks robots.txt first. That file tells it which pages it can index, which to skip, and where the sitemap is. It has been a web standard since 1994. Well-behaved crawlers respect it.
AI systems — language models, agents, retrieval pipelines — have no equivalent signal to look for. They arrive at a site with no structured indication of what it contains, what they are permitted to use, or how it is organised. They either scrape everything indiscriminately or guess from context.
llms.txt is the proposed solution. It is a plain text file placed at the root of your site — https://yoursite.com/llms.txt — that provides AI systems with the information they need to interact with your content appropriately.
What Goes in It
The file has a simple structure. At its most basic, it contains:
- A brief description of the site and its purpose
- A list of sections or content areas, with brief descriptions
- Links to key pages or documents the AI should be aware of
- Any permissions or restrictions on how the content can be used
A minimal example:
```text
# Digital Domain Technologies Ltd
> MX consultancy providing Machine Experience strategy, training, and implementation.

## Services
- MX Strategy Consulting: Strategic guidance on designing content for human and machine audiences.
- MX Training: Workshops and courses for content teams.

## Resources
- MX Blog: Practical articles on Machine Experience.

## Notes
Content on this site may be used by AI systems for the purpose of answering user queries.
Do not reproduce full articles without attribution.
```
That is the entire file. Plain text, human-readable, machine-readable, no specialist tooling required to produce or maintain it.
A more complete implementation includes access guidelines — rate limits, cache retention policies, attribution requirements, and content restrictions. MX: The Protocols (Chapter 12) provides full templates for e-commerce, content publishing, and service-oriented sites, including sections for API access, identity delegation, and content restrictions by area.
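As a sketch of what those access-guideline sections can look like — the field names below are illustrative conventions, not a fixed schema; the full templates are in MX: The Protocols:

```text
## Access Guidelines
- Rate limit: no more than 1 request per second per agent
- Cache retention: content may be cached for up to 7 days
- Attribution: cite "Digital Domain Technologies Ltd" with a link to the source page
- Restricted areas: /clients/ and /internal/ must not be summarised or quoted
```

Because llms.txt is advisory, these lines are guidance for well-behaved systems rather than enforced limits.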
Why It Matters
Language models and AI agents that interact with your site need to make decisions: which pages are relevant to a query, which content is authoritative, what the site is for, and whether they have permission to summarise or quote its content.
Without llms.txt, those decisions are made by inference. The model guesses based on page titles, headings, and body text. It may index pages you would rather it did not. It may ignore sections that are directly relevant. It has no way to know your content permissions.
With llms.txt, those decisions are informed. The model knows what the site covers, where the relevant content lives, and what you permit. That reduces errors and increases the likelihood that AI systems represent your content accurately.
This matters most for sites where accuracy carries real consequences — professional services, technical documentation, product information, healthcare, legal. For those sites, a language model misrepresenting content because it had no structured guidance is a real risk, not a theoretical one.
Recent audits show an 85% adoption gap for llms.txt across professional sites. Most have not implemented it, forcing agents to crawl entire site structures to understand organisation — but many of those same sites block agent crawlers entirely through robots.txt. The result is a double exclusion: no guidance for agents that do arrive, and active blocking of agents that try.
The Critical Gap
There is an important limitation that anyone implementing llms.txt should understand.
llms.txt is served as a text or markdown MIME type, not HTML. Common Crawl — the dataset that feeds most large language model training — ingests HTML pages. It does not typically ingest non-HTML files. llms.txt is also rarely included in sitemap.xml, so training-time crawlers may never discover it.
At inference time, the picture is no better. When a machine is answering a specific user query, it goes straight to relevant pages. It does not fetch a site-level directory first. llms.txt is too broad for targeted queries.
The result: llms.txt risks falling between both mechanisms — invisible to training crawlers and irrelevant to inference-time agents.
The practical mitigation today is to publish the same content as an HTML page — /llms.html or /about/for-agents — and include it in your sitemap. That gives training crawlers a page they can ingest, so the guidance can enter model knowledge bases. The Gathering — the independent standards body that governs MX specifications — is proposing a new standard to address this gap.
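Generating that HTML mirror can be automated from the llms.txt file itself. A minimal sketch in Python, handling only the small markdown subset the llms.txt convention uses (the function name and title are my own, not part of any specification):

```python
import html


def llms_txt_to_html(text: str, title: str = "Site guide for AI systems") -> str:
    """Convert a simple llms.txt file into a minimal HTML page.

    Handles only the subset of markdown the llms.txt convention uses:
    '#'/'##' headings, '>' summary lines, and '-' list items.
    """
    body = []
    in_list = False
    for raw in text.splitlines():
        line = raw.strip()
        # Close an open list when a non-list line appears
        if in_list and not line.startswith("- "):
            body.append("</ul>")
            in_list = False
        if line.startswith("## "):
            body.append(f"<h2>{html.escape(line[3:])}</h2>")
        elif line.startswith("# "):
            body.append(f"<h1>{html.escape(line[2:])}</h1>")
        elif line.startswith("> "):
            body.append(f"<p><em>{html.escape(line[2:])}</em></p>")
        elif line.startswith("- "):
            if not in_list:
                body.append("<ul>")
                in_list = True
            body.append(f"<li>{html.escape(line[2:])}</li>")
        elif line:
            body.append(f"<p>{html.escape(line)}</p>")
    if in_list:
        body.append("</ul>")
    return (
        '<!DOCTYPE html><html lang="en"><head>'
        f'<meta charset="utf-8"><title>{html.escape(title)}</title>'
        "</head><body>" + "".join(body) + "</body></html>"
    )
```

Run it against your llms.txt at build time and write the output to /llms.html, then list that URL in sitemap.xml so training crawlers can find it.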
This gap does not make llms.txt pointless. It makes it necessary but insufficient on its own. The HTML equivalent closes the gap.
The Relationship to robots.txt
robots.txt controls crawling. It tells automated systems which URLs they can and cannot visit. It is a technical permission layer.
llms.txt is different in character. It does not control access — AI systems can and do ignore it if they choose, just as some crawlers ignore robots.txt. What it does is provide structured context that well-behaved AI systems can use to do a better job.
Think of it less as a gate and more as a briefing document. You are telling the AI what it needs to know before it starts working with your content.
The three files work together as a system:
| File | Purpose | Audience |
|---|---|---|
| robots.txt | Access control | Search bots |
| sitemap.xml | Content discovery | Search engines |
| llms.txt | Interaction guidance | AI agents |
robots.txt enforces boundaries. sitemap.xml provides structure. llms.txt offers context. Reference your llms.txt in robots.txt with a comment — it costs nothing and signals intent.
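A minimal robots.txt showing that cross-reference — the comment line carries no protocol meaning (crawlers ignore `#` comments), it simply documents intent for anyone inspecting the file:

```text
# robots.txt
User-agent: *
Allow: /

Sitemap: https://yoursite.com/sitemap.xml

# AI systems: interaction guidance at https://yoursite.com/llms.txt
```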
Sites with complete robots.txt files often find llms.txt easy to create because the mental model is identical: you are documenting your site structure for machines that cannot intuit context from visual design.
The Relationship to Agent Cards
llms.txt and agent cards address the same underlying problem from different angles.
llms.txt is aimed at language models and retrieval systems — AI that reads your content to answer questions. It describes what is there and what is permitted.
An agent card (defined by the Agent2Agent protocol) is aimed at autonomous agents that want to interact with your service — AI that takes actions on behalf of users. It describes what your service can do and how to call it, published as a JSON file at /.well-known/agent-card.json.
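A trimmed sketch of an agent card, with hypothetical values — consult the Agent2Agent specification for the full required schema before publishing one:

```json
{
  "name": "Example Store Agent",
  "description": "Answers product questions and checks order status.",
  "url": "https://yoursite.com/a2a",
  "version": "1.0.0",
  "capabilities": { "streaming": false },
  "skills": [
    {
      "id": "order-status",
      "name": "Order status lookup",
      "description": "Returns the status of an order by order number."
    }
  ]
}
```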
A site with both is telling the full story: here is what we contain, here is what we permit, and here is how to work with us as a service. That is the foundation of genuine agent discoverability.
For informational sites, llms.txt is the priority. For transactional or service-oriented sites, agent cards matter more. Most sites benefit from both.
The Three-Layer Approach
The most effective machine compatibility combines three complementary systems, each building on the last:
Layer 1 — llms.txt (site-wide defaults). Emerging convention. Describes the site as a whole: what it contains, access guidelines, rate limits, and content permissions. Every page benefits from the site-level context.
Layer 2 — Page-level metadata. MX carrier tags and standard HTML meta tags can override or supplement the site-wide defaults for individual pages. A product page might allow full extraction even if the site default is restricted. A `<link rel="api">` element can point to a page-specific API endpoint.
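In a page's `<head>`, that layer might look like this — the product and URLs are hypothetical, and the MX carrier-tag vocabulary itself is defined in MX: The Protocols rather than here:

```html
<!-- Page-level metadata for a single product page -->
<meta name="description" content="Industrial label printer, model LP-400.">
<link rel="canonical" href="https://yoursite.com/products/lp-400">
<link rel="api" href="https://yoursite.com/api/products/lp-400">
```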
Layer 3 — JSON-LD structured data. The actual content in machine-readable form — Schema.org Product, Article, Organisation types with their specific properties. This is what agents extract and act on.
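A Schema.org Product block for the same hypothetical page — this is the layer an agent actually extracts values from:

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "LP-400 Label Printer",
  "description": "Industrial label printer for warehouse use.",
  "offers": {
    "@type": "Offer",
    "price": "499.00",
    "priceCurrency": "GBP",
    "availability": "https://schema.org/InStock"
  }
}
</script>
```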
A machine visiting your page checks llms.txt for site policy, reads page-level metadata for this specific page, fetches structured data, and respects your rate limits and guidance. Each layer adds specificity. No single layer is sufficient alone.
Extended llms.txt with Metadata
The standard llms.txt format specifies only URLs to curated content. However, when machines access llms.txt directly — bypassing your HTML pages — they miss all the metadata layers that HTML provides: author information, company details, publication context.
MX proposes extending llms.txt with markdown-formatted metadata at the top of the file:
```text
# About This Site

**Author:** Tom Cranstoun
**Company:** CogNovaMX Ltd
**Focus:** Machine compatibility, web accessibility, GEO patterns
**Contact:** [email protected]

Tom Cranstoun works on making websites accessible to both humans and machines.

# Curated Resources

## Book
https://allabout.network/mx-handbook

## Technical Documentation
https://allabout.network/docs/agent-patterns
```
This approach compensates for metadata loss when machines read llms.txt instead of parsing full HTML pages. The markdown format provides human-readable context whilst remaining machine-parseable. Standard llms.txt parsers that expect only URLs skip the markdown header and process the URL sections normally — the extension is backwards-compatible.
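The backwards-compatibility claim is easy to see in code: a link-oriented consumer can collect URLs line by line and the markdown header simply contributes nothing. A minimal sketch (the function name is my own):

```python
import re


def extract_urls(llms_txt: str) -> list[str]:
    """Pull the URL entries out of an llms.txt file.

    Metadata lines in the markdown header contain no http(s) URLs,
    so a parser that only wants links skips them automatically.
    """
    urls = []
    for line in llms_txt.splitlines():
        match = re.search(r"https?://\S+", line)
        if match:
            urls.append(match.group(0))
    return urls
```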
This is a proposed enhancement, not part of the current llms.txt specification. The standard URL-only format remains valid. But extended metadata improves machine citation accuracy when HTML metadata is unavailable. See Appendix H of MX: The Protocols for complete implementation guidance.
How to Create One
The file is plain text. You do not need a build tool, a CMS plugin, or a developer. You need a text editor and access to your site's root directory.
Work through these questions and write the answers down:
- What is this site for, in one or two sentences?
- What are its main sections or content areas?
- What are the most important pages an AI system should know about?
- What, if anything, are you restricting?
Put the answers in the format above, save the file as llms.txt, and place it at your domain root.
Then — and this is the step most guides omit — publish an HTML version of the same content at a URL included in your sitemap. Without the HTML equivalent, training crawlers may never find your guidance.
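You can verify the sitemap step mechanically. A small sketch that checks whether a sitemap.xml document lists the HTML mirror (the function name and example URL are mine; fetch the sitemap however you normally would):

```python
import xml.etree.ElementTree as ET


def sitemap_contains(sitemap_xml: str, url: str) -> bool:
    """Return True if a sitemap.xml document lists the given URL."""
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    root = ET.fromstring(sitemap_xml)
    locs = [loc.text.strip() for loc in root.findall(".//sm:loc", ns) if loc.text]
    return url in locs
```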
Review it when your site structure changes significantly, when you add new content areas, or when your permissions policy changes.
The MX View
In MX terms, llms.txt sits within a spectrum of signals that make content meaning explicit rather than leaving machines to infer it. YAML frontmatter, Schema.org structured data, semantic HTML, Open Graph tags, Dublin Core metadata — they are all mechanisms serving the same principle.
MX does not replace any of these standards. It amplifies them. The MX principle is: use the right standard for the context, ensure it is present and correct, and add MX governance metadata only where existing standards leave gaps — lifecycle status, content policy, audience targeting, and AI interaction rules.
The principle is consistent: machines that receive structured, explicit information about content make fewer errors and produce more accurate outputs than machines that must infer everything from context. When machines must "think" — infer meaning from incomplete structure — they hallucinate. Explicit structure prevents this.
The effort to provide that information is low. The benefit accumulates over every AI interaction with your site. And the work is not novel — it is the same structured, semantic, accessible content practice that good web development has always recommended. The urgency is what has changed.
MX: The Handbook covers the full stack of these signals — from document-level metadata to site-level discoverability — and how they work together to create content that serves both human and machine audiences reliably. MX: The Protocols provides the technical specifications, templates, and implementation guidance.
Related reading
- Agent Discoverability: What Your Site Is Missing — diagnose the signals AI agents look for
- What Is Machine Experience? — the discipline behind these patterns
- Machine Experience: Adding Metadata — the 5-stage agent journey
- MX: A New Role — audit data and the convergence principle
Tom Cranstoun is the Machine Experience Authority and founder of the MX community. His book MX: The Handbook is available now. He consults on MX strategy through CogNovaMX Ltd.