Agent Discoverability: What Your Site Is Missing

AI agents that act on behalf of users — finding services, comparing options, making recommendations, completing transactions — do not discover websites the way search engines do. They look for structured signals at specific locations. If those signals are absent, the site is functionally invisible to that class of agent, regardless of how good its content is.

Most sites are missing most of these signals. Recent audits paint a stark picture: 70% of professional sites lack proper semantic HTML, 85% have no llms.txt file, 60% actively block major AI crawlers, and 55% have missing or partial Schema.org coverage. These are not budget-constrained sites. These patterns appear across organisations with sophisticated digital teams, substantial web budgets, and public commitments to digital excellence. The gap is not about resources. It is about awareness.

This post diagnoses what the signals are, what the absence of each one costs, and what fixing it involves.

The 5-Stage Agent Journey

Before examining individual layers, it helps to understand what agents are trying to do. When AI agents interact with a website, they follow a predictable journey with five stages:

  1. Discovery — Can agents find you? Requires crawlable structure, semantic HTML, server-side rendering.
  2. Citation — Can agents confidently cite you? Requires fact-level clarity, Schema.org JSON-LD, citation-worthy architecture.
  3. Comparison — Can agents understand your offering relative to others? Requires explicit comparison attributes, structured pricing data.
  4. Pricing — Can agents understand your costs without error? Requires Schema.org Product/Offer types with unambiguous currency (ISO 4217 codes).
  5. Confidence — Can agents complete the user's goal? Requires explicit form semantics, DOM-reflected state, persistent feedback.

The catastrophic failure principle applies: miss any stage and the entire chain breaks. A site that is discoverable but uncitable is functionally the same as a site that is invisible — the agent cannot recommend it. Each layer described below maps to one or more of these stages.

The Crawl Layer

Before any content is read, an agent checks whether it is permitted to read it. This is Stage 1 — Discovery — and it starts with robots.txt.

Audits show 60% of professional sites block major AI agents. Sites routinely block GPTBot, ClaudeBot, Amazonbot, and other AI crawlers through robots.txt directives or services like Cloudflare. The irony is stark: organisations want AI-mediated recommendations but actively prevent agents from accessing the content they need to make those recommendations.

Many sites block AI crawlers without intending to — typically because they added broad disallow rules to block scrapers and those rules catch legitimate AI user-agent strings too. The result is a site that has actively told AI systems to stay away. If your robots.txt blocks AI crawlers, you are opting out of AI indexing entirely. Zero recommendations. Zero citations. Complete invisibility.

Check your robots.txt and verify which user agents are disallowed. The worst-agent design principle applies here: you cannot detect which agent is visiting — User-Agent strings are spoofable. Design for the worst agent, and you are compatible with all agents.
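This check can be automated with Python's standard-library robots.txt parser. The robots.txt content and the list of AI user-agent tokens below are illustrative; substitute your live file and verify current crawler names against each vendor's documentation:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content -- in practice, fetch the live
# file from https://yourdomain.com/robots.txt
ROBOTS_TXT = """\
User-agent: *
Allow: /

User-agent: BadScraper
Disallow: /
"""

# User-agent tokens for some well-known AI crawlers (illustrative list)
AI_AGENTS = ["GPTBot", "ClaudeBot", "Amazonbot", "PerplexityBot"]

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

for agent in AI_AGENTS:
    allowed = parser.can_fetch(agent, "/")
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

Run this against your own robots.txt: any AI crawler reported as blocked is one you have opted out of entirely.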

The inverse problem also exists: no robots.txt at all, which leaves AI systems with no guidance. A minimal robots.txt that explicitly permits reputable AI crawlers is a positive signal, not just the absence of a negative one.
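A minimal, explicitly permissive robots.txt might look like the sketch below. The user-agent tokens and the disallowed path are illustrative; confirm current token names with each crawler operator before publishing:

```text
# Explicitly welcome reputable AI crawlers
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Amazonbot
Allow: /

# Everyone else: crawl freely
User-agent: *
Allow: /
```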

The Site Description Layer

An agent that is permitted to crawl your site still has no structured description of what it will find. llms.txt fills this gap — and 85% of sites have not implemented it.

A site without llms.txt forces AI systems to infer its purpose, structure, and permissions from page content alone. That inference is imprecise. The model may mischaracterise the site's subject matter, miss important content areas, or apply default permissions that do not match your intent.

llms.txt is a Markdown file at your domain root. It describes the site in terms an AI can use: what it is for, what its main sections contain, which pages are most relevant, and what you permit. For most sites it takes less than an hour to write and requires no technical infrastructure beyond the ability to place a file at the domain root.
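A minimal example for a hypothetical consultancy site, following the llms.txt proposal's structure (an H1 site name, a blockquote summary, then linked sections). All names and URLs here are placeholders:

```markdown
# Acme Consulting
> Acme Consulting advises mid-sized firms on supply-chain analytics.
> Content may be quoted with attribution.

## Services
- [Advisory services](https://example.com/services): engagement types and scope

## Docs
- [Case studies](https://example.com/case-studies): anonymised project write-ups
- [Pricing](https://example.com/pricing): current rates in GBP

## Optional
- [Company history](https://example.com/about/history)
```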

There is an important caveat. llms.txt is served as a text or markdown MIME type, not HTML. Training-time crawlers (Common Crawl and its derivatives) do not typically ingest non-HTML files. At inference time, agents go straight to relevant pages — they do not fetch a site-level directory first. To close this gap today, publish the same content as an HTML page (for example, /llms.html or /about/for-agents) and include it in your sitemap, so training crawlers ingest it and the guidance enters model knowledge bases.

A site without one is leaving its AI representation to chance. A site with one — plus an HTML equivalent in the sitemap — is providing agents with a briefing document before they start working with the content. For a full guide, see What Is llms.txt and Why Does Your Site Need One.

The Service Description Layer

llms.txt describes content. An agent card describes a service.

If your site is more than a collection of articles — if it offers something that agents might want to use on behalf of a user, from booking to data retrieval to document processing — an agent card is how you make that service findable in agentic workflows.

The Agent2Agent (A2A) protocol defines the format: a JSON file at /.well-known/agent-card.json describing your service's capabilities, endpoint, and authentication requirements. An agent looking for a service that can perform a particular task will check this location. If there is nothing there, your service is absent from that selection process.
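A sketch of what such a card might contain, for a hypothetical booking service. Field names follow the A2A specification at the time of writing; verify them against the current spec before publishing:

```json
{
  "name": "Acme Booking Agent",
  "description": "Books and amends appointments for Acme Clinics.",
  "url": "https://example.com/a2a",
  "version": "1.0.0",
  "capabilities": { "streaming": false },
  "defaultInputModes": ["application/json"],
  "defaultOutputModes": ["application/json"],
  "skills": [
    {
      "id": "book-appointment",
      "name": "Book appointment",
      "description": "Creates an appointment given a service type and time window.",
      "tags": ["booking", "scheduling"]
    }
  ]
}
```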

For informational sites, this layer is less pressing. For transactional or service-oriented sites — anything where Stage 5 (Confidence) matters — it is the most important gap to close.

The Page Structure Layer

At the individual page level, agents extract meaning from HTML structure. They rely on semantic elements — <main>, <article>, <nav>, <header>, <section>, <h1> through <h6> — to understand what a page contains and how it is organised.

70% of sites audited lack proper semantic HTML. Most use generic <div> containers with CSS classes for visual hierarchy. Agents parsing served HTML — the static HTML sent from your server before JavaScript executes — cannot distinguish navigation from content from sidebars. The structure that humans see visually does not exist in the HTML.

This is the served HTML versus rendered HTML distinction. Many AI agents — server-side parsers like those behind ChatGPT and Claude — fetch your URL and process raw HTML without executing JavaScript. If your site requires JavaScript to display products, show prices, or render navigation, these agents see nothing. Your carefully crafted user experience is invisible to them.

Even browser-based agents that execute JavaScript need semantic structure. They can see everything humans see, but they parse structure like server-side agents. Visual design cues — colour, spacing, animation — do not help agents understand content purpose.

The practical rule: design for the worst-case agent (served HTML, no JavaScript), and you automatically support all agents.
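A minimal page skeleton using the semantic elements above. Everything an agent needs to separate navigation from content is present in the served HTML, before any JavaScript runs (the page content is placeholder):

```html
<!DOCTYPE html>
<html lang="en">
<head><title>Danube River Cruises | Acme Travel</title></head>
<body>
  <header>
    <nav aria-label="Primary"><!-- site navigation links --></nav>
  </header>
  <main>
    <article>
      <h1>Danube river cruises</h1>
      <section>
        <h2>Itineraries</h2>
        <p>Seven-night itineraries from Passau to Budapest.</p>
      </section>
    </article>
  </main>
  <footer><!-- contact details, legal links --></footer>
</body>
</html>
```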

Audit a sample of your pages. Check whether the HTML uses semantic elements correctly, whether heading hierarchy is logical and unbroken, whether the main content area is identifiable as <main>, and whether navigation, sidebars, and footers are correctly labelled. These are the same checks that WCAG accessibility audits perform — the convergence principle in practice.

The Structured Data Layer

Schema.org markup tells machines not just that something is content, but what kind of content it is. An Article is different from a Product, a LocalBusiness, an Event, or a Service. Each type carries specific properties that agents can read and act on.

55% of sites audited have missing or partial Schema.org coverage: structured data exists on some pages but not others. Product pages carry Schema.org pricing markup, but comparison tables lack it. Event pages mark up dates but not registration URLs. Inconsistent implementation forces agents to guess which pages contain authoritative data.

A page with proper structured pricing metadata answers the question of what something costs in milliseconds at near-zero compute cost. A page without it forces every visiting machine to spend tokens figuring out the price, the currency, and the availability — and to risk getting it wrong. The Danube cruise error, where £2,030 became £203,000 because European decimal formatting was misinterpreted, is not a theoretical risk. It happened.

The six Schema.org types that cover about 90% of what most sites need: Organization/LocalBusiness, Article/BlogPosting, Product/Offer, FAQPage, HowTo, and WebPage/WebSite. Use JSON-LD — it separates structured data from your HTML, making it easier to maintain, simpler to implement, and more reliably parsed.
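Applying the Product/Offer pattern with an explicit ISO 4217 currency code removes exactly the ambiguity behind errors like the Danube cruise misreading. The product, price, and URL below are illustrative:

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Danube River Cruise, 7 nights",
  "offers": {
    "@type": "Offer",
    "price": "2030.00",
    "priceCurrency": "GBP",
    "availability": "https://schema.org/InStock",
    "url": "https://example.com/cruises/danube"
  }
}
</script>
```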

Common gaps to check: articles without Article markup, product pages without Product and Offer markup, contact pages without LocalBusiness or Organization markup, and FAQ content without FAQPage markup. Each gap is an opportunity for an agent to misunderstand what the page contains.

The Accessibility Layer

WCAG compliance and agent discoverability are not separate concerns. The convergence principle — that the techniques which make content accessible to disabled users are the same techniques that make it accessible to AI agents — means that accessibility failures are also machine readability failures.

The overlap is not coincidental. Both groups — disabled users and AI agents — lack access to visual design cues. A missing <main> element forces screen reader users to navigate the entire page to find primary content. It forces agents to do the same. Missing alt text blocks both agents and blind users. Visual-only state indicators exclude both agents and keyboard users.

The audit finding that ties WCAG most directly to agent discoverability is explicit state. 75% of sites audited have it missing: form validation errors display as visual colour changes, checkout progress shows via CSS-animated steppers, and button loading states appear only as spinners. None of this state is written into HTML attributes where agents can read it. State exists visually but not semantically.
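Making state explicit means mirroring each visual cue in an attribute. A sketch of a form field whose error and loading states are exposed semantically rather than by colour or animation alone (field names are placeholders):

```html
<label for="email">Email address</label>
<input id="email" type="email" required
       aria-invalid="true" aria-describedby="email-error">
<p id="email-error" role="alert">Enter a valid email address.</p>

<!-- Loading state reflected in the DOM, not only as a spinner -->
<button type="submit" disabled aria-busy="true">Submitting…</button>
```

Screen readers announce the `role="alert"` message and agents can read `aria-invalid` and `disabled` directly: one implementation, both audiences.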

A WCAG audit of your site is simultaneously an MX audit. Errors in the accessibility report are errors in your machine experience. They are the same problems. One implementation serves both audiences.

What This Means in Practice

A site that has addressed all of these layers — permissive robots.txt, descriptive llms.txt (with HTML equivalent), an agent card for its services, semantic HTML, Schema.org JSON-LD, and WCAG-compliant content — is as visible to AI agents as a well-optimised site is to search engines.

A site that has addressed none of them is invisible to the growing class of agents that act on behalf of users, regardless of how good its content is or how strong its search engine ranking. And unlike humans who persist through bad UX and can be won back with improvements, agents provide no analytics visibility and offer no second chance. First-mover citation advantage creates durable competitive moats — machines that have learned to prefer a competitor are unlikely to periodically re-evaluate the alternatives.

The work is not novel. Most of it is the same structured, semantic, accessible content practice that good web development has always recommended. The urgency is new. As agent-mediated discovery becomes a standard part of how people find and use services, the cost of these gaps grows proportionally.

MX: The Handbook sets out the full framework for designing content that serves both human and machine audiences, across all of these layers, from document metadata to site-level discoverability. MX: The Protocols covers the technical specifications, templates, and phased implementation in detail.

Tom Cranstoun is the Machine Experience Authority and founder of the MX community. His book MX: The Handbook is available now. He consults on MX strategy through CogNovaMX Ltd.
