Definition
llms.txt is an informal, openly published convention for a Markdown file served from the root of a website at the path /llms.txt. It gives large language model (LLM) tools and AI-powered discovery systems a structured content map that both humans and machines can read. The convention was proposed in 2024 by Jeremy Howard, co-founder of fast.ai and a prominent figure in applied deep learning, as an analogue to robots.txt (which governs crawler access permissions) and XML sitemaps (which enumerate page URLs for search engine indexing), but adapted to the constraints of LLM inference workflows.
The core premise is that LLMs processing web content for retrieval-augmented generation (RAG) or direct ingestion have a different information need than traditional web crawlers. They benefit from concise, structured descriptions of what a site contains (the site’s purpose, the nature of each section, and links to key content) in a format compact enough to fit within a model’s token-limited context window.
How It Works
The llms.txt file uses a Markdown-based format rather than XML or JSON. A typical file opens with the name of the site or project in an H1 heading, followed by a blockquote containing a short summary, optional free-form detail, and then sections introduced by H2 headings, each holding a list of Markdown links to the site’s most important pages. Each link can be followed by a colon and a brief inline description.
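The structure described above can be read with a few lines of code. The following is a minimal sketch, not a reference parser: the sample file, its example.com URLs, and the names LlmsTxt and parse_llms_txt are invented for illustration, and it assumes the layout of an H1 line, a blockquote summary, and H2 sections of link lists.

```python
import re
from dataclasses import dataclass, field

# Hypothetical sample file; the site name, URLs, and descriptions are invented.
SAMPLE = """\
# Example Glossary

> Plain-language definitions of email and web publishing terms.

## Glossary

- [SPF Record](https://example.com/glossary/spf): How sender policy records work
- [DKIM](https://example.com/glossary/dkim)

## Optional

- [Changelog](https://example.com/changelog): Release history
"""

# Matches "- [title](url)" with an optional ": description" suffix.
LINK_RE = re.compile(
    r"^-\s*\[(?P<title>[^\]]+)\]\((?P<url>[^)\s]+)\)(?::\s*(?P<desc>.*))?$"
)


@dataclass
class LlmsTxt:
    title: str = ""
    summary: str = ""
    sections: dict = field(default_factory=dict)  # heading -> list of link dicts


def parse_llms_txt(text: str) -> LlmsTxt:
    doc = LlmsTxt()
    current = None  # name of the H2 section being read, if any
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("# ") and not doc.title:
            doc.title = line[2:].strip()
        elif line.startswith("> ") and current is None:
            # Blockquote lines before the first section form the summary.
            doc.summary = (doc.summary + " " + line[2:].strip()).strip()
        elif line.startswith("## "):
            current = line[3:].strip()
            doc.sections[current] = []
        elif current is not None:
            m = LINK_RE.match(line)
            if m:
                doc.sections[current].append(
                    {"title": m["title"], "url": m["url"], "desc": m["desc"] or ""}
                )
    return doc


doc = parse_llms_txt(SAMPLE)
print(doc.title, list(doc.sections))
```

Lines that are neither headings, blockquotes, nor link items are ignored here, which keeps the sketch tolerant of free-form detail between the summary and the first section.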
The file may also be accompanied by an extended variant at /llms-full.txt, which includes the full text of key pages rather than just links. This is useful for AI systems that can ingest longer documents in a single pass.
The convention is intentionally simple: no special syntax beyond standard Markdown, no mandatory element beyond the H1 heading naming the site or project, and no required registration or validation step. The proposal is published at llmstxt.org and is designed to be implementable in minutes by any web publisher. Community-developed integrations that auto-generate llms.txt from existing site structure exist for CMS platforms such as WordPress (via plugin) and for web frameworks such as Astro and Next.js.
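Because the format is plain Markdown, generating the file from an existing site structure is largely string assembly. The sketch below is an illustration under assumptions rather than the behaviour of any particular plugin: render_llms_txt is a hypothetical helper, and the site name, URLs, and descriptions are invented.

```python
def render_llms_txt(name, summary, sections):
    """Build llms.txt text.

    sections maps a heading to a list of (title, url, description) tuples;
    an empty description omits the ": description" suffix.
    """
    lines = [f"# {name}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for title, url, desc in links:
            entry = f"- [{title}]({url})"
            if desc:
                entry += f": {desc}"
            lines.append(entry)
        lines.append("")
    # Normalise to exactly one trailing newline.
    return "\n".join(lines).rstrip() + "\n"


# Hypothetical site data, e.g. pulled from a CMS or a sitemap.
text = render_llms_txt(
    "Example Docs",
    "Docs for a hypothetical API.",
    {
        "Docs": [
            ("Quickstart", "https://example.com/quickstart", "Install and first call"),
            ("FAQ", "https://example.com/faq", ""),
        ]
    },
)
print(text)
```

A real integration would typically feed this from page metadata and write the result to the web root so that it is served at /llms.txt.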
Unlike robots.txt, which instructs crawlers on access permissions (what they may or may not fetch), llms.txt is purely declarative and informational: it does not grant or restrict access but signals which content the site owner considers most important for AI systems to understand. No standards body governs llms.txt. By contrast, the Robots Exclusion Protocol behind robots.txt was standardised by the IETF as RFC 9309, published in 2022. LLM crawler compliance with llms.txt is voluntary and varies by operator.
AI systems and products reported to respect or consider llms.txt include Perplexity AI, various RAG-based research assistants, and some implementations of the OpenAI web browsing tool. As of 2025, however, no major LLM provider has formally committed to treating it as a required standard.
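The division of labour between the two files can be made concrete. In the sketch below, a crawler treats the links from llms.txt as suggestions and still consults robots.txt for permission, using Python’s standard urllib.robotparser module. The robots rules, the URLs, the user-agent string ExampleBot, and the helper name fetchable are all assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body for example.com.
ROBOTS = """\
User-agent: *
Disallow: /private/
"""

# Hypothetical URLs taken from the same site's llms.txt.
LLMS_LINKS = [
    "https://example.com/docs/intro",
    "https://example.com/private/notes",
]


def fetchable(urls, robots_text, user_agent):
    """Keep only the URLs that robots.txt allows this user agent to fetch.

    llms.txt says which pages matter; robots.txt says which may be fetched.
    """
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


print(fetchable(LLMS_LINKS, ROBOTS, "ExampleBot"))
```

Being listed in llms.txt does not override a Disallow rule here: the second URL is dropped even though the site owner highlighted it.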
Where You Encounter It
The llms.txt convention is most commonly discussed at the intersection of the SEO, AEO (Answer Engine Optimisation), and technical web publishing communities. It gained significant traction after Jeremy Howard’s initial proposal post, published in September 2024, was widely shared among developers, web publishers, and AI researchers.
For content-rich websites targeting visibility in AI-powered answer surfaces (including Google AI Overviews, Perplexity AI, ChatGPT’s web-browsing mode, Microsoft Copilot’s cited responses, and similar features), llms.txt represents a low-cost signal of content intent. It supplements rather than replaces existing discoverability mechanisms. Structured data via Schema.org (particularly the DefinedTerm, FAQPage, and HowTo types), XML sitemaps, and the quality signals described by Google’s E-E-A-T guidelines remain the primary mechanisms by which both traditional search engines and AI systems evaluate and rank content.
Documentation and hosting platforms, API providers, and developer tool vendors have been among the earliest adopters, as their audience (developers building AI applications) is particularly receptive to the convention. SaaS product documentation sites, glossary collections, and knowledge bases are also well-suited to the format.
Practical Examples
A contest voting platform with an extensive glossary creates an llms.txt file at https://buyvotescontest.com/llms.txt. The file lists the site’s key glossary entries (SPF Record, DKIM, DMARC, Email Confirmation Vote, AI Overviews) with brief descriptions and direct URLs. An AI research assistant crawling the site as part of a RAG pipeline for a query about “email authentication for contest platforms” retrieves the llms.txt file, identifies the relevant glossary entries, and fetches their content pages directly rather than attempting to parse the site’s full HTML structure. As a result, the glossary entries are likely to be represented more accurately in the AI system’s responses than if the assistant had tried to infer site structure from a general crawl.
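The selection step in this scenario can be sketched as a simple keyword match between the query and each entry’s title and description. This is an illustration only: the entries, descriptions, and example.com URLs below are invented rather than taken from the site named above, select_entries is a hypothetical helper, and a production pipeline would more plausibly use embeddings than word overlap.

```python
import re

# Hypothetical llms.txt entries as (title, url, description) tuples.
ENTRIES = [
    ("SPF Record", "https://example.com/glossary/spf",
     "Email authentication record listing permitted senders"),
    ("DKIM", "https://example.com/glossary/dkim",
     "Email authentication using cryptographic signatures"),
    ("AI Overviews", "https://example.com/glossary/ai-overviews",
     "Generated answer summaries shown in search results"),
]


def tokens(text):
    """Lowercase alphanumeric word set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def select_entries(query, entries, limit=3):
    """Return URLs of entries sharing at least one word with the query,
    ordered by overlap (ties keep their original file order)."""
    query_words = tokens(query)
    scored = []
    for title, url, desc in entries:
        overlap = len(query_words & tokens(f"{title} {desc}"))
        if overlap:
            scored.append((overlap, url))
    scored.sort(key=lambda pair: -pair[0])  # list.sort is stable
    return [url for _, url in scored[:limit]]


print(select_entries("email authentication for contest platforms", ENTRIES))
```

Only the selected URLs are then fetched, which is what lets the assistant skip a general crawl of the site.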
A developer building an internal knowledge assistant for a marketing agency implements llms.txt parsing in their RAG pipeline, prioritising pages listed in llms.txt files when multiple pages from the same domain are retrieved for a given query. This gives content-rich publishers that maintain llms.txt files a small but consistent advantage in citation frequency within the assistant’s outputs.
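One way such prioritisation might be implemented is as a small score boost applied after retrieval. The sketch below assumes the retriever returns (url, score) pairs where higher is better; the function name rerank, the boost value of 0.1, and the URLs are hypothetical choices, not details from the scenario.

```python
def rerank(results, listed_urls, boost=0.1):
    """Re-order retrieval results, nudging up pages listed in llms.txt.

    results: list of (url, score) pairs, higher score meaning more relevant.
    listed_urls: URLs that appear in the domain's llms.txt.
    """
    listed = set(listed_urls)
    boosted = [
        (url, score + boost if url in listed else score)
        for url, score in results
    ]
    return sorted(boosted, key=lambda pair: pair[1], reverse=True)


# Hypothetical retriever output for one domain.
results = [
    ("https://example.com/blog/old-post", 0.80),
    ("https://example.com/glossary/dkim", 0.75),
    ("https://example.com/tag/email", 0.60),
]
listed = ["https://example.com/glossary/dkim"]

for url, score in rerank(results, listed):
    print(f"{score:.2f}  {url}")
```

Keeping the boost small means a listed page overtakes only near-equal competitors, which matches the “small but consistent advantage” described above.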
Related Concepts
llms.txt operates at the layer of AI crawler communication. It complements two other layers: the structured semantic vocabulary provided by Schema.org, which signals content type and entity relationships to both search engines and AI systems (typically via JSON-LD), and the content quality signals that Google describes under its E-E-A-T guidelines and its helpful content system, which has been part of Google’s core ranking systems since March 2024. Publishers seeking maximum AI discoverability are commonly advised to maintain all three: a valid llms.txt content map, comprehensive Schema.org structured data, and content that meets the E-E-A-T and helpful content criteria thought to influence citation in AI Overviews and similar answer-engine features.