llms.txt

Definition

llms.txt is an informal open standard for a plain-text file hosted at the root of a website, at the path /llms.txt. It is designed to give large language model (LLM) crawlers and AI-powered discovery systems a structured content map that is both human- and machine-readable. The convention was proposed in 2024 by Jeremy Howard, co-founder of fast.ai and a prominent figure in applied deep learning, as an analogue of robots.txt (which governs crawler access permissions) and XML sitemaps (which enumerate page URLs for search engine indexing), adapted to the specific constraints and requirements of LLM inference workflows.

The core premise is that LLMs processing web content for retrieval-augmented generation (RAG) or direct ingestion have a different information need from traditional web crawlers. They benefit from concise, structured descriptions of what a site contains (context about the site's purpose, the nature of each section, and links to key content) in a format that fits efficiently within the token-limited context windows used during crawling or summarisation.

How it works

An llms.txt file uses a Markdown-based format rather than XML or JSON. A typical file opens with an H1 heading that names or briefly describes the site, followed by a blockquote summary paragraph, and then organised sections of Markdown links pointing to the site's most important pages. Each link can include a brief inline description.
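The layout described above can be illustrated with a minimal parsing sketch. The sample file, the example.com URLs, the function name parse_llms_txt, and the "- [title](url): description" link pattern are all assumptions made for illustration; real files may vary in formatting, and this is not a reference parser.

```python
import re

# Hypothetical llms.txt content, invented for this example.
SAMPLE = """\
# Example Site

> A short summary of what the site covers.

## Docs

- [Getting started](https://example.com/start): Install and first steps
- [API reference](https://example.com/api)
"""

# Assumed link shape: "- [title](url)" with an optional ": description" tail.
LINK_RE = re.compile(
    r"^-\s*\[(?P<title>[^\]]+)\]\((?P<url>[^)\s]+)\)(?::\s*(?P<desc>.*))?$"
)


def parse_llms_txt(text):
    """Split an llms.txt document into title, summary, and link sections."""
    result = {"title": None, "summary": None, "sections": {}}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("# ") and result["title"] is None:
            result["title"] = line[2:].strip()
        elif line.startswith("> ") and result["summary"] is None:
            result["summary"] = line[2:].strip()
        elif line.startswith("## "):
            current = line[3:].strip()
            result["sections"][current] = []
        elif current is not None:
            match = LINK_RE.match(line)
            if match:
                result["sections"][current].append(
                    {
                        "title": match.group("title"),
                        "url": match.group("url"),
                        "description": match.group("desc") or "",
                    }
                )
    return result


parsed = parse_llms_txt(SAMPLE)
for section, links in parsed["sections"].items():
    for link in links:
        print(section, link["title"], link["url"])
```

Keeping the parser line-based rather than pulling in a full Markdown library reflects how little structure the format needs: one title, one summary, and sections of links.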

The file can also be accompanied by an extended variant at /llms-full.txt, which contains the full text content of key pages rather than links alone. This is useful for AI systems that can ingest longer documents in a single pass.

The convention is intentionally simple: no special syntax beyond standard Markdown, no mandatory fields beyond a site description and at least one linked URL, and no required registration or validation step. The specification is maintained at llmstxt.org and is designed so that any web publisher can implement it in minutes. CMS platforms and web frameworks, including WordPress (via plugin), Astro, and Next.js, have seen community-developed integrations that auto-generate llms.txt from the existing site structure.
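As a rough illustration of what such auto-generation might involve, the sketch below renders an llms.txt document from a simple in-memory description of a site. The function name build_llms_txt, the input dictionary shape, and the example.com URLs are hypothetical; actual plugins and integrations will read site structure in their own ways.

```python
def build_llms_txt(site_name, summary, sections):
    """Render an llms.txt document from a site description.

    `sections` maps a section heading to a list of page dicts with
    'title', 'url', and an optional 'description' (assumed shape).
    """
    lines = [f"# {site_name}", "", f"> {summary}", ""]
    for heading, pages in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for page in pages:
            entry = f"- [{page['title']}]({page['url']})"
            if page.get("description"):
                entry += f": {page['description']}"
            lines.append(entry)
        lines.append("")
    # Normalise to exactly one trailing newline.
    return "\n".join(lines).rstrip() + "\n"


# Hypothetical site data for the example.
document = build_llms_txt(
    "Example Site",
    "Short summary.",
    {
        "Docs": [
            {
                "title": "Start",
                "url": "https://example.com/start",
                "description": "First steps",
            },
            {"title": "API", "url": "https://example.com/api"},
        ]
    },
)
print(document)
```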

Unlike robots.txt, which instructs crawlers on access permissions (what they may or may not fetch), llms.txt is purely declarative and informational: it does not grant or restrict access but signals which content the site owner considers most important for AI systems to understand. It has no governing standards body (unlike robots.txt, whose Robots Exclusion Protocol is standardised in RFC 9309, published by the IETF in 2022), and LLM crawler compliance with llms.txt is voluntary and varies by operator.
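Because llms.txt carries no access rules, a well-behaved crawler would presumably still consult robots.txt before fetching any URL that llms.txt lists. The sketch below shows one way to combine the two using Python's standard urllib.robotparser; the robots.txt content, the link list, the user agent "ExampleBot", and the helper name allowed_links are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for the example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

# Hypothetical URLs as they might appear in an llms.txt file.
LLMS_LINKS = [
    "https://example.com/docs/start",
    "https://example.com/private/draft",
]


def allowed_links(robots_txt, links, user_agent="ExampleBot"):
    """Keep only the llms.txt links that robots.txt permits fetching."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in links if parser.can_fetch(user_agent, url)]


fetchable = allowed_links(ROBOTS_TXT, LLMS_LINKS)
print(fetchable)
```

Here llms.txt supplies the candidate list and robots.txt remains the only source of permission, which mirrors the declarative versus access-control split described above.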

AI systems and products reported to respect or consider llms.txt include Perplexity AI, various RAG-based research assistants, and some implementations of the OpenAI web browsing tool. As of 2025, however, no major LLM provider has formally committed to treating it as a required standard.

Where you see it

The llms.txt convention is most commonly discussed at the intersection of the SEO, AEO (Answer Engine Optimisation), and technical web publishing communities. It gained significant traction in late 2024, after Jeremy Howard's initial proposal post was widely shared among developers, web publishers, and AI researchers.

For content-rich websites targeting visibility in AI-powered answer surfaces (including Google AI Overviews, Perplexity AI, ChatGPT's web-browsing mode, Microsoft Copilot's cited responses, and similar features), llms.txt offers a low-cost signal of content intent. It supplements existing discoverability mechanisms rather than replacing them: structured data via Schema.org (especially the DefinedTerm, FAQPage, and HowTo types), XML sitemaps, and the semantic signals used by the E-E-A-T framework all remain the primary mechanisms by which traditional search engines and AI systems evaluate and rank content.

Documentation and hosting platforms, API providers, and developer tool vendors have been among the earliest adopters, because their audience (developers building AI applications) is especially receptive to the convention. SaaS product documentation sites, glossary collections, and knowledge bases are also well suited to the format.

Practical example

A contest voting platform with an extensive glossary creates an llms.txt file at https://buyvotescontest.com/llms.txt. The file lists the site's key glossary entries (SPF Record, DKIM, DMARC, Email Confirmation Vote, AI Overviews) with brief descriptions and direct URLs. An AI research assistant crawling the site as part of a RAG pipeline, for a query about "email authentication for contest platforms", retrieves the llms.txt file, identifies the relevant glossary entries, and fetches their content pages directly instead of trying to parse the site's full HTML structure. As a result, the glossary entries are represented more accurately in the AI system's responses than they would have been had the assistant tried to infer the site structure from a general crawl.
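The "identify the relevant glossary entries" step could be approximated with simple keyword overlap, as in the sketch below. The entry descriptions, the example.com URLs, the stopword list, and the min_score threshold are all invented for illustration; a production RAG pipeline would more likely use embeddings or a proper retriever.

```python
import re

# Hypothetical entries as they might be parsed from the llms.txt file.
# The titles come from the example; URLs and descriptions are assumed.
ENTRIES = [
    {"title": "SPF Record", "url": "https://example.com/glossary/spf",
     "description": "Email authentication record listing permitted senders"},
    {"title": "DKIM", "url": "https://example.com/glossary/dkim",
     "description": "Email authentication using cryptographic signatures"},
    {"title": "DMARC", "url": "https://example.com/glossary/dmarc",
     "description": "Policy layer built on SPF and DKIM for email authentication"},
    {"title": "Email Confirmation Vote", "url": "https://example.com/glossary/ecv",
     "description": "A vote that counts only after the voter clicks a link"},
    {"title": "AI Overviews", "url": "https://example.com/glossary/ai-overviews",
     "description": "Generated summaries shown in search results"},
]

STOPWORDS = {"a", "an", "the", "for", "of", "and", "in", "on", "to"}


def tokens(text):
    """Lowercase word set with common stopwords removed."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def relevant_entries(entries, query, min_score=2):
    """Rank entries by how many query words appear in title + description."""
    query_tokens = tokens(query)
    scored = []
    for entry in entries:
        score = len(query_tokens & tokens(entry["title"] + " " + entry["description"]))
        if score >= min_score:
            scored.append((score, entry))
    # sorted() is stable, so ties keep their llms.txt order.
    return [entry for score, entry in sorted(scored, key=lambda pair: -pair[0])]


matches = relevant_entries(ENTRIES, "email authentication for contest platforms")
print([entry["title"] for entry in matches])
```

The assistant would then fetch only the URLs of the matching entries rather than crawling the whole site.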

A developer building an internal knowledge assistant for a marketing agency implements llms.txt parsing in the RAG pipeline, prioritising pages listed in llms.txt files when multiple pages from the same domain are retrieved for a given query. This gives content-rich publishers that maintain llms.txt files a small but consistent advantage in citation frequency within the assistant's outputs.
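One plausible way to implement this prioritisation is a small additive boost applied during reranking, sketched below. The function name prioritise_listed, the boost value of 0.1, the relevance scores, and the example.com URLs are assumptions; the text above does not specify how the developer weights listed pages.

```python
def prioritise_listed(retrieved, listed_urls, boost=0.1):
    """Rerank (url, score) pairs, boosting URLs that appear in llms.txt.

    `boost` is an assumed additive bonus, not a value from any standard.
    """
    listed = set(listed_urls)
    rescored = [
        (url, score + boost if url in listed else score)
        for url, score in retrieved
    ]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


# Hypothetical retrieval results from one domain for a single query.
retrieved = [
    ("https://example.com/blog/old-post", 0.70),
    ("https://example.com/glossary/dmarc", 0.65),
    ("https://example.com/tag/email", 0.40),
]
listed_urls = ["https://example.com/glossary/dmarc"]

ranked = prioritise_listed(retrieved, listed_urls)
print([url for url, score in ranked])
```

A modest boost keeps relevance as the dominant factor, so a listed page only overtakes an unlisted one when their scores are already close, which matches the "small but consistent advantage" described above.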