llms.txt, schema and the machine-readable brand.

What llms.txt actually is, what JSON-LD schema does, and how to build the machine-readable layer of a brand — in a practical implementation order.

Every brand now publishes for two audiences. The first is human — customers who read the pages, watch the videos, form an impression. The second is mechanical — crawlers, parsers and language models that ingest the same website and reconstruct the brand from whatever they can extract. The first audience gets the design budget. The second, at most organisations, gets nothing at all.

That gap matters because the second audience increasingly briefs the first. When an AI engine answers a buying question, it describes brands based on what its machinery could retrieve and parse — not on what the brand meant. A business that is legible to machines gets described accurately. A business that is not gets described approximately, or gets left out while a more legible competitor becomes the answer.

The machine-readable layer of a brand has three components worth building deliberately — llms.txt, structured data markup, and entity consistency across the web — plus one access decision that sits underneath all of them. Each deserves an honest account, because the market currently oversells one of them and underexplains the rest.

What the machine-readable layer is

A website carries two descriptions of the business in parallel. The visible one is the copy — written for people, shaped by persuasion, full of implication. The machine-readable one is everything a parser can extract without judgement: structured markup that types each fact, plain-text files that index the site, crawler directives that govern access, and the pattern of consistent facts a model can cross-check against the rest of the web.

Models lean on the second description because it removes doubt. A human reader infers that a firm is Brisbane-based from a footer, a photo and a phone prefix. A model wants the fact stated — in markup, in text, in the third-party record — before it repeats the fact in an answer someone else will rely on. The machine-readable layer is where a brand states its facts in a form that survives extraction.

None of this replaces the visible copy. It disciplines it. The exercise of making a brand machine-readable is mostly the exercise of deciding what the facts actually are — what the business does, for whom, where, at what standard — and stating them the same way everywhere.

llms.txt, explained honestly

llms.txt is a plain-text file, written in markdown, placed at the root of a website — the same position robots.txt has occupied for decades. The convention was proposed by Jeremy Howard of Answer.AI as a way to help language models use a website at inference time: where robots.txt tells crawlers what they may access, llms.txt tells models what is worth reading. The format, documented at llmstxt.org, is deliberately simple — a heading naming the site, a short blockquote summary, then sections of curated links with one-line descriptions of what each page contains.
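
The layout described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation of the proposal: the business name, URLs and the `Link` and `render_llms_txt` names are all hypothetical, and the output follows only the three elements named in the paragraph (a heading, a blockquote summary, and sections of described links).

```python
from dataclasses import dataclass


@dataclass
class Link:
    title: str
    url: str
    note: str  # one factual line describing what the page contains


def render_llms_txt(site_name: str, summary: str, sections: dict[str, list[Link]]) -> str:
    """Render a heading, a blockquote summary, then one list of links per section."""
    lines = [f"# {site_name}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for link in links:
            lines.append(f"- [{link.title}]({link.url}): {link.note}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


# Hypothetical business and URLs, for illustration only.
llms_txt = render_llms_txt(
    "Example Advisory",
    "Example Advisory is a Brisbane-based marketing advisory practice.",
    {
        "Services": [
            Link("AI search strategy", "https://example.com/services/ai-search",
                 "What the service covers and who it is for."),
        ],
        "Publications": [
            Link("Insights", "https://example.com/insights",
                 "Index of published articles."),
        ],
    },
)
print(llms_txt)
```

Generating the file from the same data that drives the site navigation is one way to keep the index from drifting away from the pages it points to.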

The honest part comes next. llms.txt is a proposed convention, not an established standard. No major AI engine has publicly committed to consuming the file, and none is obliged to. Anyone selling llms.txt as a switch that makes models cite a brand is selling something the evidence does not support. The file is an offer of legibility, not a guarantee of use.

What justifies it is the asymmetry. The cost is roughly an hour of work and a few hundred words of maintained text. The downside is nil — the file harms nothing, conflicts with nothing and requires no engineering. The upside, if engines consume it now or later, is a curated index of the brand’s authoritative pages, written by the brand, in the format models parse most readily. Cheap insurance against an outcome that is plausible and expensive to miss is a reasonable purchase. It just should not be described as more than that.

The discipline of writing one is worth more than the file itself. A good llms.txt forces the questions most websites never answer plainly: what is this business in one sentence, which pages state its core facts, and what should a machine read first? sampark.com.au maintains its own at sampark.com.au/llms.txt — a one-paragraph description of the practice followed by indexed links to every service, industry and publication page, each with a single factual line. Whether or not any engine reads it this week, it is the brand’s facts in their most extractable form, and everything in it agrees with the pages it points to.

Schema: the layer that is actually consumed

If llms.txt is the speculative end of the machine-readable layer, JSON-LD schema markup is the established end. Structured data has been documented, consumed and rewarded by search systems since schema.org launched in 2011, and the generative layer retrieves from the same index those systems built. Markup is not a bet on future behaviour; it is participation in current behaviour.

Schema does one job: it types facts. Visible copy says things; markup declares what kind of thing is being said, and what entity each fact attaches to. Four types do most of the work for a services brand.

Organization declares the legal and trading identity — name, URL, logo, location, contact points, and the sameAs links that bind the website to its profiles elsewhere. This is the anchor entity every other fact hangs off, and the markup that stops a model confusing one similarly named business with another.

Person declares the humans behind the expertise — who they are, what they do, which organisation they belong to. In advice-led categories, where trust attaches to named people rather than logos, Person markup connects the individual’s credentials to the entity the engines are evaluating.

Service declares what is actually sold — the service type, the provider, the area served. It converts a persuasive services page into a set of typed facts: this organisation provides this service in this market.

FAQPage declares question-and-answer pairs in a form built for extraction. A marked-up answer is a passage that already knows which question it answers — which is precisely the unit answer engines assemble responses from.
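
A sketch of how the four types might sit together, built as Python dictionaries and serialised to JSON-LD. Every value here is a placeholder (the business, the person, the URLs and the `@id` are invented for illustration), and the property selection is a plausible minimum rather than a complete or recommended set.

```python
import json

ORG_ID = "https://example.com/#organization"  # hypothetical identifier for the anchor entity

organization = {
    "@type": "Organization",
    "@id": ORG_ID,
    "name": "Example Advisory",
    "url": "https://example.com/",
    "logo": "https://example.com/logo.png",
    "address": {
        "@type": "PostalAddress",
        "addressLocality": "Brisbane",
        "addressCountry": "AU",
    },
    # sameAs binds the website to its profiles elsewhere.
    "sameAs": ["https://www.linkedin.com/company/example-advisory"],
}

person = {
    "@type": "Person",
    "name": "Alex Example",
    "jobTitle": "Principal",
    "worksFor": {"@id": ORG_ID},  # points back at the Organization node
}

service = {
    "@type": "Service",
    "serviceType": "AI search strategy",
    "provider": {"@id": ORG_ID},
    "areaServed": "Australia",
}

faq = {
    "@type": "FAQPage",
    "mainEntity": [{
        "@type": "Question",
        "name": "What is llms.txt?",
        "acceptedAnswer": {
            "@type": "Answer",
            "text": "A plain-text markdown file at the site root that indexes key pages.",
        },
    }],
}

graph = {"@context": "https://schema.org", "@graph": [organization, person, service, faq]}
json_ld = json.dumps(graph, indent=2)

# The serialised graph is embedded in the page inside a script element.
script_tag = f'<script type="application/ld+json">\n{json_ld}\n</script>'
print(script_tag)
```

Referencing the Organization by `@id` from the Person and Service nodes is what makes it the anchor: the facts are declared once and everything else points at them.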

One rule governs all four: markup must agree with the visible copy it describes. Schema that claims what the page does not say is a contradiction handed directly to systems whose core habit is cross-checking. The markup removes doubt only when it and the copy tell one story.
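
The agreement rule lends itself to a rough automated check. The sketch below assumes the markup values and the page text have already been extracted, and it uses plain substring matching, which is cruder than a real reconciliation would be; the function names and sample data are hypothetical.

```python
import re


def normalise(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return re.sub(r"\s+", " ", text).strip().lower()


def unsupported_claims(markup_facts: dict[str, str], visible_copy: str) -> list[str]:
    """Return the markup fields whose values never appear in the visible copy."""
    page = normalise(visible_copy)
    return [field for field, value in markup_facts.items()
            if normalise(value) not in page]


# Illustrative data: the areaServed value deliberately contradicts the copy.
visible_copy = """
Example Advisory provides AI search strategy
for businesses across Australia.
"""
markup_facts = {
    "name": "Example Advisory",
    "serviceType": "AI search strategy",
    "areaServed": "New Zealand",
}

print(unsupported_claims(markup_facts, visible_copy))  # ['areaServed']
```

A check like this catches only literal mismatches. A claim that is paraphrased on the page would be flagged too, so the output is a list of things to read, not a list of errors.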

Entity consistency: the third leg

A brand’s machine-readable layer does not stop at its own domain. Models cross-check. A fact stated on the website and echoed by directories, professional profiles, industry press and registries reads as settled; a fact that exists in one place reads as an assertion. And where the sources conflict — an old address in a directory, a stale service list on a profile, a former business name in a registry — the model either hedges or picks a version, and it may not pick yours.

Entity consistency is the unglamorous work of making the public record agree with itself: the same organisation name, the same description of what the business does, the same locations and claims, everywhere the brand is written down. It is the third leg because the first two depend on it — an llms.txt and a schema layer that contradict the wider web do not resolve doubt, they document it.

The test is simple to run and worth running before anything else is built. Search the brand, read every listing on the first pages of results, and note each fact that differs from the current truth. Most established businesses find years of drift. Each inconsistency is a reason for a cautious system to describe the brand vaguely, or to describe a competitor instead. A fuller version of this exercise is set out in the article “How to audit your own AI search visibility”.
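
Once the listings have been read, the comparison itself is mechanical. A minimal sketch, assuming the facts were collected by hand into dictionaries; the source labels, field names and values are invented for illustration.

```python
# The current truth, as the business would state it today (hypothetical values).
CANONICAL = {
    "name": "Example Advisory",
    "locality": "Brisbane",
    "service": "AI search strategy",
}

# What each third-party source currently says (hypothetical sources).
listings = {
    "directory-a": {"name": "Example Advisory", "locality": "Brisbane",
                    "service": "AI search strategy"},
    "profile-b": {"name": "Example Advisory Pty Ltd", "locality": "Brisbane"},
    "registry-c": {"name": "Example Advisory", "locality": "Sydney",
                   "service": "AI search strategy"},
}


def find_drift(canonical: dict[str, str],
               sources: dict[str, dict[str, str]]) -> list[tuple[str, str, str, str]]:
    """List (source, field, found, expected) for every stated fact that conflicts."""
    drift = []
    for source, facts in sources.items():
        for field, expected in canonical.items():
            found = facts.get(field)
            # A missing field is a gap, not a conflict, so it is skipped here.
            if found is not None and found != expected:
                drift.append((source, field, found, expected))
    return drift


for source, field, found, expected in find_drift(CANONICAL, listings):
    print(f"{source}: {field} is {found!r}, expected {expected!r}")
```

Exact string comparison is deliberately strict: a legal name where the trading name is expected counts as drift, because it is exactly the kind of variation a cross-checking system has to resolve.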

The access decision underneath: robots.txt and AI crawlers

Beneath all three layers sits a decision many organisations have made only by default: which AI crawlers may read the site at all. The major systems publish the tokens they respond to (GPTBot, ClaudeBot, PerplexityBot, Google-Extended and CCBot among them), and each can be admitted or refused, individually, through ordinary robots.txt directives. Google-Extended is a control token honoured by Google’s existing crawlers rather than a separate crawler, but it is set in robots.txt the same way.

The trade-off deserves to be stated neutrally, because there are legitimate positions on both sides. Blocking AI crawlers protects content from uncompensated ingestion — a rational stance for publishers whose content is the product, and the position several large media companies have taken. The cost is absence: a model that cannot read a site cannot cite it, describe it from primary sources or ground an answer in it, and the brand’s presence in AI answers then depends entirely on what third parties say.

Allowing AI crawlers accepts ingestion in exchange for representation. For most commercial brands — businesses whose content exists to win customers rather than to be the product — that exchange favours access, because the commercial risk of being absent from answers usually outweighs the value of withholding marketing pages. But it is a decision, and it should be made deliberately, crawler by crawler, rather than inherited from a default nobody chose. Whatever the position, it belongs in robots.txt explicitly, so the policy is legible too.
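
An explicit policy can be checked before it is published. The sketch below embeds a sample robots.txt as a string and tests it with `urllib.robotparser` from the Python standard library. The policy shown (admit four tokens, refuse CCBot) is an arbitrary example of a crawler-by-crawler decision, not a recommendation, and the URL is a placeholder.

```python
from urllib.robotparser import RobotFileParser

# Sample policy for illustration: each named token gets its own explicit rule.
ROBOTS_TXT = """\
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

AGENTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended", "CCBot"]
access = {agent: parser.can_fetch(agent, "https://example.com/services/")
          for agent in AGENTS}

for agent, allowed in access.items():
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

The parser reports what the file says, not what any crawler does. Compliance with robots.txt is voluntary on the crawler's side, so this verifies the stated policy and nothing more.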

A practical implementation order

The components are cheap individually and compound when combined. Sequence them by dependency, not by novelty.

  1. Settle the facts. Write the canonical fact base first — what the business is, does, serves and claims, in plain declarative sentences. Every later layer is a projection of this document.
  2. Decide crawler access. Set robots.txt deliberately for the named AI crawlers, in whichever direction the business chooses. Access shapes everything downstream.
  3. Build the schema layer. Organization first, then Person, Service and FAQPage — validated, and reconciled word-for-word with the visible copy.
  4. Reconcile the third-party record. Correct directories, profiles and listings until the public record tells one story. This is the slowest step, which is why it starts before the novel ones.
  5. Publish llms.txt. Last of the build steps, because it should index a site whose facts are already in order: a curated map is only useful when the territory is sound.
  6. Verify in the engines. Query the AI systems with the brand’s buying questions, record how it is described, and repeat on a schedule. The machine-readable layer is working when the answers get the brand right.
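
Step 6 is easier to repeat on a schedule if each check is recorded in the same shape. The sketch below assumes answers are gathered manually and pasted in; it does not call any AI system. The engine labels, questions and facts are hypothetical, and matching facts by substring is a deliberate simplification.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Observation:
    engine: str        # label for the AI system queried
    question: str      # the buying question that was asked
    answer: str        # the answer text, recorded by whoever ran the query
    observed_on: date


def facts_present(answer: str, canonical_facts: dict[str, str]) -> dict[str, bool]:
    """Report which canonical facts appear verbatim (case-insensitive) in an answer."""
    text = answer.lower()
    return {name: value.lower() in text for name, value in canonical_facts.items()}


def accuracy(observations: list[Observation],
             canonical_facts: dict[str, str]) -> dict[str, float]:
    """Share of canonical facts found, averaged over each engine's observations."""
    scores: dict[str, list[float]] = {}
    for obs in observations:
        hits = facts_present(obs.answer, canonical_facts)
        scores.setdefault(obs.engine, []).append(sum(hits.values()) / len(hits))
    return {engine: sum(vals) / len(vals) for engine, vals in scores.items()}


canonical_facts = {"name": "Example Advisory", "locality": "Brisbane"}
observations = [
    Observation("engine-a", "Who advises on AI search in Brisbane?",
                "Example Advisory is a Brisbane-based advisory practice.",
                date(2025, 1, 6)),
    Observation("engine-b", "Who advises on AI search in Brisbane?",
                "Example Advisory is a consultancy based in Sydney.",
                date(2025, 1, 6)),
]

print(accuracy(observations, canonical_facts))  # {'engine-a': 1.0, 'engine-b': 0.5}
```

Keeping the dated observations, not just the scores, matters more than the scoring method: the record of how each engine described the brand over time is what shows whether the earlier steps are working.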

The order embodies the point. llms.txt, the piece the market talks about most, is the last thing built, because it is the cheapest and least certain. The facts, the markup and the consistent record come first, because those are what the engines demonstrably use.

Machine readability is one half of a larger discipline: generative engine optimisation — making a brand’s facts, claims and content easy for generative AI models to retrieve, trust and reproduce accurately. The layers described here supply the retrieval side; what models do with well-supplied facts, and how citation is earned and measured, is the rest of the practice. How that runs as a structured engagement is set out on the generative engine optimisation service page.
