Use the Accept Header to Serve Markdown Instead of HTML to LLMs

74 points | hahnbee | 5 months ago | skeptrune.com

61 comments

foxfired|5 months ago

I think there is a problem of incentive here. When we made our websites Search Engine Optimized, the incentive was for Google to understand our content and bring traffic our way. When you optimize your content for LLMs, it only improves their product, and you get nothing in return.

naet|5 months ago

I do dev work for the marketing department of a large company, and there is a lot of talk about optimizing for LLMs/AI. ChatGPT can drive sales in the same way a blog post indexed by Google can.

If a customer asks the AI what product can solve their problem and it replies with our product, that is a huge win.

If your business is SEO spam with online ads, ChatGPT might eat it. But if your business is selling some product, ChatGPT might help you sell it.

CGamesPlay|5 months ago

But software documentation is a prime example of when the incentives don't have any problems. I want my docs to be more accessible to LLMs, so more people use my software, so my software gets more mindshare, so I get more paying customers on my enterprise support plan.

skeptrune|5 months ago

This isn't true. ChatGPT and Gemini link to sites in a similar way to how search engines have always done it. You can see the traffic show up in ahrefs or semrush.

foxyv|5 months ago

If you are selling advertising, then I agree. However, if you are selling a product to consumers then no. Ask an LLM "What is the best refrigerator on the market." You will get various answers like:

> The best refrigerator on the market varies based on individual needs, but top brands like LG and Samsung are highly recommended for their innovative features, reliability, and energy efficiency. For specific models, consider LG's Smart Standard-Depth MAX™ French Door Refrigerator or Samsung's smart refrigerators with internal cameras.

Optimizing your site for LLMs means that you can direct their gestalt thinking towards your brand.

userbinator|5 months ago

And neither of those two ultimately helps the humans who are actually looking for something. You have a finite amount of time to spend on optimising for humans, or for search engines (and now LLMs), and unfortunately many chose the latter, which has just led to plenty of spam in the search results.

Yes, SEO can bring traffic to your site, but if your visitors see nothing of value, they'll quickly leave.

shpx|5 months ago

You get to live in a world where other people are slightly more productive.

burcs|5 months ago

Really cool idea

Humans get HTML, bots get markdown. Two tiny tweaks I’d make...

Send `Vary: Accept` so caches don’t mix Markdown and HTML.

Expose it with a `Link: …; rel="alternate"; type="text/markdown"` header so it’s easy to discover.
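
As a rough sketch of those two tweaks, the function below picks a body based on the `Accept` header and always attaches `Vary: Accept` plus a `Link` header advertising the Markdown alternate. The names `wants_markdown` and `build_response` and the `/post.md` URL are made up for illustration, and the `Accept` check is deliberately naive (it ignores q-values).

```python
def wants_markdown(accept: str) -> bool:
    """Naive check: True if any listed media type is exactly text/markdown."""
    media_types = (part.split(";")[0].strip().lower() for part in accept.split(","))
    return "text/markdown" in media_types


def build_response(accept: str, html_body: str, markdown_body: str, markdown_url: str):
    """Return (headers, body) for one request. All names here are hypothetical."""
    headers = {
        # Tell caches that the response depends on the Accept request header.
        "Vary": "Accept",
        # Advertise the Markdown representation so clients can discover it.
        "Link": f'<{markdown_url}>; rel="alternate"; type="text/markdown"',
    }
    if wants_markdown(accept):
        headers["Content-Type"] = "text/markdown; charset=utf-8"
        return headers, markdown_body
    headers["Content-Type"] = "text/html; charset=utf-8"
    return headers, html_body


headers, body = build_response(
    "text/markdown, text/plain", "<h1>Hi</h1>", "# Hi", "/post.md"
)
print(headers["Content-Type"])
print(body)
```

Setting `Vary` on both branches matters: a cache that stored only the HTML variant without it could hand that HTML to a later client asking for Markdown.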

Rohansi|5 months ago

Would be nice for humans to get the markdown version too. Once it's rendered you get a clean page.

yawaramin|5 months ago

This person hypermedias

skeptrune|5 months ago

There was a lot of conversation about this on X over the last couple of days, and an `Accept` request header including "text/markdown, text/plain" has emerged as a kind of new standard for AI agents requesting content. The point is that they don't burn unnecessary inference compute processing HTML attributes and CSS.

- https://x.com/bunjavascript/status/1971934734940098971

- https://x.com/thdxr/status/1972421466953273392

- https://x.com/mintlify/status/1972315377599447390
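
For anyone who wants to see what such a request looks like from the client side, here is a minimal sketch using only Python's standard library. The helper name `build_markdown_request` and the `example.com` URL are placeholders, and whether Markdown actually comes back depends entirely on the server.

```python
import urllib.request


def build_markdown_request(url: str) -> urllib.request.Request:
    """Build a GET request that asks for Markdown first, then plain text."""
    return urllib.request.Request(
        url,
        headers={"Accept": "text/markdown, text/plain"},
    )


req = build_markdown_request("https://example.com/docs/intro")
print(req.get_header("Accept"))

# To actually fetch (needs network access and a server that negotiates):
# with urllib.request.urlopen(req) as resp:
#     print(resp.headers.get("Content-Type"))
#     print(resp.read().decode("utf-8"))
```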

hahnbee|5 months ago

Keep us posted on how this change impacts your GEO!

Kimitri|5 months ago

The concept is called content negotiation. We used to do this when we wanted to serve our content as XHTML to clients preferring that over HTML. It's nice to see it return as I always thought it was quite cool.
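
For readers who haven't seen it, here is a simplified sketch of how a server might choose between representations from the `Accept` header, q-values included. The function names are hypothetical, and this ignores media type parameters other than `q`, so treat it as an illustration and not a complete implementation of the HTTP rules.

```python
def parse_accept(accept: str) -> list[tuple[str, float]]:
    """Split an Accept header into (media_range, q) pairs; q defaults to 1.0."""
    ranges = []
    for part in accept.split(","):
        pieces = [p.strip() for p in part.split(";")]
        if not pieces[0]:
            continue
        q = 1.0
        for param in pieces[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # Treat a malformed q-value as "not acceptable".
        ranges.append((pieces[0].lower(), q))
    return ranges


def quality_for(media_type: str, ranges: list[tuple[str, float]]) -> float:
    """Most specific match wins: exact type, then type/*, then */*."""
    main = media_type.split("/")[0]
    for pattern in (media_type, f"{main}/*", "*/*"):
        for media_range, q in ranges:
            if media_range == pattern:
                return q
    return 0.0


def choose(accept: str, offered: list[str]):
    """Return the offered type with the highest q, or None if none is acceptable."""
    ranges = parse_accept(accept)
    best, best_q = None, 0.0
    for candidate in offered:
        q = quality_for(candidate, ranges)
        if q > best_q:  # Ties keep the earlier (server-preferred) candidate.
            best, best_q = candidate, q
    return best


print(choose("application/xhtml+xml, text/html;q=0.9", ["text/html", "application/xhtml+xml"]))
print(choose("text/markdown, text/plain", ["text/html", "text/markdown"]))
```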

skeptrune|5 months ago

Agreed! I love that such a tried and true web standard is making a comeback because of AI.

pabs3|5 months ago

Content negotiation is also good for choosing human languages; unfortunately, the browser interfaces for it are terrible.

klodolph|5 months ago

I don’t understand why the agents requesting HTML can’t extract text from HTML themselves. You don’t have to feed the entire HTML document to your LLM. If that’s wasteful, why not have a little bit of glue that does some conversion?
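
To make "a little bit of glue" concrete, here is a minimal sketch using only Python's standard library. The class and function names are invented, and the set of skipped tags is an assumption about what counts as page chrome. It drops all attributes (including class lists) and keeps only visible text.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping tags assumed to hold non-content."""

    SKIP = {"script", "style", "nav", "header", "footer"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


page = (
    "<html><head><style>p{color:red}</style></head><body>"
    '<nav><a href="/">Home</a></nav>'
    '<p class="text-lg font-bold tracking-tight">Hello world</p>'
    "<script>track()</script></body></html>"
)
print(html_to_text(page))
```

This loses headings, links, and lists, which is where a real HTML-to-Markdown converter earns its keep.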

simonw|5 months ago

Converting HTML into Markdown isn't particularly hard. Two methods I use:

1. The Jina reader API - https://jina.ai/reader/ - add r.jina.ai to any URL to run it through their hosted conversion proxy, eg https://r.jina.ai/www.skeptrune.com/posts/use-the-accept-hea...

2. Applying Readability.js and Turndown via Playwright. Here's a shell script that does that using my https://shot-scraper.datasette.io tool: https://gist.github.com/simonw/82e9c5da3f288a8cf83fb53b39bb4...
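
A tiny sketch of the first method, assuming the proxy accepts the full target URL appended after `https://r.jina.ai/` (check the Jina docs for current behavior and limits). The helper name `jina_reader_url` and the `example.com` URL are placeholders of mine, not part of any Jina client library.

```python
JINA_READER_PREFIX = "https://r.jina.ai/"


def jina_reader_url(url: str) -> str:
    """Prefix a page URL so it is fetched through the hosted reader proxy."""
    return JINA_READER_PREFIX + url


print(jina_reader_url("https://example.com/some/post"))

# Fetching the result is then an ordinary GET (needs network access):
# import urllib.request
# with urllib.request.urlopen(jina_reader_url("https://example.com/some/post")) as resp:
#     print(resp.read().decode("utf-8"))
```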

skeptrune|5 months ago

It's always better for the agent to have fewer tools, and this approach means you get to avoid adding a "convert HTML to markdown" one, which improves efficiency.

Also, I doubt most large-scale scrapers are running in agent loops with tool calls, so this is probably necessary for those at a minimum.

stebalien|5 months ago

Or one can just use semantic HTML; it's easy enough to convert semantic HTML into markdown with a tool like pandoc. That would also help screen readers, browser "reader modes", text-based web browsers, etc.
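
A minimal sketch of that conversion, assuming `pandoc` is installed and on `PATH`. The wrapper name `html_to_markdown` is hypothetical. It pipes HTML to pandoc on stdin and reads Markdown from stdout.

```python
import shutil
import subprocess

# Read HTML from stdin, write Markdown to stdout.
PANDOC_CMD = ["pandoc", "--from=html", "--to=markdown"]


def html_to_markdown(html: str) -> str:
    """Pipe HTML through pandoc; raises if pandoc is missing or fails."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc is not installed or not on PATH")
    result = subprocess.run(
        PANDOC_CMD, input=html, capture_output=True, text=True, check=True
    )
    return result.stdout


# Only run the demo when pandoc is actually available.
if shutil.which("pandoc"):
    print(html_to_markdown("<article><h1>Title</h1><p>Some <em>text</em>.</p></article>"))
```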

NathanFlurry|5 months ago

We’re doing this on https://rivet.dev now. I did not realize how much context bloat we had since we were using Tailwind.

skeptrune|5 months ago

It is crazy how badly Tailwind bloats HTML. Tradeoffs!

anabis|5 months ago

The OpenAI cookbook says LLMs understand XML better than Markdown text, so maybe serve that too? It would need to be more tightly specified and structured, though, and not HTML.

onion2k|5 months ago

> OpenAI cookbook says LLMs understand XML better than Markdown text.

Yes, for prompts. Given how little XML is out on the public internet, it'd be surprising if that also applied to data ingested by web scraping functions. It'd be odd if Markdown worked better than HTML, to be honest, but maybe Markdown also changes the content being served, e.g. there's no menu, header, or footer sent with the body content.