Show HN: 30k IKEA items in flat text

55 points | tsazan | 1 month ago | huggingface.co

OP here.

I took the unofficial IKEA US dataset (originally scraped by jeffreyszhou) and converted all 30,511 products into a flat, markdown-like protocol called CommerceTXT.

The goal: See if a flatter structure is more efficient for LLM context windows.

The results:

- Size: 30k products across 632 categories.
- Efficiency: The text version uses ~24% fewer tokens (3.6M saved total) compared to the equivalent minified JSON.
- Structure: Files are organized in folders (e.g. /products/category/), which helps with testing hierarchical retrieval routers.
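
To make the kind of comparison behind the efficiency figure concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the product record is made up, the "Key: value" layout is a stand-in rather than the actual CommerceTXT grammar, and `rough_token_count` is a crude regex proxy, not a real LLM tokenizer.

```python
import json
import re

# Hypothetical product record; field names and values are illustrative only.
product = {
    "name": "BILLY Bookcase",
    "price": 59.99,
    "currency": "USD",
    "category": "bookcases",
    "in_stock": True,
}


def to_minified_json(record: dict) -> str:
    """Serialize with no optional whitespace, like minified JSON."""
    return json.dumps(record, separators=(",", ":"))


def to_flat_text(record: dict) -> str:
    """Assumed flat layout: one 'key: value' pair per line (not the real spec)."""
    return "\n".join(f"{key}: {value}" for key, value in record.items())


def rough_token_count(text: str) -> int:
    """Crude proxy: each word and each punctuation mark counts as one token."""
    return len(re.findall(r"\w+|[^\w\s]", text))


json_text = to_minified_json(product)
flat_text = to_flat_text(product)

print("JSON tokens (proxy):", rough_token_count(json_text))
print("Flat tokens (proxy):", rough_token_count(flat_text))
```

The saving in this toy case comes from dropping quotes, braces, and commas. A real measurement would need the target model's own tokenizer, and the ratio will differ from the proxy's.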

The link goes to the dataset on Hugging Face which has the full benchmarks.

Parser code is here: https://github.com/commercetxt/commercetxt

Happy to answer questions about the conversion logic!

34 comments

vachina|1 month ago

There’s already a schema.org spec that defines JSON-LD structured data, which you can embed on each of your product pages to provide a machine-readable interface to your products.

For example, Google’s indexers already use this to surface pricing data. https://developers.google.com/search/docs/appearance/structu...
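
For readers unfamiliar with the approach described here, a minimal sketch of building schema.org `Product` markup in Python follows. The `Product`, `Offer`, `price`, and `priceCurrency` terms come from schema.org; the helper names, product values, and URL are hypothetical.

```python
import json


def product_json_ld(name: str, price: float, currency: str, url: str) -> dict:
    """Build a schema.org Product with a single Offer (illustrative subset of fields)."""
    return {
        "@context": "https://schema.org",
        "@type": "Product",
        "name": name,
        "url": url,
        "offers": {
            "@type": "Offer",
            "price": f"{price:.2f}",
            "priceCurrency": currency,
        },
    }


def as_script_tag(data: dict) -> str:
    """Wrap the JSON-LD in the script element used to embed it in an HTML page."""
    return '<script type="application/ld+json">' + json.dumps(data) + "</script>"


# Hypothetical product and URL, for illustration only.
markup = product_json_ld("BILLY Bookcase", 59.99, "USD", "https://example.com/p/billy")
print(as_script_tag(markup))
```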

tsazan|1 month ago

That's valid for search engines. But if JSON-LD were sufficient for agents, Google wouldn't have launched UCP (Universal Commerce Protocol) yesterday.

reddalo|1 month ago

I don't understand why new proposed standards are still polluting the root namespace (also see llms.txt).

These things should be put under /.well-known [1], not in the root.

[1] https://en.wikipedia.org/wiki/Well-known_URI
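
As a small illustration of the difference, the sketch below builds both locations for a metadata file from any page URL on a site. The `/.well-known/` prefix is the one reserved by RFC 8615; the `commerce.txt` file name is a hypothetical example, not a registered well-known suffix.

```python
from urllib.parse import urljoin


def root_url(page_url: str, filename: str) -> str:
    """Location when the file is placed directly in the site root."""
    return urljoin(page_url, "/" + filename)


def well_known_url(page_url: str, suffix: str) -> str:
    """Location under the /.well-known/ prefix reserved by RFC 8615."""
    return urljoin(page_url, "/.well-known/" + suffix)


# "commerce.txt" is a made-up file name used only for this example.
page = "https://example.com/shop/chairs?page=2"
print(root_url(page, "commerce.txt"))
print(well_known_url(page, "commerce.txt"))
```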

buildbuildbuild|1 month ago

User friendliness. I’ve seen several less-technical people able to quickly access, create, and understand “llms.txt”.

It’s not ideal but representative of the tension between user experience and technical correctness.

dkdcio|1 month ago

I was not aware you shouldn’t do that — what’s the rationale/historical context?

btrettel|1 month ago

Interesting. I had been thinking recently about grep-friendly structured text file formats given the constraints of regex. But I hadn't considered that you could design a structured text file format to be LLM-friendly given token constraints.

tsazan|1 month ago

You're right. If a format is easy to grep, it is almost always cheap to tokenize. We treat token density as a primary design constraint.
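
A minimal sketch of what "easy to grep" can mean in practice: when every field sits on its own line with a fixed prefix, one anchored regex pulls it out without a parser. The sample catalog text and its layout are assumptions for illustration, not the actual CommerceTXT syntax or real prices.

```python
import re

# Hypothetical line-oriented catalog; layout and values are illustrative only.
catalog = """\
# BILLY Bookcase
Price: 59.99 USD
Category: bookcases

# LACK Side table
Price: 12.99 USD
Category: tables
"""

# One field per line means a single anchored pattern is enough.
PRICE_LINE = re.compile(r"^Price: (\d+\.\d{2}) (\w{3})$", re.MULTILINE)


def extract_prices(text: str) -> list[tuple[float, str]]:
    """Return (amount, currency) for every 'Price:' line in the text."""
    return [(float(amount), currency) for amount, currency in PRICE_LINE.findall(text)]


print(extract_prices(catalog))
```

The equivalent shell query would be a plain `grep '^Price: '` over the same file, which is the property the comment above is pointing at.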

JosephRedfern|1 month ago

I've heard that LLMs can perform worse with these more efficient representations compared to e.g. JSON, because they've seen far fewer examples of them during training. Do you know how true that is?

TechSquidTV|1 month ago

Absolutely, but usually when working with a bespoke format for optimization, it's paired with an LLM specifically trained on that format.

tsazan|1 month ago

You are right about cryptic formats. CommerceTXT is semantically structured. Models like GPT, Claude and Gemini understand it out-of-the-box via ICL.

sognetic|1 month ago

Interesting! So did you do any experiments on a relevant subset of the data to test whether LLM performance degrades by introducing a new, presumably unknown to the LLM, format?

tsazan|1 month ago

The 24% token savings come from converting JSON syntax to CommerceTXT.

colinbartlett|1 month ago

Any practical use for this IKEA data specifically?

Or just a handy open data set you could use to prove out the concept?

DennisP|1 month ago

I assumed it's because IKEA is famous for flat packing its furniture.

WildGreenLeave|1 month ago

I've had the idea to set up an AI that automatically (re)designs a room using IKEA stuff. It would definitely help me decorate my room in a better way.

bleonard|1 month ago

A blast from the past. When Taskrabbit was acquired by IKEA, I built several tools that went through the whole catalog via various crawling approaches. One tool estimated how long it would take to put each item together, for an initial training set.

croisillon|1 month ago

years ago i did a small tool that, when you entered a product number, would scan all IKEA-websites with currency Euro and return the prices for each of them ; not that i expected furniture tourism to become a thing but it was funny

tsazan|1 month ago

Reminds me of a friend who built a comment sentiment analyzer years ago. At the time, it looked like great innovation...

chuckadams|1 month ago

Well of course it would be flat text... ;)

usefulposter|1 month ago

"OP here" is the funniest tell that shows up when using an LLM to write a post for HN or Reddit.

It's funny because it makes zero sense in the body of an initial post!

In comments replying to people downthread - maybe. But opening a top-level post with "Original Poster here" is just silly and shows a lack of respect for community etiquette.

https://hn.algolia.com/?dateRange=pastYear&page=0&prefix=tru...

dkoy|1 month ago

Good catch, think you’re on to something

tokai|1 month ago

I just understand it as lightly humorous. Like starting an anecdote with

>be me

Seeing it as a lack of respect is a huge stretch. And kinda conceited that you accuse someone of such, on the basis of a two word opener.