Hi HN, I built SDF (Structured Data Format), an open protocol that sits between web content and AI agents.
The problem: Every agent that consumes a web page independently fetches HTML, strips boilerplate, extracts entities, and classifies content. A typical page is ~89KB of HTML (~73K tokens). When 100 agents consume the same URL, this extraction happens 100 times with inconsistent results.
What SDF does: Convert once into a schema-validated JSON document (~750 tokens) containing entities, claims, relationships, summaries, and type-specific structured data. Agents consume the pre-extracted representation directly.
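To make the shape concrete, here is a minimal sketch of what a consumer-side structural check on an SDF document could look like. The field names and required-key set are my illustration, not the official schema; real validation would run against the published JSON Schemas.

```python
import json

# Hypothetical SDF document; field names are illustrative, not the spec.
sdf_doc = {
    "sdf_version": "0.1",
    "url": "https://example.com/article",
    "content_type": "article",
    "summary": "One-paragraph abstract of the page.",
    "entities": [{"name": "SDF", "type": "protocol"}],
    "claims": [{"text": "Pages convert to ~750 tokens", "confidence": 0.9}],
    "relationships": [
        {"subject": "agent", "predicate": "consumes", "object": "SDF"}
    ],
}

# Assumed required fields for this sketch.
REQUIRED = {"sdf_version", "url", "content_type", "summary", "entities", "claims"}

def validate(doc: dict) -> list:
    """Return the sorted list of missing required fields (empty = OK)."""
    return sorted(REQUIRED - set(doc))

assert validate(sdf_doc) == []
compact = json.dumps(sdf_doc, separators=(",", ":"))  # compact wire form
```

An agent that trusts the publisher's conversion only needs this kind of cheap structural gate before consuming the document, instead of re-running extraction.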
Results from production deployment (2,335 documents, 10 content types):
99.2% token reduction from HTML
90% extraction accuracy with fine-tuned 1.5B + 3B model cascade
4.1x faster than monolithic 14B baseline
Downstream experiment: general-purpose 7B model scores 0.739 accuracy from SDF vs 0.352 from raw markdown (p < 0.05)
The pipeline runs locally on consumer hardware (dual RTX 3090 Ti). Fine-tuned models are open on HuggingFace (sdfprotocol/sdf-classify, sdfprotocol/sdf-extract). Protocol spec and JSON schemas are on GitHub.
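At a high level, the classify-then-extract cascade can be sketched as below. The stub logic stands in for the fine-tuned sdf-classify (1.5B) and sdf-extract (3B) models; in the real pipeline each stage would call model inference rather than these placeholder heuristics.

```python
# Sketch of a two-stage cascade: a cheap classifier gates a heavier extractor.
# Stub implementations are placeholders for the fine-tuned models.

def classify(text: str) -> str:
    """Stage 1 (stands in for sdf-classify): assign a content type."""
    return "article" if "<article" in text else "snippet"

def extract(text: str, content_type: str) -> dict:
    """Stage 2 (stands in for sdf-extract): fill the type-specific schema."""
    return {
        "content_type": content_type,
        "summary": text[:80],
        "entities": [],
    }

def to_sdf(html: str) -> dict:
    """Run the cascade once per URL; agents then reuse the result."""
    return extract(html, classify(html))

doc = to_sdf("<article>" + "word " * 200 + "</article>")
```

The design point is that the small model handles the easy routing decision, so the larger model only runs with a known target schema, which is where the speedup over a monolithic 14B model comes from.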
I wonder if people will eventually surf ad-free by sniffing out these files. They'd be easy to parse (maybe even easier than the actual article itself), with no ads or other unrelated distractions.
spranab|20 days ago
Protocol spec + schemas: https://github.com/sdfprotocol/sdf
Whitepaper: https://doi.org/10.5281/zenodo.18559223
Models: https://huggingface.co/sdfprotocol
Happy to answer questions about the design decisions, the type system, or the evaluation methodology.
ksaj|20 days ago
spranab|20 days ago