fbouvier | 1 month ago
Yes, HTML is too heavy and too expensive for LLMs. We are working on a text-based format more suitable for AI.

httpteapot | 1 month ago
What do you think of the DeepSeek OCR approach, where they say that vision tokens might compress a document better than its pure text representation?
https://news.ycombinator.com/item?id=45640594
I've spent some time feeding LLMs scraped web pages, and I've found that retaining some style information (text size, visibility, decoration, image content) is non-trivial.

fbouvier | 1 month ago
Keeping some kind of style information is definitely important to understand the semantics of the webpage.