
Show HN: Convert HTML DOM to semantic markdown for use in LLMs

146 points| leroman | 1 year ago |github.com

56 comments

[+] mistercow|1 year ago|reply
This is cool. When dealing with tables, you might want to explore departing from markdown. I’ve found that LLMs tend to struggle with tables that have large numbers of columns containing similar data types. Correlating a row is easy enough, because the data is all together, but connecting a cell back to its column becomes a counting task, which appears to be pretty rough.

A trick I’ve found seems to work well is leaving some kind of id or coordinate marker on each column, and adding that to each cell. You could probably do that while still having valid markdown if you put the metadata in HTML comments, although it’s hard to say how an LLM will do at understanding that format.
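A rough sketch of that marker idea, assuming the table has already been parsed into rows of strings (the comment-based marker syntax here is illustrative, not any standard):

```typescript
// Tag each header with an id (c1, c2, ...) and repeat that id in every cell,
// so the model can match a cell to its column without counting.
// HTML comments keep the markdown valid for human readers.
function addColumnMarkers(rows: string[][]): string {
  const [header, ...body] = rows;
  const ids = header.map((_, i) => `c${i + 1}`);
  const line = (cells: string[]) => `| ${cells.join(" | ")} |`;
  return [
    line(header.map((h, i) => `${h} <!-- ${ids[i]} -->`)),
    line(header.map(() => "---")),
    ...body.map((row) => line(row.map((cell, i) => `${cell} <!-- ${ids[i]} -->`))),
  ].join("\n");
}

const md = addColumnMarkers([
  ["Name", "Qty"],
  ["Apples", "3"],
]);
console.log(md);
// → | Name <!-- c1 --> | Qty <!-- c2 --> |
//   | --- | --- |
//   | Apples <!-- c1 --> | 3 <!-- c2 --> |
```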

[+] michaelmior|1 year ago|reply
SpreadsheetLLM[0] might be worth looking into. It's designed for Excel (and similar) spreadsheets, so I'd imagine you could do something far simpler for the majority of HTML tables.

[0] https://arxiv.org/abs/2407.09025v1

[+] msnkarthik|1 year ago|reply
You're spot on about the challenges LLMs face with complex markdown tables, especially when column counts rise and data types are similar. The "counting task" for column correlation is a real pain point; it's like the LLM loses track of where it is in the data grid. Your ID/coordinate marker idea is clever! It provides explicit context that LLMs seem to crave. Using HTML comments for this metadata is an interesting approach: it keeps the markdown valid for human readability, but I share your uncertainty about how consistently LLMs would parse and utilize it.

Some other avenues worth exploring:

- Alternative formats: Have you experimented with formats like CSV or JSON for feeding tabular data to LLMs? They might offer a more structured representation that's easier to parse.
- Pre-processing: Could we pre-process the table into a more LLM-friendly representation? For example, converting it into a list of dictionaries, where each dictionary represents a row and the keys are column names.
- Prompt engineering: Perhaps there are specific prompts or instructions that can guide LLMs to better handle large tables within markdown.

It seems like there's room for innovation in how we bridge the gap between human-readable markdown tables and the structured data LLMs thrive on.
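The "list of dictionaries" pre-processing idea can be sketched as follows, assuming the table has already been parsed into a header row plus body rows:

```typescript
// Represent each table row as an object keyed by column name, then hand the
// model JSON instead of a markdown grid. Column membership is explicit in
// every row, so no counting is needed.
interface Table {
  header: string[];
  rows: string[][];
}

function tableToRecords(t: Table): Record<string, string>[] {
  return t.rows.map((row) =>
    Object.fromEntries(t.header.map((col, i) => [col, row[i] ?? ""]))
  );
}

const records = tableToRecords({
  header: ["Name", "Qty"],
  rows: [["Apples", "3"], ["Pears", "5"]],
});
console.log(JSON.stringify(records));
// → [{"Name":"Apples","Qty":"3"},{"Name":"Pears","Qty":"5"}]
```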
[+] mattding|1 year ago|reply
Do you have any numbers re: markdown performance, or is this anecdotal? I'm running a similar experiment right now and would love to hear anything else you've tried.
[+] leroman|1 year ago|reply
Thanks for sharing, will look into adding this as a flag in the options!
[+] gmaster1440|1 year ago|reply
> Semantic Clarity: Converts web content to a format more easily "understandable" for LLMs, enhancing their processing and reasoning capabilities.

Are there any data or benchmarks available that show what kind of text content LLMs understand best? Is it generally understood at this point that they "understand" markdown better than html?

[+] mistercow|1 year ago|reply
I haven’t found any specific research, but I suspect it’s actually the opposite, particularly for models like Claude, which seem to have been specifically trained on XML-like structures.

My hunch is that the fact that HTML has explicit matching closing tags makes it a bit easier for an LLM to understand structure, whereas markdown tends to lean heavily on line breaks. That works great when you’re viewing the text as a two dimensional field of pixels, but that’s not how LLMs see the world.

But I think the difference is fairly marginal, and my hunch should be taken with a grain of salt. From experience, all I can say is that I’ve seen stripped down HTML work fine, and I’ve seen markdown work fine. The one place where markdown clearly shines is that it tends to use fewer tokens.

[+] leroman|1 year ago|reply
Author here. It's a good point to have some benchmarks (which I don't have..), but I think it's well understood that minimizing noise by reducing tokens improves answer quality. And I think by now LLMs are well versed in Markdown, as it's the preferred markup language they use when generating responses.
[+] sigmoid10|1 year ago|reply
They understand best whatever was used during their training. For OpenAI's GPTs we don't really know since they don't disclose anything anymore, but there are good reasons to assume they used markdown or something closely related.
[+] DeveloperErrata|1 year ago|reply
It's neat to see this getting attention. I've used similar techniques in production RAG systems that query over big collections of HTML docs. In our case the primary motivator was higher token efficiency (ie to represent the same semantic content but with a smaller token count).

I've found that LLMs are often bad at understanding "standard" markdown tables (of which there are many different types, see: https://pandoc.org/chunkedhtml-demo/8.9-tables.html). In our case, we found the best results when keeping the HTML tables in HTML, only converting the non-table parts of the HTML doc to markdown. We also strip the table tags of any attributes, with the exception of colspan and rowspan, which are semantically meaningful for more complicated HTML tables. I'd be curious if there are LLM performance differences between the approach the author uses here (seems like it's based on repeating column names for each cell?) and just preserving the original HTML table structure.
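The attribute-stripping step described above might look roughly like this; a regex-based sketch for brevity (a real implementation should walk the DOM rather than pattern-match tags):

```typescript
// Keep the table as HTML but drop every attribute except colspan/rowspan,
// which carry real structure. class/style/id/data-* are noise for the LLM.
function stripTableAttrs(html: string): string {
  return html.replace(
    /<(table|thead|tbody|tr|th|td)\b([^>]*)>/gi,
    (_m: string, tag: string, attrs: string) => {
      // Retain only colspan/rowspan; discard everything else.
      const kept = attrs.match(/\b(?:colspan|rowspan)="?\d+"?/gi) ?? [];
      return kept.length ? `<${tag} ${kept.join(" ")}>` : `<${tag}>`;
    }
  );
}

console.log(
  stripTableAttrs('<td class="x" style="color:red" colspan="2">A</td>')
);
// → <td colspan="2">A</td>
```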

[+] richardreeze|1 year ago|reply
This is really cool. I've already implemented it in one of my tools (I found it to work better than the Turndown/ Readability combination I was previously using).

One request: It would be great if you also had an option for showing the page's schema (which is contained inside the HTML).

[+] leroman|1 year ago|reply
Thanks for sharing!!

Would be really helpful if you opened an issue on GitHub with a specific example; happy to look into that!

[+] la_fayette|1 year ago|reply
The scoring approach seems interesting for extracting the main content of web pages. I am aware of the large body of decades of research on that subject, with sophisticated image- or NLP-based approaches. Since this extraction is critical to the quality of the LLM response, it would be good to know how well this performs. E.g., you could test it against a test dataset (https://github.com/scrapinghub/article-extraction-benchmark). Also, you could provide the option to plug in another extraction algorithm, since there are other implementations available... just some ideas for improvement...
[+] leroman|1 year ago|reply
This totally makes sense, I will look into adding support for additional ways to detect the main content, super interesting!
[+] gradientDissent|1 year ago|reply
Nice work. Main-content extraction based on the <main> tag won't work on most web pages these days. Arc90 could help.
[+] nvartolomei|1 year ago|reply
While writing a tool for myself to summarise the top N posts daily from HN, Google Trends, and RSS feed subscriptions, I had the same problem.

The quick solution was to use beautiful soup and readability-lxml to try and get the main article contents and then send it to an LLM.

The results are ok when the markup is semantic. Often it is not. Then you have tables, images, weirdly positioned footnotes, etc.

I believe the best way to extract information the way it was intended to be presented is to screenshot the page and send it to a multimodal LLM for “interpretation”. Anyone experimented with that approach?

——

The aspirational goal for the tool is to be the Presidential Daily Brief, but for everyone.

[+] KolenCh|1 year ago|reply
I am curious how it would compare to using pandoc with readability algorithm for example.
[+] leroman|1 year ago|reply
Lumped this together with the side-by-side comparison task.. so will look into it :)
[+] alexliu518|1 year ago|reply
Converting web pages to Markdown is a common requirement. I have found that Turndown does a good job, but it cannot handle all dynamic web page content. As far as I know, if you need to process dynamic web pages, you need targeted adaptation, such as Chrome extensions like Web2Markdown.
[+] throwthrowuknow|1 year ago|reply
Thank you! I’m always looking for new options to use for archiving and ingesting web pages and this looks great! Even better that it’s an npm package!
[+] jejeyyy77|1 year ago|reply
hah, out of curiosity, what are you archiving and ingesting webpages for?
[+] nbbaier|1 year ago|reply
This is really cool! Any plans to add Deno support? This would be a great fit for environments like val.town[0], but they are based on a Deno runtime and I don't think this will work out of the box.

Also, when trying to run the node example from your readme, I had to use `new dom.window.DOMParser()` instead of `dom.window.DOMParser`

[0]: https://val.town

[+] leroman|1 year ago|reply
Afraid to say that, other than bumping into a talk about Deno, I haven't played around with it yet.. So thanks for the heads up, will look into it.

Thanks for the bug report !

[+] KolenCh|1 year ago|reply
Has anyone compared performance between HTML input and other formats? I did an informal comparison, and from a few tests the HTML input seems better. I thought markdown input would be more efficient too, but I'd like to see a more systematic comparison to confirm that's the case.
[+] brightvegetable|1 year ago|reply
This is great, I was just in need of something like this. Thanks!
[+] explosion-s|1 year ago|reply
How is this different from any other HTML-to-markdown library, like Showdown or Turndown? Are there any specific features that make it better for LLMs specifically, instead of just converting HTML to MD?
[+] leroman|1 year ago|reply
Will add some side-by-side comparisons soon! The goal is not just to translate HTML to markdown 1:1 but to preserve any semantic information, which is generally not the goal of those tools. Some specific features and examples are in the README, like URL minification and optional main-section detection and extraction (ignoring footer/header stuff).
[+] Layvier|1 year ago|reply
Nice, we have this exact use case for data extraction from scraped webpages. We've been using html-to-md, how does it compare to it?
[+] DevX101|1 year ago|reply
Problem is, with modern websites, everything is a div and you can't necessarily infer semantic meaning from the DOM elements.
[+] leroman|1 year ago|reply
After removing the noise, you can distill the semantic parts wherever possible, like metadata from images, buttons, etc., and see structures like footers, nav, and body emerge. And many times, for the sake of SEO and accessibility, websites do adopt quite a bit of semantic HTML elements and annotations in the respective tags..
[+] goatlover|1 year ago|reply
What happened to using the semantic elements? Did that fall out of favor, or did the push for it get abandoned because popular frameworks just generate divs with semantic classes (hopefully)?
[+] ianbicking|1 year ago|reply
This is a great idea! There's an exceedingly large amount of junk in a typical HTML page that an LLM can't use in any useful way.

A few thoughts:

1. URL Refification [sic] would only save tokens if a link is referred to many times, right? Otherwise it seems best to keep locality of reference. Though to the degree that URLs are opaque to the LLM, I suppose they could be turned into references without any destination in the source at all, and if the LLM refers to a ref link you just look up the real link in the mapping.

2. Several of the suggestions here could be alternate serializations of the AST, but it's not clear to me how abstract the AST is (especially since it's labelled as htmlToMarkdownAST). And now that I look at the source it's kind of abstract but not entirely: https://github.com/romansky/dom-to-semantic-markdown/blob/ma... – when writing code like this I also find keeping the AST fairly abstract also helps with the implementation. (That said, you'll probably still be making something that is Markdown-ish because you'll be preserving only the data Markdown is able to represent.)

3. With a more formal AST you could replace the big switch in https://github.com/romansky/dom-to-semantic-markdown/blob/ma... with a class that can be subclassed to override how particular nodes are serialized.

4. But I can also imagine something where there's a node type like "markdown-literal" and to change the serialization someone could, say, go through and find all the type:"table" nodes and translate them into type:"markdown-literal" and then serialize the result.

5. A more advanced parsing might also turn things like headers into sections, and introduce more of a tree of nodes (I think the AST is flat currently?). I think it's likely that an LLM would follow `<header-name-slug>...</header-name-slug>` better than `# Header Name\n ....` (at least sometimes, as an option).
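The slugged-tag rendering in point 5 could be sketched like this (the tag scheme is hypothetical, not something the library emits):

```typescript
// Render a heading as a slugged XML-ish tag pair instead of "# Heading",
// on the hunch that explicit matching close tags help the model track
// structure better than line-break-based markdown headings.
const slug = (s: string) =>
  s.toLowerCase().trim().replace(/[^a-z0-9]+/g, "-").replace(/^-|-$/g, "");

function sectionize(title: string, body: string): string {
  const tag = slug(title);
  return `<${tag}>\n${body}\n</${tag}>`;
}

console.log(sectionize("Getting Started", "Install via npm."));
// → <getting-started>
//   Install via npm.
//   </getting-started>
```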

6. Even fancier if, running it with some full renderer (not sure what the options are these days), and you start to use getComputedStyle() and heuristics based on bounding boxes and stuff like that to infer even more structure.

7. Another use case that could be useful is to be able to "name" pieces of the document so the LLM can refer to them. The result doesn't have to be valid Markdown, really, just a unique identifier put in the right position. (In a sense this is what URL reification can do, but only for URLs?)

[+] leroman|1 year ago|reply
This is some great feedback, thanks!

1. There are some crazy links with lots of arguments and tracking stuff in them, so they get very long. The reification turns them into a numbered "ref[n]" scheme, where you also get a map of ref[n]->url to do the reverse translation.. It really saves a lot, in my experience. It's also optional, so you can be mindful about when you want to use this feature..
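The ref[n] scheme described above might be sketched like this (function and variable names are illustrative, not the library's actual API):

```typescript
// Swap long markdown link URLs for short numbered refs, keeping a map for
// reverse lookup. Repeated URLs reuse the same ref, which is where the
// token savings compound.
function reifyUrls(markdown: string): { text: string; refs: Map<string, string> } {
  const refs = new Map<string, string>();  // ref key -> original URL
  const byUrl = new Map<string, string>(); // original URL -> ref key (dedup)
  let n = 0;
  const text = markdown.replace(
    /\]\((https?:\/\/[^)\s]+)\)/g,
    (_m: string, url: string) => {
      let key = byUrl.get(url);
      if (!key) {
        key = `ref${++n}`;
        byUrl.set(url, key);
        refs.set(key, url);
      }
      return `](${key})`;
    }
  );
  return { text, refs };
}

const { text, refs } = reifyUrls(
  "[docs](https://example.com/a?utm_source=x&id=123) and [docs again](https://example.com/a?utm_source=x&id=123)"
);
console.log(text);
// → [docs](ref1) and [docs again](ref1)
console.log(refs.get("ref1"));
// → https://example.com/a?utm_source=x&id=123
```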

2. I tried to keep it domain specific (not to reinvent HTML...) so mostly Markdown components and some flexibility to add HTML elements (img, footer etc).

3. Not sure I'm sold on replacing the switch; it's very useful there because of the many fall-through cases.. I find it maintainable, but if you point me to a specific issue there it would help.

4. There are some built-in functions to traverse and modify the AST. It is just JSON at the end of the day, so you could leverage the types and write your own logic to parse it; as long as it conforms to the format, you can always serialize it, as you mentioned..

5. The AST is recursive, so not flat.. Sounds like you want to either write your own AST->Semantic-Markdown implementation or plug into the existing one, so I'll keep this in mind in the future.

6. Sounds cool but out of scope at the moment :)

7. This feature would serve to help with scraping, and kind of point the LLM at some element? Then the part I'm missing is how you would code this in advance.. There could be some metadata tag you could add, and it would be carried through the pipeline and added on the other side to the generated elements in some way..