Treesrule14 | 1 year ago

There are a lot of web crawlers whose chief feature is turning a website into Markdown. I don't quite understand what they're doing for me that's useful, since I can just do something like `markdownify(my_html)` or whatever. All this to say I wouldn't find this useful, but clearly people think it's a useful feature as part of an LLM pipeline.

loa_in_|1 year ago

You don't want the footer or navigation in the output. Ideally you want the main content of the page, if it exists. How do you assign header level if they're only differentiated by CSS left-margin in a variety of units? How do you interpret documents that render properly but are hardly correct HTML?
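
To make the first two points concrete, here is a rough sketch of dropping navigation and footer text and preferring the main content when the page marks it up. It uses only Python's standard-library `html.parser`; the `SKIP_TAGS` set, the `MainTextExtractor` class, and the `extract_main_text` helper are hypothetical names made up for illustration, not part of any crawler or of markdownify. It assumes the page uses semantic tags like `<nav>`, `<footer>`, and `<main>`, which many real pages do not.

```python
from html.parser import HTMLParser

# Tags treated as boilerplate. This set is an assumption for illustration,
# not a standard; real pages often need site-specific rules.
SKIP_TAGS = {"nav", "footer", "aside", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collects text outside SKIP_TAGS, tracking what sits inside <main>."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0    # > 0 while inside a boilerplate tag
        self.main_depth = 0    # > 0 while inside <main>
        self.main_chunks = []  # text found inside <main>
        self.all_chunks = []   # all non-boilerplate text

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "main":
            self.main_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == "main" and self.main_depth > 0:
            self.main_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text or self.skip_depth > 0:
            return
        self.all_chunks.append(text)
        if self.main_depth > 0:
            self.main_chunks.append(text)


def extract_main_text(html):
    """Return the text of <main> if it has any, else all non-boilerplate text."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    chunks = parser.main_chunks or parser.all_chunks
    return "\n".join(chunks)


page = (
    "<html><body><nav><a href='/'>Home</a></nav>"
    "<main><h1>Title</h1><p>Body text.</p></main>"
    "<footer>Copyright</footer></body></html>"
)
print(extract_main_text(page))
```

This does nothing for the harder cases: headings that are only distinguished by CSS, or markup that renders fine but is structurally broken (an unclosed `<nav>` here would swallow the rest of the page).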

Treesrule14|1 year ago

Thanks. I guess none of that stuff seemed super useful to cut systematically, but I'm gonna run some tests.