This is one of the cases where AI is not needed. There is a very good working algorithm for extracting content from pages; one implementation: https://github.com/buriy/python-readability
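For readers unfamiliar with this family of tools, here is a toy sketch of the general idea: score candidate blocks by class/id hints plus text length, then keep the best one. It is not python-readability's actual algorithm. The patterns, the weights, and the names `BlockScorer` and `best_block` are all assumptions made up for illustration, and it only looks at top-level `<div>` elements.

```python
import re
from html.parser import HTMLParser

# Hypothetical hint patterns, loosely in the spirit of readability-style tools.
NEGATIVE = re.compile(r"comment|footer|sidebar|nav|promo|share", re.I)
POSITIVE = re.compile(r"article|body|content|entry|main|post|text", re.I)


class BlockScorer(HTMLParser):
    """Collects text per top-level <div> and scores it by class/id hints."""

    def __init__(self):
        super().__init__()
        self.blocks = []  # each entry: [hint_score, list_of_text_parts]
        self.depth = 0    # current <div> nesting depth

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth == 0:
            a = dict(attrs)
            hints = f"{a.get('class') or ''} {a.get('id') or ''}"
            score = 0
            if POSITIVE.search(hints):
                score += 25  # arbitrary illustrative weight
            if NEGATIVE.search(hints):
                score -= 25
            self.blocks.append([score, []])
        self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Text inside nested tags is credited to the enclosing top-level <div>.
        if self.depth > 0 and data.strip():
            self.blocks[-1][1].append(data.strip())


def best_block(html):
    """Return the text of the highest-scoring top-level <div>, or ''."""
    parser = BlockScorer()
    parser.feed(html)
    parser.close()
    scored = []
    for hint, parts in parser.blocks:
        text = " ".join(parts)
        # One extra point per 100 characters of text.
        scored.append((hint + len(text) // 100, text))
    return max(scored, key=lambda s: s[0])[1] if scored else ""
```

Real implementations do much more (link density, sibling merging, cleanup passes), so treat this only as a picture of the kind of heuristic involved.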

haddr|1 year ago

Some years ago I compared those boilerplate removal tools, and I remember that jusText gave me the best results out of the box (I tried readability and a few other libraries too). I wonder what the state of the art is today?

jot|1 year ago

This is worth having a look at: https://mixmark-io.github.io/turndown/

With some configuration you can get most of the way there.

asadalt|1 year ago

Oh, AI is optional here. I do use readability to clean the HTML before converting to .md.

jot|1 year ago

Last time I tried readability, it worked well with articles but struggled with other kinds of pages. It took away far more content than I wanted it to.

IanCal|1 year ago

How do you achieve the same thing here without AI, using that tool?

chrisweekly|1 year ago

"How do you do it without AI" is a question I (sadly) expect to see more often.

fbdab103|1 year ago

I was honestly expecting it to be mostly black magic, but it looks like the meat of the project is a bunch of (surely hard won) regexes. Nifty.

nyokodo|1 year ago

> I was … expecting it to be mostly black magic, but … the meat of the project is a bunch of … regexes

Wait, regexes are the epitome of black magic. What do you consider black magic?

foundzen|1 year ago

How does it compare to mozilla/readability?

asadm|1 year ago

It uses readability but does some additional stuff, like relinking images to local paths, which I needed.
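A rough sketch of what relinking images to local paths could look like on the Markdown side. This is not the project's code: `relink_images`, the `images` directory, and the regex are assumptions for illustration. It only rewrites the links and returns a URL-to-path mapping; downloading the files and handling name collisions are left out.

```python
import os
import re
from urllib.parse import urlparse

# Matches Markdown images with an http(s) URL: ![alt](https://host/path.png)
IMAGE_LINK = re.compile(r"!\[([^\]]*)\]\((https?://[^)\s]+)\)")


def relink_images(markdown, local_dir="images"):
    """Rewrite remote image URLs to local paths.

    Returns (rewritten_markdown, {remote_url: local_path}).
    """
    mapping = {}

    def replace(match):
        alt, url = match.group(1), match.group(2)
        # Use the last path segment as the file name; query strings are dropped.
        name = os.path.basename(urlparse(url).path) or "image"
        local = mapping.setdefault(url, f"{local_dir}/{name}")
        return f"![{alt}]({local})"

    return IMAGE_LINK.sub(replace, markdown), mapping
```

The returned mapping is what a separate download step would iterate over to fetch each file into `local_dir`.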