This is one of the cases where AI is not needed. There is an algorithm that works very well for extracting content from pages; one implementation: https://github.com/buriy/python-readability
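To make the idea concrete, here is a rough, stdlib-only sketch of the kind of heuristic such tools build on: split the page into blocks, then drop blocks that are very short or mostly link text. This is not the python-readability algorithm, just a simplified stand-in. The names `BlockCollector` and `extract_main_text`, the tag lists, and the thresholds (`min_words=10`, `max_link_density=0.3`) are all assumptions made up for illustration.

```python
from html.parser import HTMLParser

# Assumed tag sets for this sketch; real tools use more refined rules.
BLOCK_TAGS = {"p", "div", "li", "td", "h1", "h2", "h3",
              "article", "section", "nav", "header", "footer"}
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Split a page into text blocks and count the link text in each block."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # finished blocks as (text, link_chars)
        self._text = []        # text pieces of the block being built
        self._link_chars = 0   # characters of the current block inside <a>
        self._in_link = 0
        self._skip = 0

    def flush(self):
        # Close the current block, normalizing whitespace.
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._text, self._link_chars = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip:
            return  # ignore script and style contents
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())


def extract_main_text(html, min_words=10, max_link_density=0.3):
    """Keep blocks that are long enough and not dominated by links."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser.flush()
    kept = []
    for text, link_chars in parser.blocks:
        if len(text.split()) < min_words:
            continue  # too short: likely a menu item, caption, or footer
        if link_chars / len(text) > max_link_density:
            continue  # mostly link text: likely navigation
        kept.append(text)
    return "\n\n".join(kept)


# Hypothetical sample page, invented for this example.
SAMPLE = """
<html><body>
<nav><a href="/">Home</a> <a href="/about">About</a> <a href="/contact">Contact</a></nav>
<p>Boilerplate removal tools try to keep the article body and drop menus,
footers and other page furniture that surrounds it on most sites.</p>
<footer>Copyright Example Site</footer>
</body></html>
"""

if __name__ == "__main__":
    print(extract_main_text(SAMPLE))
```

On the sample page, the navigation and footer blocks are dropped for being too short and only the paragraph survives. On real pages the thresholds would need tuning, which is where the dedicated libraries earn their keep.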
Some years ago I compared these boilerplate removal tools, and I remember that jusText gave me the best results out of the box (I tried readability and a few other libraries too). I wonder what the state of the art is today?
haddr|1 year ago
jot|1 year ago
With some configuration you can get most of the way there.
asadalt|1 year ago
jot|1 year ago
IanCal|1 year ago
chrisweekly|1 year ago
fbdab103|1 year ago
nyokodo|1 year ago
Wait, regexes are the epitome of black magic. What do you consider black magic?
foundzen|1 year ago
asadm|1 year ago