top | item 30868880

(no title)

I've been working on several web extractors project, so I think I could share some of my findings while working on them. Granted it's been several months since I worked on it so I might be forgetting some things.

There are several open source projects for extracting web contents. However, there are three extractors that I've worked with and give us good result:

- readability.js[1], web extractor by Mozilla that used in Firefox.

- dom-distiller[2], web extractor by Chromium team, written in Java.

- trafilatura[3], Python package by Adrien Barbaresi from BBAW[4].

First, readability.js, as expected is the most famous extractor. It's a single file Javascript library with modest 2,000+ lines of code, released under Apache license. Since it's in JS, you can use it wherever you want, either in web page using `script` tag or by using it in Node project.

Next, DomDistiller is extractor that used in Chromium. It uses Java language with whopping 14,000+ lines of code and can only be used as part of Chromium browser, so you can't exactly use it as standalone library or CLI.

Finally, Trafilatura is a Python package released under GPLv3 license. Created in order to build a text databases[5] for NLP research, it mainly intended for German web pages. However, as development continues, it works really great with other languages. It's a bit slow though compared to Readability.js.

All of those three work in similar way: extract metadata, remove unneeded contents, and finally returns the cleaned up content. Their differences (that I remembered) are:

- In Readability, they insist to make no special rules for any website, while DomDistiller and Trafilatura give a small exception for popular sites like Wikipedia. Thanks to this, if you use Readability.js in Wikipedia pages, it will shows `[edit]` button thorough the extracted content.

- Readability has a small function to detect whether a web page can be converted to reader mode. While it's not really accurate, it's quite convenient to have.

- In DomDistiller, the metadata extraction is more thorough than the others. It supports OpenGraph, Schema.org, and even the old IE Reading View mark up tags.

- Since DomDistiller is only usable within Chromium, it has the advantage to be able to use CSS styling to determine if an element is important or not. If an element is styled to be invisible (e.g. `display: none`) then it will be deemed unimportant. However, according to a research[6] this step is actually doesn't really affect the extraction result.

- DomDistiller also has an experimental feature to find and extract next page in sites that separated its article to several partial pages.

- For Trafilatura, since it was created for collecting web corpus, it main ability is extracting text and the publication date of a web page. For the latter, they've created a Python package named htmldate[7] whose only purpose is to extract the publication or modification date for a web page.

- Trafilatura also has an experimental feature to remove elements that repeated too often. The idea is if the element occured too often, then it's not important to the reader.

I've found benchmark[8] that compare the performance between the extractors, and it said that Trafilatura has the best accuracy compared to the others. However, before you start rushing to use Trafilatura, you should remember that Trafilatura is intended for gathering web corpus, so it's really great for extracting text content, but IIRC is not as good as Readability.js and DomDistiller for extracting a proper article with images and embedded iframes (depending on how you look, it could be a feature though).

By the way, if you are using Go and need to use a web extractor, I already ported all three of them to Go[9][10][11] including their dependencies[12][13], so have fun with it.

[1]: https://github.com/mozilla/readability

[2]: https://github.com/chromium/dom-distiller

[3]: https://github.com/adbar/trafilatura

[4]: https://www.bbaw.de/en/

[5]: https://www.dwds.de/d/k-web

[6]: https://arxiv.org/abs/1811.03661

[7]: https://github.com/adbar/htmldate

[8]: https://github.com/scrapinghub/article-extraction-benchmark

[9]: https://github.com/go-shiori/go-readability

[10]: https://github.com/markusmobius/go-domdistiller

[11]: https://github.com/markusmobius/go-trafilatura

[12]: https://github.com/markusmobius/go-htmldate

[13]: https://github.com/markusmobius/go-dateparser

discuss

No comments yet.