Show HN: Readability-like API Using Machine Learning

[+] petercooper|15 years ago|reply

Diffbot's stuff is in a different league (but it's a hosted service with a large dataset) but if anyone's vaguely interested in this area, I've been working on a Ruby library that performs some similar features: https://github.com/peterc/pismo

It is currently undergoing the "big rewrite" (which includes some proper classification work rather than shooting in the dark), however, but it's still in daily use on several sites. Hopefully I can learn a few lessons from Diffbot!

I should also point out BoilerPlate - http://code.google.com/p/boilerpipe/ - an interesting Java based content extraction project that's being worked on an by an actual PhD student rather than a dilettante like me ;-) Again, Diffbot's stuff goes a lot further than this but there are lessons to be learned nonetheless.

Last but not least, a paper by the aforementioned PhD student called Boilerplate Detection using Shallow Text Features is available at http://www.l3s.de/~kohlschuetter/boilerplate/

I suspect that there's going to be a lot more work in these areas in the medium term because of the growth of the "e-discovery" market and because the dreams of a consistently marked up "semantic Web" have been washing down the pan for a while now.

[+] itsnotvalid|15 years ago|reply

That is good stuff. How is http://coder.io using these?

[+] alok-g|15 years ago|reply

Are some examples or online trial available for pismo?

[+] modeless|15 years ago|reply

Cool, works better than other services like this I've found. I tried it on Ars Technica's review of the Xoom tablet and it found all 10 pages. It didn't find the embedded video though. Also, all the formatting is stripped which makes it hard to differentiate section headers from content paragraphs, and all the images are in one list to the side, removed from their original context.

What I'd really love to see is a combination of the RSS API and the article API to produce full article RSS feeds for any site.

[+] miket|15 years ago|reply

Right now, the API just returns back the raw text for simplicity's sake, but it would be possible to make an option for returning a bit of HTML structure, which would address the problem of sections, inline images, tables, etc.

The combination of the two APIs is a great idea.

[+] tansey|15 years ago|reply

I really like this.

You guys should really open up a tagging API. As a developer working on a social site, I'd love to be able to auto-tag content that users upload.

[+] miket|15 years ago|reply

Great suggestion, you can use the article API already to do this (providing the URL of the post and the 'tag' parameter), but maybe you are thinking of analyzing other types of content? Or perhaps the ability to POST your own text data instead of just the URL?

[+] quan|15 years ago|reply

It looks like something I would integrate for my current project but your term suggests that the api is only for personal and non-commercial uses.

[+] miket|15 years ago|reply

Not at all, feel free to use it for commercial uses. I'm removing that from the terms.

[+] jranck|15 years ago|reply

This looks great. I'd love to find out more about your API and what type of web scraping techniques you're using. It looks like this is going to be available publicly to developers? What type of usage do you guys allow?

[+] miket|15 years ago|reply

I'd be happy to talk about the visual-based statistical classification technology in a follow-up blog post if there's interest.

[+] aaronkaplan|15 years ago|reply

Article content starts out in a straightforward, easy-to-process form, as created by the reporter/author, in a content management system. Then the CMS chops it up into pages and adds boilerplate for presentation as a web page. Then you expend lots of effort to stick the pages back together and filter the crap back out, to arrive at an approximation of the original. Generally a noisy, imperfect approximation that is less useful for your purposes (indexing, information extraction, etc).

If technical considerations were the only considerations, we would find a way to get at the content directly instead of using this Rube Goldberg mechanism. But of course there are also economic considerations. Content owners don't want to give you unadulterated content for free; their business model requires that ads be served along with it.

Will an arms race develop between scrapers and publishers, similar to the arms race between spammers and spam filters? Will publishers start randomizing their HTML generation, or otherwise making it difficult to separate content from peripheral material?

[+] itsnotvalid|15 years ago|reply

I like the "machine learning" part of the api, but there seems to be no way of improving the learning by giving feedback.

Just testing another article on HN[1] that the tags are pretty far off. I expected iPad 2 and photos/pixels but i got 4G and manufacturing instead[2]. So I am really interested in how the system came up with the right and wrong tags (which I guess sound more important than find the body of the article, as people are making that easier for facebook and others through Open Graph/RDFa/hNews etc.)

[1]: http://daringfireball.net/2011/03/bending_over_backwards

[2]: tags received: Recyclable materials, Battery, 4G, Apple Inc., Rechargeable battery, Walter Mossberg, Technology, Computing, Manufacturing, Technology_Internet

[+] mcfchan|15 years ago|reply

That's a really nifty API. Performance is nice too. Would be interested in knowing more about it.

[+] shantanubala|15 years ago|reply

This is fantastic! How resource-intense is it to run? Machine Learning, depending on the implementation, can be pretty costly from what I understand.

[+] miket|15 years ago|reply

It's fairly CPU intensive. Like many classification-based techniques, much of the computation is in constructing the features. For our case, this means we had to implement most of CSS to get the visual features of every element on the page.

[+] ronnoch|15 years ago|reply

The tagging feature is very impressive.

[+] sqrt17|15 years ago|reply

Duh... for the pages I tried, it always gives me the "No article at this URL".

I'd like to have some kind of boilerplate removal that works well for forum content (e.g. phpBB and related), and boilerpipe (the library that I tried) gives relatively mixed results.

Does anyone know an existing solution for this?

[+] miket|15 years ago|reply

We have a separate API for non-article pages (It's called the Follow API). It's not well documented yet, but you can get an idea for it from this demo: http://www.diffbot.com/mobilizer which will turn any webpage into a mobile version. Try putting in http://techcrunch.com or your forum thread page.

[+] buss|15 years ago|reply

I noticed that this will happen for NYT articles behind their registration-wall.

[+] alexdong|15 years ago|reply

We are using http://purifyr.com/ for this. Pretty happy with the unicode support and 20-50 documents per second speed.

[+] Mamady|15 years ago|reply

Looks good but doesn't work for wikipedia.

Get it working there and you will have a lot more consumers.

[+] miket|15 years ago|reply

I tried a couple wikipedia pages and it seemed to work ok. Can you email me an example?

[+] dfgonzalez|15 years ago|reply

I sent a token request almost a week ago, do you send positive and negative answers on this?

49 comments