Diffbot's stuff is in a different league (it's a hosted service with a large dataset), but if anyone's vaguely interested in this area, I've been working on a Ruby library that offers some similar features: https://github.com/peterc/pismo
It is currently undergoing the "big rewrite" (which includes some proper classification work rather than shooting in the dark), but it's still in daily use on several sites. Hopefully I can learn a few lessons from Diffbot!
I should also point out boilerpipe - http://code.google.com/p/boilerpipe/ - an interesting Java-based content extraction project that's being worked on by an actual PhD student rather than a dilettante like me ;-) Again, Diffbot's stuff goes a lot further than this, but there are lessons to be learned nonetheless.
Last but not least, a paper by the aforementioned PhD student, "Boilerplate Detection Using Shallow Text Features", is available at http://www.l3s.de/~kohlschuetter/boilerplate/
I suspect there's going to be a lot more work in these areas in the medium term, both because of the growth of the "e-discovery" market and because the dreams of a consistently marked-up "semantic Web" have been going down the pan for a while now.
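For a flavor of the "shallow text features" idea in the paper mentioned above, here is a toy sketch: classify each text block as content or boilerplate using only cheap features like word count and link density. The feature names and threshold values are my own illustration, not taken from the paper.

```python
# Toy sketch of shallow-feature boilerplate detection: long blocks with
# few links tend to be article text; short, link-heavy blocks tend to be
# navigation or chrome. Thresholds here are made up for illustration.

def shallow_features(text, linked_chars):
    """Compute a few cheap features for one text block.
    linked_chars: number of characters inside <a> tags in this block."""
    words = text.split()
    return {
        "num_words": len(words),
        "avg_word_len": sum(len(w) for w in words) / max(len(words), 1),
        "link_density": linked_chars / max(len(text), 1),
    }

def looks_like_content(features):
    return features["num_words"] > 15 and features["link_density"] < 0.33

article = "The quick brown fox jumped over the lazy dog and kept going " * 3
print(looks_like_content(shallow_features(article, linked_chars=0)))   # True
nav = "Home | About | Contact"
print(looks_like_content(shallow_features(nav, linked_chars=20)))      # False
```

The real paper trains a decision-tree classifier over features like these rather than hand-picking thresholds, but the inputs are similarly cheap to compute.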
Cool, this works better than other services like this that I've found. I tried it on Ars Technica's review of the Xoom tablet and it found all 10 pages. It didn't find the embedded video, though. Also, all the formatting is stripped, which makes it hard to differentiate section headers from content paragraphs, and all the images are in one list to the side, removed from their original context.
What I'd really love to see is a combination of the RSS API and the article API to produce full article RSS feeds for any site.
Right now the API just returns the raw text for simplicity's sake, but it would be possible to add an option for returning a bit of HTML structure, which would address the problem of sections, inline images, tables, etc.
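Purely as a sketch of how such an option might surface to callers: a single extra query parameter on the article endpoint. The endpoint path and the 'format' parameter below are my invention for illustration, not documented Diffbot behavior.

```python
# Hypothetical client-side sketch: build an article API request URL with
# an optional (invented) format=html parameter that would preserve
# headings, inline images, tables, etc. instead of returning plain text.
from urllib.parse import urlencode

def article_request_url(token, page_url, want_html=False):
    params = {"token": token, "url": page_url}
    if want_html:
        params["format"] = "html"  # hypothetical parameter, see note above
    return "http://www.diffbot.com/api/article?" + urlencode(params)

print(article_request_url("DEMO_TOKEN",
                          "http://arstechnica.com/gadgets/2011/03/xoom/",
                          want_html=True))
```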
Great suggestion. You can already use the article API to do this (by providing the URL of the post and the 'tag' parameter), but maybe you're thinking of analyzing other types of content? Or perhaps the ability to POST your own text data instead of just a URL?
This looks great. I'd love to find out more about your API and what type of web scraping techniques you're using. It looks like this is going to be available publicly to developers? What type of usage do you guys allow?
Article content starts out in a straightforward, easy-to-process form, as created by the reporter/author, in a content management system. Then the CMS chops it up into pages and adds boilerplate for presentation as a web page. Then you expend lots of effort to stick the pages back together and filter the crap back out, arriving at an approximation of the original: generally a noisy, imperfect approximation that is less useful for your purposes (indexing, information extraction, etc.).
If technical considerations were the only considerations, we would find a way to get at the content directly instead of using this Rube Goldberg mechanism. But of course there are also economic considerations. Content owners don't want to give you unadulterated content for free; their business model requires that ads be served along with it.
Will an arms race develop between scrapers and publishers, similar to the arms race between spammers and spam filters? Will publishers start randomizing their HTML generation, or otherwise making it difficult to separate content from peripheral material?
I like the "machine learning" part of the API, but there seems to be no way of improving the learning by giving feedback.
Just tested another article linked from HN[1], and the tags are pretty far off. I expected iPad 2 and photos/pixels, but I got 4G and Manufacturing instead[2]. So I'm really interested in how the system comes up with the right and wrong tags (which I guess sounds more important than finding the body of the article, since people are making that easier for Facebook and others through Open Graph/RDFa/hNews, etc.)
It's fairly CPU intensive. Like many classification-based techniques, much of the computation is in constructing the features. For our case, this means we had to implement most of CSS to get the visual features of every element on the page.
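To illustrate what "visual features of every element" might mean (this is my own illustrative sketch, not Diffbot's actual code): once you have the computed style of an element, turning it into a classifier input is cheap; the CPU cost is in implementing enough of CSS to compute those styles in the first place.

```python
# Illustrative sketch: map the computed CSS style of one page element to
# a numeric feature vector for a content/boilerplate classifier. The
# choice of features here is hypothetical.

def visual_features(style):
    """style: dict of computed CSS properties for one element."""
    return [
        float(style.get("font-size", 16)),                  # text size in px
        1.0 if style.get("font-weight") == "bold" else 0.0, # emphasis
        float(style.get("width", 0)) * float(style.get("height", 0)),  # area
        float(style.get("top", 0)),                         # vertical position
    ]

headline = {"font-size": 32, "font-weight": "bold",
            "width": 600, "height": 40, "top": 120}
print(visual_features(headline))  # [32.0, 1.0, 24000.0, 120.0]
```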
Duh... for the pages I tried, it always gives me the "No article at this URL" message.
I'd like to have some kind of boilerplate removal that works well for forum content (e.g. phpBB and related), and boilerpipe (the library that I tried) gives relatively mixed results.
We have a separate API for non-article pages (it's called the Follow API). It's not well documented yet, but you can get an idea of it from this demo: http://www.diffbot.com/mobilizer, which will turn any webpage into a mobile version. Try putting in http://techcrunch.com or your forum thread page.
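On the forum-boilerplate question above, one simple idea (a toy approach of my own, not how boilerpipe or Diffbot work) exploits how template-heavy forum software is: text lines that recur verbatim across many pages of the same site are probably template chrome (navigation, signatures, "Powered by" footers), while rare lines are probably post content.

```python
# Toy template removal for forum pages: drop any line of text that
# appears on more than max_fraction of the sampled pages from one site.
from collections import Counter

def strip_recurring_lines(pages, max_fraction=0.5):
    # Count each distinct line once per page it appears on.
    counts = Counter(line for page in pages for line in set(page.splitlines()))
    threshold = max_fraction * len(pages)
    return [
        "\n".join(l for l in page.splitlines() if counts[l] <= threshold)
        for page in pages
    ]

pages = [
    "MyForum - Home | Login\nGreat question, try boilerpipe.\nPowered by phpBB",
    "MyForum - Home | Login\nI wrote my own extractor in Ruby.\nPowered by phpBB",
    "MyForum - Home | Login\nDiffbot handles this well.\nPowered by phpBB",
]
for cleaned in strip_recurring_lines(pages):
    print(cleaned)
```

This needs several pages from the same site before it can tell template from content, which is roughly why single-page extractors give mixed results on forums.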
miket: The combination of the two APIs is a great idea.
tansey: You guys should really open up a tagging API. As a developer working on a social site, I'd love to be able to auto-tag content that users upload.
[1]: http://daringfireball.net/2011/03/bending_over_backwards
[2]: tags received: Recyclable materials, Battery, 4G, Apple Inc., Rechargeable battery, Walter Mossberg, Technology, Computing, Manufacturing, Technology_Internet
sqrt17: Does anyone know an existing solution for this?
Mamady: Get it working there and you will have a lot more consumers.
dfgonzalez: Thanks.