Full Text RSS Feed: Get the whole feed and nothing but the feed

[+] nicpottier|15 years ago|reply

We built a backend similar to this for our NewsRoom mobile client (Android and Pre). We actually used some genetic algorithms to do the training for our content extraction; it was one of the more fun projects I've done.

Word of warning: if it takes off, you basically turn into someone who is both caching and harvesting the web every 15 minutes. There is an incredibly long tail on RSS feeds, and it starts killing you to keep them all up to date. Storing and serving it is no big deal, but harvesting actually turns into real money when you figure out total bandwidth used. (We harvest about 30,000 feeds every 15 minutes.)
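To make the bandwidth point concrete, here is a rough back-of-the-envelope sketch. The 30,000 feeds and the 15-minute interval come from the comment above; the 50 KB average feed size is purely an assumed figure for illustration, and the function names are hypothetical.

```python
# Back-of-the-envelope estimate of harvesting volume.
# avg_feed_bytes is an assumed value for illustration, not a measured one.


def fetches_per_day(feed_count: int, interval_minutes: int) -> int:
    """Downloads per day when every feed is polled on a fixed interval."""
    polls_per_day = (24 * 60) // interval_minutes
    return feed_count * polls_per_day


def bytes_per_day(feed_count: int, interval_minutes: int, avg_feed_bytes: int) -> int:
    """Total bytes downloaded per day, ignoring HTTP overhead and caching."""
    return fetches_per_day(feed_count, interval_minutes) * avg_feed_bytes


if __name__ == "__main__":
    fetches = fetches_per_day(30_000, 15)
    daily = bytes_per_day(30_000, 15, 50_000)
    print(f"{fetches:,} fetches/day, about {daily / 1e9:.0f} GB/day")
```

Under that assumed feed size, the total comes out to a few million fetches per day. Anything that skips unchanged feeds (for example HTTP conditional requests) would lower the real number.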

[+] tianyicui|15 years ago|reply

I really hope Google will one day sell a page-harvesting service. It could certainly profit, since the cost is negligible. But how big is the market?

[+] lurchpop|15 years ago|reply

How do you handle scale like that? Do you have a Hadoop cluster or something? How many concurrent downloads do you run?

[+] geuis|15 years ago|reply

Could you darken up on the grey a bit? Grey on white on grey isn't exactly easy to read.

[+] jonkelly|15 years ago|reply

Could be a way to slow down the feed owners' lawyers a bit? ;-)

[+] aaroneous|15 years ago|reply

Is that better?

[+] unknown|15 years ago|reply

[deleted]

[+] timrosenblatt|15 years ago|reply

how's it look now?

[+] timrosenblatt|15 years ago|reply

[deleted]

[+] grayrest|15 years ago|reply

What's he using to pull out the articles? I had a hacky version set up using the Readability algorithm but never bothered to make it public.

[+] yesimahuman|15 years ago|reply

Boilerpipe is by far the best tool for this that I've ever found (http://code.google.com/p/boilerpipe/). I'd be interested to hear if he is using something better, but I'd be surprised if he is.

I think this is a great idea and very similar to a lot of stuff I have worked on recently. It's cool to see so much interest in these text-related services.

[+] beagledude|15 years ago|reply

Goose article extractor has a full suite of unit tests and also does pure text and image extractions: https://github.com/jiminoc/goose

[+] k1m|15 years ago|reply

Possible to do it with Readability - my PHP port is here: http://www.keyvan.net/2010/08/php-readability/ - and a similar tool (free software, can be self-hosted) using the PHP Readability is here: http://fivefilters.org/content-only/

[+] spidaman|15 years ago|reply

Does this work for anybody? I've plugged in 3 feeds, one was "unable to retrieve full-text content" (an sfgate.com feed) and the other two returned nothing at all in the preview (one a feed from kqed.org, the other an older wordpress blog).

[+] guptaneil|15 years ago|reply

The preview for Lifehacker returned nothing at all, but adding the feed to Google Reader worked as advertised. I guess, don't rely on the preview box.

[+] ericgs|15 years ago|reply

Funny, I just added sfgate too and it worked for me: fulltextrssfeed.com/www.sfgate.com/rss/feeds/news.xml

[+] timrosenblatt|15 years ago|reply

works for me. did you test out the default CNN feed? does that work?

[+] clvv|15 years ago|reply

A similar service: http://fivefilters.org/content-only/ and it is open source too. It uses a PHP version of Readability to extract the full content. Also, can the author of fulltextrssfeed.com explain some of the implementation details? I was planning a similar project with node.js, jsdom and Readability.

[+] hokkos|15 years ago|reply

Is it legal? Can you legally copy all the content of a site and publish it while stripping the ads?

I've thought about this idea for 2 years, but I am so ineffective at building my own ideas that it doesn't surprise me that someone else built it, as the idea has really been floating around more and more since Instapaper Mobilizer.

On the legal side, I had another idea: hide behind the DMCA takedown process and provide an email address for requesting that a feed be taken down. But don't map www.example.com/feed.xml directly to http://fulltextrssfeed.com/www.example.com/feed.xml ; use an alias instead, so a takedown removes just that alias and not the whole *.example.com.
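A minimal sketch of the alias idea, assuming an in-memory table and random opaque tokens. The class and method names are hypothetical and nothing here reflects how fulltextrssfeed.com actually works.

```python
import secrets
from typing import Optional


class FeedAliases:
    """Maps opaque aliases to source feed URLs (hypothetical helper)."""

    def __init__(self) -> None:
        self._by_alias = {}  # alias -> feed URL

    def register(self, feed_url: str) -> str:
        """Create a new random alias for a feed and return it."""
        alias = secrets.token_urlsafe(8)
        self._by_alias[alias] = feed_url
        return alias

    def resolve(self, alias: str) -> Optional[str]:
        """Return the feed URL behind an alias, or None if it is unknown."""
        return self._by_alias.get(alias)

    def take_down(self, alias: str) -> bool:
        """Remove a single alias; other feeds on the same host are untouched."""
        return self._by_alias.pop(alias, None) is not None
```

Because the public URL carries only the alias, removing one entry leaves every other feed from the same domain reachable.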

[+] swombat|15 years ago|reply

Immediate swap of current PG essays feed for:

http://fulltextrssfeed.com/www.aaronsw.com/2002/feeds/pgessa...

[+] yahelc|15 years ago|reply

Considering the impending lawyer-takedown, it would be great if this was made open source, so people can implement their own local versions on their own servers.

[+] yagibear|15 years ago|reply

Could you also do the opposite: Take bulky feeds (e.g. http://feeds.feedburner.com/tedblog) and truncate them; showing title & first para & include a link? I use RSS primarily to scan what is available and mark some for later reading, and bulky feeds interrupt the scanning process.
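A rough sketch of that truncation, assuming each feed item is already available as a title, a link and an HTML body. The names `FirstParagraph` and `truncate_item` are made up for illustration; only the standard-library `html.parser` module is used.

```python
from html.parser import HTMLParser


class FirstParagraph(HTMLParser):
    """Collects the text of the first <p> element and ignores the rest."""

    def __init__(self):
        super().__init__()
        self._in_p = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag != "p" or self._done:
            return
        if self._in_p:
            # A new <p> implicitly closes an unclosed previous one.
            self._in_p = False
            self._done = True
        else:
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self._in_p = False
            self._done = True

    def handle_data(self, data):
        if self._in_p:
            self._parts.append(data)

    @property
    def text(self):
        # Collapse runs of whitespace into single spaces.
        return " ".join("".join(self._parts).split())


def truncate_item(title, link, html_body):
    """Reduce a bulky item to its title, first paragraph and link."""
    parser = FirstParagraph()
    parser.feed(html_body)
    parser.close()
    return {"title": title, "summary": parser.text, "link": link}
```

Bodies that contain no `<p>` element at all would yield an empty summary here, so a real service would probably want a fallback such as a character limit.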

[+] baddox|15 years ago|reply

This would probably be trivial with Yahoo! Pipes.

http://pipes.yahoo.com/pipes/

[+] timrosenblatt|15 years ago|reply

that's a good idea. I will bring it up to him.

[+] cvandyck76|15 years ago|reply

Is there an argument to be made that the content providers only get 'paid' if the RSS reader is enticed to click through to the site? I'm all for neat services, but I think that this is a little bit unfair to the other party.

[+] tuhin|15 years ago|reply

Not trying to be the show stopper here, but this is illegal, right? I mean, news sites like Reuters especially do create a fuss when this is done. Is that (legal drama) only for commercial projects, or otherwise too?

[+] netmau5|15 years ago|reply

Nice, this will come in very useful for an RSS-based project I'm working on too. Hopefully I won't slam your servers too hard. Are you considering making the source available?

[+] aaroneous|15 years ago|reply

I wasn't expecting much interest in it, but I'd be happy to clean it and package it up if you guys want to play.

[+] timrosenblatt|15 years ago|reply

it's not mine, it's a project a friend threw together over the weekend. It's on a shared host, but I'm trying to help light the server on fire so he puts it on something more heavy duty. :)

[+] pak|15 years ago|reply

This is nice but what's the difference from ViewText (http://www.viewtext.org)? ViewText has a JSONP API, which made it perfect for building into a recent little project I did (it was a web app). Plus, it's been around for a lot longer.

[+] ericgs|15 years ago|reply

Just tried viewtext on lifehacker's feed and got:

"We understand you'd like to delete your account. If you delete your account all of your information including your comments, messages, posts, and friends and followers associations will be removed from our system. Please consider the following options before clicking delete."

Yikes! =X

[+] unknown|15 years ago|reply

[deleted]

[+] AdamGibbins|15 years ago|reply

Excellent thanks, shall be applying this to all my Gawker feeds.

[+] yahelc|15 years ago|reply

You know they make a full-feed version available of all their sites, right? It's just of the form http://lifehacker.com/vip.xml

[+] gnosis|15 years ago|reply

##sigh##

Yet another service that requires me to hand over information on what I read.

Why couldn't this be made as a privacy-respecting application I can run from my own machine?

[+] roll|15 years ago|reply

interesting project. I was doing a similar thing with yahoo pipes, but it got blocked because of robots.txt. What do you do about it?

[+] shadowpwner|15 years ago|reply

You can always disregard robots.txt.. ;)
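For anyone who would rather honor robots.txt than disregard it, Python's standard library can do the check. This is only a sketch: the helper name `allowed_by_robots` and the user-agent string are assumptions, and the robots.txt text is passed in directly so the example needs no network access.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = "User-agent: *\nDisallow: /private/\n"
    # "ExampleFeedBot" is a placeholder user agent, not a real crawler name.
    print(allowed_by_robots(rules, "ExampleFeedBot", "http://www.example.com/feed.xml"))
    print(allowed_by_robots(rules, "ExampleFeedBot", "http://www.example.com/private/feed.xml"))
```

In a live harvester, `RobotFileParser.set_url()` followed by `read()` would download the site's robots.txt instead of taking it as a string.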

[+] palak55|15 years ago|reply

The same service is offered by www.getrss.in, and I have been a happy client of theirs for more than 8 months.

[+] palak55|15 years ago|reply

http://www.getrss.in

[+] timrosenblatt|15 years ago|reply

Nice. Grabs the whole article text so you don't have to leave the RSS reader.

[+] timrosenblatt|15 years ago|reply

Lol: http://fulltextrssfeed.com/news.ycombinator.com/rss

Works well for keeping up with HN too :)

[+] dholowiski|15 years ago|reply

The content thieves will love it too. This makes it much easier to automatically copy content.

[+] adrianwaj|15 years ago|reply

Is there a time delay between the source feed and the full feed?

72 comments