Show HN: English and Spanish news articles summarizer algorithm with word clouds

[+] btutal|7 years ago|reply

Good job, man. Have you ever considered taking it a few steps further?

I would really appreciate an application that goes through my RSS feeds and offers me neutral news: only the facts, in summary form, without any commentary.

Imagine you have 15 different news articles from 15 different sources about the same topic, let's say "Microsoft's new Chromium-Edge Browser". Each tech site writes about it from its own perspective. Some say it is quite cool, some say it is just a Chrome clone. I would appreciate a summary of these 15 articles without the additional commentary.

What do you think?

[+] theblackcat1002|7 years ago|reply

I wrote a news website [https://todayheadlines.live/] which shows news on similar topics in clusters. My pipeline collects the article content as well. However, I am not sure how to properly summarize all the perspectives.
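For anyone curious what "news with similar topic in a cluster form" can look like in code, here is a minimal sketch. It is not the pipeline behind the site above: it assumes headlines alone are enough, compares them by Jaccard similarity of their word sets, and uses a made-up threshold of 0.3. The names `tokens`, `jaccard` and `cluster_titles` are hypothetical.

```python
import re

# Tiny illustrative stopword list (an assumption, not a real linguistic resource).
STOPWORDS = {"a", "an", "the", "of", "to", "in", "on", "is", "and", "for", "its", "with"}


def tokens(title):
    """Lowercased word set of a headline, with common stopwords removed."""
    return {w for w in re.findall(r"[a-z0-9]+", title.lower()) if w not in STOPWORDS}


def jaccard(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_titles(titles, threshold=0.3):
    """Greedy clustering: each title joins the first cluster whose first
    member is similar enough, otherwise it starts a new cluster."""
    clusters = []  # each cluster is a list of (title, token_set) pairs
    for title in titles:
        toks = tokens(title)
        for cluster in clusters:
            if jaccard(toks, cluster[0][1]) >= threshold:
                cluster.append((title, toks))
                break
        else:
            clusters.append([(title, toks)])
    return [[title for title, _ in cluster] for cluster in clusters]


headlines = [
    "Microsoft announces Chromium-based Edge browser",
    "New Edge browser from Microsoft is based on Chromium",
    "NASA launches new Mars rover",
]
clusters = cluster_titles(headlines)
```

With these three headlines the two Edge stories end up in one cluster and the NASA story in another. A real system would probably compare article bodies and use something sturdier than a greedy pass, but the shape is the same.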

[+] Agent_Phantom|7 years ago|reply

I actually really like your idea and it can be implemented very quickly.

This project currently runs on the subreddit of my country and the users have liked it a lot. The summaries often remove the bias and keep the facts.

I can make a subproject that will load the URLs from an RSS feed and create shorter summaries. Thankfully I could reuse about 90% of the codebase.
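A rough sketch of what loading the URLs from an RSS feed could look like, using only the standard library. This is not the project's actual code: `links_from_rss` and `summarize_feed` are hypothetical names, the feed is an inline sample rather than a real one, and the fetching and summarizing steps are passed in as callables so any existing summarizer could be plugged in.

```python
import xml.etree.ElementTree as ET


def links_from_rss(rss_xml):
    """Return the <link> text of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    links = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link:
            links.append(link.strip())
    return links


def summarize_feed(rss_xml, fetch_text, summarize):
    """Map each article URL in the feed to its summary.

    fetch_text(url) -> str and summarize(text) -> str are supplied by the
    caller; both are placeholders for whatever the real project uses.
    """
    return {url: summarize(fetch_text(url)) for url in links_from_rss(rss_xml)}


# Inline sample feed so the sketch runs without network access.
SAMPLE_RSS = """<rss version="2.0"><channel>
  <title>Example feed</title>
  <item><title>First</title><link>https://example.com/a</link></item>
  <item><title>Second</title><link>https://example.com/b</link></item>
</channel></rss>"""

urls = links_from_rss(SAMPLE_RSS)
```

Keeping the feed parsing separate from fetching and summarizing is what makes reusing most of an existing codebase plausible: only the first function is new.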

[+] guybedo|7 years ago|reply

Quite an interesting idea, it's on my to-do list for Aktu, an RSS reader / news aggregator I built (https://aktu.io/about). For now Aktu groups together the articles in your feeds that are about the same story, so you can easily check other sources' perspectives. But it is missing the summary/facts part.

[+] gandhium|7 years ago|reply

[deleted]

[+] giancarlostoro|7 years ago|reply

Man, I thought word clouds were gone. I remember the word cloud craze in the mid-to-late 2000s, then they sorta vanished. I guess other SEO enhancements replaced them?

[+] Agent_Phantom|7 years ago|reply

Indeed, they are not as popular now but I thought they looked cool.

Fortunately the effort to implement them was very low since I reused an internal variable.

[+] Theodores|7 years ago|reply

This is really good. Right now I am doing a bit of sentence parsing myself and I appreciate the time you have put into documenting your algorithms as well as the tools used.

I am interested in a few metrics, such as sentence length, to flag run-on sentences that do not make good advertising copy. After reading your article I am wondering what else I need to be doing, since I am working at the word level.

I remember Microsoft Word had tools in it to gauge reading level. Do you know if there is a convenient library for that in the Python world? I am not using Python myself, but there is a difference between a tabloid and a broadsheet, so maybe you could put that into the mix.
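As a sketch of the sentence-length idea, the snippet below splits text on sentence-ending punctuation and flags anything over a word limit. The 25-word limit and the function names are assumptions for illustration, and the regex splitter is naive (abbreviations like "e.g." will confuse it).

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def flag_long_sentences(text, max_words=25):
    """Return (word_count, sentence) for every sentence longer than max_words."""
    flagged = []
    for sentence in split_sentences(text):
        n = len(sentence.split())
        if n > max_words:
            flagged.append((n, sentence))
    return flagged


copy = "Buy now. " + "word " * 30 + "end."
flagged = flag_long_sentences(copy)
```

Here the short "Buy now." passes and the 31-word sentence is flagged. The same loop is an easy place to hang other per-sentence metrics later.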

[+] Agent_Phantom|7 years ago|reply

I think this library does exactly what you need:

https://github.com/shivam5992/textstat
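To give an idea of what a reading-level score involves, here is a dependency-free sketch of the Flesch Reading Ease formula, one of the classic readability metrics. It does not use the library linked above. The syllable counter is a crude vowel-group heuristic, so treat the numbers as approximate. Higher scores mean easier text.

```python
import re


def count_syllables(word):
    """Rough heuristic: count vowel groups, dropping a likely silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_reading_ease(text):
    """206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text needs at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (
        206.835
        - 1.015 * (len(words) / len(sentences))
        - 84.6 * (syllables / len(words))
    )


tabloid = "The cat sat on the mat. The dog ran."
broadsheet = (
    "Institutional considerations necessitate comprehensive "
    "organizational restructuring immediately."
)
easy_score = flesch_reading_ease(tabloid)
hard_score = flesch_reading_ease(broadsheet)
```

The short, monosyllabic sample scores far higher than the polysyllabic one, which is the tabloid versus broadsheet difference in miniature.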

[+] Agent_Phantom|7 years ago|reply

At the following URL you can see the final result after an article is processed:

https://www.reddit.com/user/huachibot/comments/

The good part is that the summary algorithm is independent from the bot logic. It processes the text no matter how you obtained it.
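To illustrate the separation, here is a sketch of a summarizer that only ever sees plain text. It is not the project's algorithm, just a common frequency-based approach: score each word by how often it appears, score each sentence by the sum of its word scores, and keep the top sentences in their original order. The stopword list is a tiny made-up sample and all names are hypothetical. The same word counts could feed a word cloud.

```python
import re
from collections import Counter

# Very small English/Spanish stopword sample (an assumption for the sketch).
STOPWORDS = {"the", "and", "was", "that", "with", "los", "las", "que", "una", "por"}
WORD_RE = r"[a-záéíóúñü]+"


def word_scores(text):
    """Frequency of each non-stopword longer than two letters."""
    words = re.findall(WORD_RE, text.lower())
    return Counter(w for w in words if w not in STOPWORDS and len(w) > 2)


def summarize(text, max_sentences=2):
    """Return (top sentences in original order, word scores)."""
    scores = word_scores(text)
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    def sentence_score(i):
        return sum(scores[w] for w in re.findall(WORD_RE, sentences[i].lower()))

    ranked = sorted(range(len(sentences)), key=sentence_score, reverse=True)
    keep = sorted(ranked[:max_sentences])
    return [sentences[i] for i in keep], scores


article = (
    "Edge moves to Chromium. "
    "Microsoft says Edge on Chromium improves compatibility. "
    "The weather was nice."
)
summary, scores = summarize(article, max_sentences=1)
```

Because `summarize` takes a string and nothing else, it does not matter whether the text came from a Reddit bot, an RSS reader or a file on disk. Note that summing word scores favors long sentences, so a real implementation may want to normalize by length.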

11 comments